Spark Theory for Data Engineers
Get started with the foundational topics of Spark for Data Engineering.
Tutorial #1
⚡Introduction to Apache Spark
Understand what Apache Spark is, why it is used, and how it works at a high level.
Tutorial #2
⚙️Spark Architecture
Learn how Spark’s Driver, Executors, and Cluster Manager work together to execute distributed jobs efficiently.
Tutorial #3
🔄Transformations & Actions
Understand how Spark transformations build lazy execution plans and how actions trigger job execution.
Tutorial #4
🧱Resilient Distributed Dataset
Understand what RDDs are and how they work under the hood.
Tutorial #5
📊DataFrames & Datasets
Learn how DataFrames provide a higher-level abstraction with schema enforcement and optimization.
Tutorial #6
⏳Lazy Evaluation
Learn how Spark optimizes performance by delaying execution until an action is triggered.
Tutorial #7
🚀The Catalyst Optimizer
Understand how the query optimizer transforms logical plans into efficient physical execution plans.
Tutorial #8
⚙️Jobs, Stages, and Tasks
Understand how Spark divides a job into stages and tasks, and how these are scheduled and executed.
Tutorial #9
🔀Join Strategies
Understand how Spark's join strategies work and how the choice of strategy affects join performance.
Tutorial #10
🔄Adaptive Query Execution (AQE)
Understand how Spark's AQE re-optimizes query plans during execution using runtime statistics.
Tutorial #11
📄Common File Formats
Understand how the file formats commonly used with Spark work and when to use each.
Tutorial #12
🗂️Partitioning and Bucketing
Understand how Spark's partitioning and bucketing work and how they are used to optimize data storage and retrieval.
Tutorial #13
🔄Repartition and Coalesce
Understand how Spark's repartition and coalesce work and how they are used to optimize data pipelines.
Tutorial #14
🖥️Executor Memory
Understand how Spark's executor memory works and how tuning it affects job performance.
Tutorial #15
✂️Dynamic Partition Pruning
Understand how Spark's dynamic partition pruning works and how it is used to optimize data reads.
Tutorial #16
⚙️Dynamic Resource Allocation
Understand how Spark's dynamic resource allocation works and how it is used to optimize Spark jobs.
Tutorial #17
➕More Topics Coming Soon
Additional Spark concepts, internals, best practices, and deep dives will be added soon.