Spark Theory for Data Engineers

Get started with the foundational topics of Spark for Data Engineering.

Tutorial #1

⚡Introduction to Apache Spark

Understand what Apache Spark is, why it is used, and how it works at a high level.

Tutorial #2

⚙️Spark Architecture

Learn how Spark’s Driver, Executors, and Cluster Manager work together to execute distributed jobs efficiently.

Tutorial #3

🔄Transformations & Actions

Understand how Spark transformations build lazy execution plans and how actions trigger job execution.

Tutorial #4

🧱Resilient Distributed Dataset

Understand what RDDs are and how they work under the hood.

Tutorial #5

📊DataFrames & Datasets

Learn how DataFrames provide a higher-level abstraction with schema enforcement and optimization.

Tutorial #6

⏳Lazy Evaluation

Learn how Spark optimizes performance by delaying execution until an action is triggered.

Tutorial #7

🚀The Catalyst Optimizer

Understand how the query optimizer transforms logical plans into efficient physical execution plans.

Tutorial #8

⚙️Jobs, Stages, and Tasks

Understand how Spark breaks a job into stages and tasks and how they run together to produce a result.

Tutorial #9

🔀Join Strategies

Understand how Spark's join strategies work and when each one is used to improve join performance.

Tutorial #10

🔄Adaptive Query Execution (AQE)

Understand how Spark's AQE re-optimizes query plans during execution using runtime statistics.

Tutorial #11

📄Common File Formats

Understand how the file formats commonly used with Spark work and when to use each one.

Tutorial #12

🗂️Partitioning and Bucketing

Understand how Spark's partitioning and bucketing work and how they are used to optimize data storage and retrieval.

Tutorial #13

🔄Repartition and Coalesce

Understand how Spark's repartition and coalesce work and how they are used to optimize data pipelines.

Tutorial #14

🖥️Executor Memory

Understand how Spark's executor memory is organized and how it affects job performance.

Tutorial #15

✂️Dynamic Partition Pruning

Understand how Spark's dynamic partition pruning works and how it is used to optimize data reads.

Tutorial #16

⚙️Dynamic Resource Allocation

Understand how Spark's dynamic resource allocation works and how it scales executors up and down with the workload.

Tutorial #17

➕More Topics Coming Soon

Additional Spark concepts, internals, best practices, and deep dives will be added soon.
