How a Spark job runs behind the scenes
- When an action is triggered, the driver program builds a DAG of stages from the recorded transformations, splits each stage into tasks (one task per partition of the data), and schedules those tasks onto the executors.
- Each executor runs in its own JVM with a fixed memory allocation and, for each task it receives, reads the partition of input data that task is responsible for (from HDFS, object storage, or another source).
- It then executes its tasks; if a stage ends in a shuffle, each task writes its intermediate (shuffle) output to the executor's local disk.
- When a task finishes, the executor reports its status and result back to the driver: the actual data for a result task, or the location of the shuffle output for a shuffle map task.
- The driver does not aggregate and redistribute this intermediate data itself; executors in the next stage fetch shuffle blocks directly from one another, while the driver only coordinates which stage runs next.
- Finally, the output of the action is either returned to the driver (e.g., collect()) or written to persistent storage in parallel by the executors themselves (e.g., saveAsTextFile); a minimal sketch of such a job follows this list.
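
To make the flow above concrete, here is a minimal Scala sketch of a word-count job. The input and output paths are hypothetical, and the comments map the code back to the steps in the list; treat it as an illustration of the execution model, not a production template.

```scala
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // The driver process starts here and owns the SparkContext.
    val spark = SparkSession.builder()
      .appName("word-count-sketch")
      .master("local[4]") // local mode: driver and workers share one JVM, 4 threads
      .getOrCreate()
    val sc = spark.sparkContext

    // Transformations are only recorded into the DAG; nothing runs yet.
    val counts = sc.textFile("hdfs:///data/input.txt") // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // shuffle: stage boundary, map output goes to local disk

    // The action triggers the job: the driver builds the stages,
    // creates one task per partition, and ships tasks to executors,
    // which write the output files in parallel.
    counts.saveAsTextFile("hdfs:///data/output") // hypothetical output path

    spark.stop()
  }
}
```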
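
As a follow-up on the last step, where the output lands depends on the action you call. A spark-shell style snippet (`sc` is the shell's pre-built SparkContext; the output path is hypothetical) contrasting the two paths:

```scala
// spark-shell session: `sc` (SparkContext) is already defined.
val counts = sc.parallelize(Seq("a", "b", "a", "c", "a", "b"), numSlices = 3)
  .map(w => (w, 1))
  .reduceByKey(_ + _) // shuffle: executors exchange blocks directly

// Path 1: a small result is collected back at the driver.
counts.collect().foreach { case (w, n) => println(s"$w: $n") }

// Path 2: a large result is written by the executors in parallel,
// one file per partition (part-00000, part-00001, ...).
counts.saveAsTextFile("/tmp/counts-sketch") // hypothetical output path
```

Note that collect() pulls every row into the driver's memory, so it is only safe for small results; for anything large, prefer an action that lets the executors write their partitions to storage directly.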