How a Spark job runs behind the scenes
- When an action is triggered, the driver program builds a DAG of stages from the recorded transformations, splits each stage into tasks (one task per partition of the data), and schedules those tasks onto the executors.
- Each executor runs in its own JVM with a fixed memory allocation and, for each task it receives, reads the partition of input data that task is responsible for (from HDFS, object storage, or another source).
- It then executes its tasks; if a stage ends in a shuffle, each task writes its intermediate (shuffle) output to the executor's local disk.
- When a task finishes, the executor reports its status and result back to the driver: the actual data for a result task, or the location of the shuffle output for a shuffle map task.
- The driver does not aggregate and redistribute this intermediate data itself; executors in the next stage fetch shuffle blocks directly from one another, while the driver only coordinates which stage runs next.
- Finally, the output of the action is either returned to the driver (e.g., collect()) or written to persistent storage in parallel by the executors themselves (e.g., saveAsTextFile); a minimal sketch of such a job follows this list.
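
To make the flow above concrete, here is a minimal Scala sketch of a word-count job. The input and output paths are hypothetical, and the comments map the code back to the steps in the list; treat it as an illustration of the execution model, not a production template.

```scala
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // The driver process starts here and owns the SparkContext.
    val spark = SparkSession.builder()
      .appName("word-count-sketch")
      .master("local[4]") // local mode: driver and workers share one JVM, 4 threads
      .getOrCreate()
    val sc = spark.sparkContext

    // Transformations are only recorded into the DAG; nothing runs yet.
    val counts = sc.textFile("hdfs:///data/input.txt") // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // shuffle: stage boundary, map output goes to local disk

    // The action triggers the job: the driver builds the stages,
    // creates one task per partition, and ships tasks to executors,
    // which write the output files in parallel.
    counts.saveAsTextFile("hdfs:///data/output") // hypothetical output path

    spark.stop()
  }
}
```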
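
As a follow-up on the last step, where the output lands depends on the action you call. A spark-shell style snippet (`sc` is the shell's pre-built SparkContext; the output path is hypothetical) contrasting the two paths:

```scala
// spark-shell session: `sc` (SparkContext) is already defined.
val counts = sc.parallelize(Seq("a", "b", "a", "c", "a", "b"), numSlices = 3)
  .map(w => (w, 1))
  .reduceByKey(_ + _) // shuffle: executors exchange blocks directly

// Path 1: a small result is collected back at the driver.
counts.collect().foreach { case (w, n) => println(s"$w: $n") }

// Path 2: a large result is written by the executors in parallel,
// one file per partition (part-00000, part-00001, ...).
counts.saveAsTextFile("/tmp/counts-sketch") // hypothetical output path
```

Note that collect() pulls every row into the driver's memory, so it is only safe for small results; for anything large, prefer an action that lets the executors write their partitions to storage directly.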