Sunday 22 January 2023

How a Spark job runs behind the scenes

  1. When an action is triggered, the driver program builds a DAG of the job's transformations, splits it into stages at shuffle boundaries, and divides each stage into tasks, one per partition of the dataset.
  2. The cluster manager launches executors, and the driver schedules tasks onto them; each executor reads the data for its partitions from distributed storage (e.g. HDFS or S3).
  3. Each executor runs its assigned tasks, caching data in memory where possible and writing shuffle output to its local disk.
  4. When a task finishes, the executor reports its status and result back to the driver.
  5. At a shuffle boundary, executors exchange intermediate data directly with one another; the driver only coordinates which tasks run where, it does not relay the data itself.
  6. Finally, the result of the action is either returned to the driver (e.g. a collect()) or written by the executors straight to persistent storage (e.g. a Parquet write), as the sketch after this list illustrates.
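To make the flow concrete, here is a minimal PySpark sketch of a job that goes through all of these steps. The file name events.csv and the columns status and user_id are hypothetical, chosen just for illustration; it assumes pyspark is installed and runs in local mode rather than on a real cluster.

```python
# Minimal PySpark job illustrating the driver/executor flow described above.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The driver process starts here: creating the SparkSession registers the
# application with the cluster manager, which provides executors (steps 1-2).
spark = (SparkSession.builder
         .appName("job-anatomy-demo")
         .master("local[4]")   # local mode: 4 worker threads stand in for executors
         .getOrCreate())

# Transformations only build the DAG on the driver; nothing executes yet.
df = (spark.read.option("header", True).csv("events.csv")  # hypothetical input file
        .filter(F.col("status") == "ok")   # narrow transformation: no shuffle
        .groupBy("user_id")                # shuffle boundary: splits the job into two stages
        .count())

# The action triggers execution: tasks run on the executors (step 3), shuffle
# data moves executor-to-executor (step 5), and only the small aggregated
# result travels back to the driver (steps 4 and 6).
top = df.orderBy(F.col("count").desc()).take(10)
for row in top:
    print(row["user_id"], row["count"])

spark.stop()
```

Had the job ended with df.write.parquet(...) instead of take(), each executor would have written its own partitions to storage directly, and nothing beyond task status would have returned to the driver.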