How Spark Job runs behind the scenes
Sunday, 22 January 2023
- The driver program divides the dataset into partitions and assigns each partition to an executor.
- Each executor initializes its memory and reads the data for its partition from the cluster.
- Each executor then runs the tasks assigned to it, writing any intermediate output to disk.
- When a task finishes, the executor returns its result to the driver program.
- The driver program aggregates the task results and, for multi-stage jobs, makes the intermediate results available to the other nodes in the cluster for the next stage.
- Finally, the driver program collects the output and writes it to persistent storage.
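The driver/executor flow in the steps above can be sketched as a toy model in plain Python. This is only a stand-in for Spark's real scheduler, not Spark API; the function names `run_driver` and `run_executor` are illustrative:

```python
# Toy sketch of the driver/executor flow for a word-count-style job.
# Plain Python standing in for Spark's scheduler; not real Spark APIs.
from concurrent.futures import ThreadPoolExecutor

def run_executor(partition):
    # Each "executor" runs its task on its own partition of the data
    # and returns the result to the "driver".
    return sum(len(record.split()) for record in partition)

def run_driver(dataset, num_partitions=3):
    # The driver divides the dataset into partitions...
    partitions = [dataset[i::num_partitions] for i in range(num_partitions)]
    # ...assigns each partition to an executor...
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = list(pool.map(run_executor, partitions))
    # ...then aggregates the per-task results into the final output.
    return sum(results)

lines = ["hello spark", "spark runs tasks", "on executors"]
print(run_driver(lines))  # prints 7, the total word count across partitions
```

In real Spark the driver builds a DAG of stages and executors exchange shuffle data directly, but the divide / execute / aggregate shape is the same.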
How do you increase mappers in Hadoop?
To increase the number of mappers in Hadoop, you can increase the number of input splits by lowering the mapreduce.input.fileinputformat.split.maxsize configuration parameter: smaller splits mean more of them, and each split gets its own map task. Be careful, though; too many small splits add task-scheduling overhead and can hurt performance. By default the effective split size usually works out to the HDFS block size, which is 128 MB.
You can also tune the minimum split size and the application's memory allocations to optimize its performance.
(The older mapred.min.split.size property is deprecated; use mapreduce.input.fileinputformat.split.minsize instead.)
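The arithmetic behind this is simple. Hadoop's FileInputFormat computes the split size as max(minSize, min(maxSize, blockSize)), and each split becomes one map task. A rough back-of-the-envelope helper (the function name `estimate_mappers` is ours, not a Hadoop API):

```python
# Rough estimate of how many map tasks a file produces, using Hadoop's
# FileInputFormat formula: splitSize = max(minSize, min(maxSize, blockSize)).
import math

def estimate_mappers(file_size, block_size=128 * 1024 * 1024,
                     min_size=1, max_size=None):
    if max_size is None:
        # Hadoop's real default for maxsize is effectively unbounded,
        # so the block size ends up deciding the split size.
        max_size = file_size
    split_size = max(min_size, min(max_size, block_size))
    return math.ceil(file_size / split_size)

one_gib = 1024 ** 3
print(estimate_mappers(one_gib))                         # 8 mappers at 128 MB splits
print(estimate_mappers(one_gib, max_size=64 * 1024**2))  # 16 mappers after lowering maxsize
```

So halving split.maxsize from 128 MB to 64 MB doubles the mapper count for the same input.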
Scala vs Python (PySpark): which is better for Spark programming and data engineering?
It depends on your specific needs. Both Python and Scala are popular languages for data engineering and Spark programming, but each has its own advantages. Python is often seen as the easier language to learn thanks to its simple syntax and intuitive design, and its extensive ecosystem of open-source libraries makes it quick to build data applications with. On the other hand, Scala is often seen as more powerful and performant than Python due to its statically typed nature and its seamless integration with Java. Ultimately, the best language for data engineering depends on your individual requirements and preferences.
Is Scala still relevant?
Yes, Scala is still relevant. Although its popularity has declined over the years, Scala is still actively used and maintained by many organizations. Its advantages, such as seamless interoperability with Java and strong functional programming support, make it a valuable language in the data engineering and analytics fields.
Why use PySpark instead of Scala with Spark?
PySpark is the Python API for Apache Spark. It lets developers write Python code against Spark's distributed engine, and it makes Spark far more approachable for teams that are not familiar with Scala. It also makes it easy to combine Spark pipelines with Python's existing data tooling. Development is often faster in PySpark, even though raw execution speed can lag Scala for code that runs outside the JVM.
Why use Scala with Spark instead of PySpark?
Scala is the language Apache Spark itself is written in, so Scala code runs natively on the JVM alongside Spark's engine and avoids the Python-to-JVM serialization overhead that PySpark can incur. Scala also supports functional programming, which is useful when dealing with large datasets, as it allows developers to write concise and efficient transformations. Additionally, since Scala is a statically typed language, many errors are caught at compile time rather than surfacing as runtime failures in a long job.
Are data structures for Big Data / Data Engineering interviews easier in Python or Scala?
While both Python and Scala have their own advantages when it comes to programming data structures, many developers find Python slightly easier to work with due to its simpler syntax and more intuitive design. Additionally, since Python is dynamically typed, data-structure code can often be written more quickly. However, Scala may be better suited for more complex data structures such as trees, graphs, or heaps, as it can offer more control over the structure and the performance of data manipulation.
What is the difference between Hive Metastore and MySQL Metastore?
The Hive Metastore is the service that stores the structure of the data in a Hive warehouse: table and column definitions, partitions, and statistics. It acts as a central repository for this metadata and also lets external tools that need to read Hive tables discover them.
A "MySQL metastore" is not a separate product; it simply refers to a Hive Metastore whose metadata is persisted in a MySQL database. Out of the box Hive uses an embedded Derby database, which supports only a single connection at a time, so production deployments typically point the metastore at an external relational database such as MySQL or PostgreSQL.
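As a concrete illustration, backing the Hive Metastore with MySQL is done in hive-site.xml by pointing the metastore's JDBC connection at a MySQL database. The host, database name, and credentials below are placeholders:

```
<!-- hive-site.xml: point the Hive Metastore at a MySQL backing database.
     Host, database name, and credentials are placeholders. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>changeme</value>
</property>
```

With this in place, every tool that talks to the metastore sees the same table definitions, while the actual rows of data stay in the warehouse storage (e.g. HDFS), not in MySQL.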