
Sunday, 22 January 2023

How a Spark job runs behind the scenes

  1. The driver program builds a DAG of stages from the job, divides the dataset into partitions, and schedules one task per partition on the executors. 
  2. Each executor allocates its memory and reads the data for its partitions from the cluster (e.g. HDFS or another data source). 
  3. It then runs the tasks assigned to it, writing any intermediate shuffle output to its local disk. 
  4. When a task finishes, the executor reports its status and result back to the driver program. 
  5. Where stages are connected by a shuffle, executors fetch the intermediate output from one another before the next stage's tasks run.
  6. Finally, the driver collects or aggregates the results of the final stage, and the application writes its output to persistent storage.
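The flow above can be sketched in plain Python. This is a conceptual illustration only: `driver` and `run_task` are made-up names, and a thread pool stands in for Spark's executors, which in reality are separate JVM processes, usually on different machines.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(partition):
    # A task computes a partial result for one partition
    # (here, a partial sum), like a Spark task does.
    return sum(partition)

def driver(dataset, num_partitions=4):
    # The "driver" splits the dataset into partitions...
    size = max(1, len(dataset) // num_partitions)
    partitions = [dataset[i:i + size] for i in range(0, len(dataset), size)]
    # ...schedules one task per partition on the "executors"...
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(run_task, partitions))
    # ...and aggregates the task results into the final answer.
    return sum(partial_results)

print(driver(list(range(10))))  # 45
```

Real Spark adds what this sketch omits: data locality, shuffles between stages, and automatic retry of failed tasks.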

How do you increase mappers in Hadoop?

 To increase the number of mappers in Hadoop, you can increase the number of input splits by lowering the mapreduce.input.fileinputformat.split.maxsize configuration parameter: smaller splits mean more splits, and MapReduce launches one mapper per split. Making the splits too small, however, can hurt performance, because every mapper carries scheduling and startup overhead. By default the split size is bounded by the HDFS block size, which is commonly 128 MB.

You can also tune the split size together with the application's memory allocation to optimize its performance.
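For example, assuming a job that uses the generic options parser, the split size can be lowered on the command line; the jar name, class name, and paths below are placeholders:

```shell
# Ask for ~64 MB splits instead of the 128 MB block-size default,
# roughly doubling the number of map tasks on large files.
hadoop jar my-job.jar MyJob \
  -D mapreduce.input.fileinputformat.split.maxsize=67108864 \
  /input/path /output/path
```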


Deprecated property: mapred.min.split.size (replaced by mapreduce.input.fileinputformat.split.minsize)


Scala vs Python (PySpark): which is better?

Which is the better language for Spark programming and data engineering: Scala or Python?

It depends on your specific needs. Both Python and Scala are popular languages for data engineering and Spark programming, but each has its own advantages. Python is often seen as the easier language to learn due to its simple syntax and intuitive design, and its extensive ecosystem of open-source libraries makes it productive for building data applications quickly. On the other hand, Scala is often seen as more powerful and performant than Python due to its statically typed nature and its seamless integration with Java and the JVM. Ultimately, choosing the best language for data engineering depends on your individual requirements and preferences.

Is Scala still relevant?

Yes, Scala is still relevant. Although it has seen a decline in its popularity over the years, Scala is still actively used and maintained by many organizations. Its advantages, such as its ability to seamlessly integrate with Java and its functional programming capabilities, make it a valuable language in data engineering and analytics fields.

Why use PySpark instead of Scala with Spark?

PySpark is the Python API for Apache Spark. Its main advantages are that it lets developers write Spark jobs in Python, gives them access to Python's data-science ecosystem (e.g. pandas and NumPy), and lowers the barrier for teams not familiar with Scala. For DataFrame and SQL workloads, most of the work executes inside the JVM regardless of the front-end language, so the performance gap with Scala is usually small.

Why use Scala with Spark instead of PySpark?

Scala is Spark's native language: Spark itself is written in Scala, so the Scala API gets new features first and avoids the Python-to-JVM serialization overhead that PySpark pays for RDD operations and Python UDFs. Scala also supports functional programming, which can be useful when dealing with large datasets, as it lets developers write concise and composable code. Additionally, since Scala is statically typed, many errors are caught at compile time rather than at runtime on the cluster.

Are data structures for Big Data / Data Engineering interviews easier in Python or Scala?

While both Python and Scala can express the usual interview data structures, many developers find Python slightly easier to work with thanks to its simpler syntax and batteries-included standard library (lists, dicts, sets, heapq, collections.deque). Because Python is dynamically typed, it is also quicker to sketch a solution under time pressure. Scala's static typing and rich collections library can pay off for more complex structures such as trees, graphs, or heaps, where the compiler helps catch mistakes early.
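As a small illustration of the syntax difference, a min-heap in Python is a few lines with the standard-library heapq module (the numbers here are arbitrary sample values):

```python
import heapq

# Arbitrary sample values; heapify builds a min-heap in place in O(n).
nums = [7, 1, 5, 3]
heapq.heapify(nums)

# Popping repeatedly yields the elements in ascending order.
ordered = [heapq.heappop(nums) for _ in range(len(nums))]
print(ordered)  # [1, 3, 5, 7]
```

The Scala equivalent would typically use scala.collection.mutable.PriorityQueue, which is a max-heap by default and needs an inverted Ordering to behave as a min-heap.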

What is the difference between Hive Metastore and MySQL Metastore?

The Hive Metastore is the service that stores the structure of the data in a Hive warehouse: table and column definitions, partitions, statistics, and storage locations. It acts as a central repository for this metadata and exposes it, via a Thrift interface, to Hive itself and to external tools such as Spark or Presto that need to access Hive tables.

The metastore service does not hold this metadata itself; it persists it in a backing relational database. "MySQL Metastore" simply refers to a Hive Metastore configured to use MySQL as that backing database instead of the default embedded Derby, which allows only a single connection at a time. Using MySQL (or another RDBMS such as PostgreSQL) supports multiple concurrent clients and is the usual choice in production.
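Concretely, pointing the metastore at MySQL is a matter of configuring its JDBC connection in hive-site.xml; the host, database name, and credentials below are placeholders:

```xml
<!-- hive-site.xml: use MySQL instead of the embedded Derby database -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
```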