Search

Sunday, 9 July 2023

How to use multiple DataFrames as tables and perform transformations using SQL in PySpark

 from pyspark.sql import SparkSession

# Create a SparkSession

spark = SparkSession.builder \

    .appName("Multiple DataFrame SQL") \

    .getOrCreate()

# Assume we have three DataFrames: df1, df2, and df3

# Register df1 as a temporary table

df1.createOrReplaceTempView("table1")

# Register df2 as a temporary table

df2.createOrReplaceTempView("table2")

# Register df3 as a temporary table

df3.createOrReplaceTempView("table3")

# Execute SQL queries on the registered tables

result = spark.sql("""

    SELECT t1.col1, t2.col2, t3.col3

    FROM table1 t1

    JOIN table2 t2 ON t1.id = t2.id

    JOIN table3 t3 ON t2.id = t3.id

    WHERE t1.col1 > 10

""")

# Display the result

result.show()

# Perform further transformations or analysis on the result as needed

# Stop the SparkSession

spark.stop()


In the code above, we assume you have three DataFrames: df1, df2, and df3. Each DataFrame is registered as a temporary table using the createOrReplaceTempView() method, assigning them unique table names: table1, table2, and table3.

Next, you can execute SQL queries against these registered tables using the spark.sql() method. In the example, we perform a JOIN operation on the tables and apply a filter condition.

After executing the SQL query, you can work with the result as a new DataFrame (result) and perform further transformations or analysis as needed.

Finally, stop the SparkSession using spark.stop().

Ensure that you have created the DataFrames df1, df2, and df3 before registering them as tables and executing SQL queries against them.