Sunday 9 July 2023

To read data from a Hive table using PySpark:

from pyspark.sql import SparkSession

# Create a SparkSession with Hive support
spark = SparkSession.builder \
    .appName("Hive Read") \
    .enableHiveSupport() \
    .getOrCreate()

# Read data from a Hive table
df = spark.sql("SELECT * FROM your_hive_table")

# Display the DataFrame
df.show()

# Perform further transformations or analysis on the DataFrame as needed

# Stop the SparkSession
spark.stop()


In the code above, we create a SparkSession with Hive support by calling .enableHiveSupport() during session creation.

To read data from a Hive table, you can use the spark.sql() method and pass a SQL query to select the desired data from the table. Replace "your_hive_table" in the SQL query with the actual name of your Hive table.

After reading the data, you can perform further transformations or analysis on the resulting DataFrame object df. Finally, stop the SparkSession using spark.stop().

Ensure that your Spark cluster is properly configured to work with Hive, and the necessary Hive metastore configuration is set up correctly.
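If the session does not pick up your Hive configuration automatically (for example from a hive-site.xml on the classpath), the relevant settings can be passed explicitly at session creation. This is a configuration sketch only: the metastore URI and warehouse path below are placeholders you would replace with your own values.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Hive Read")
    # Placeholder metastore address; replace with your own host:port
    .config("hive.metastore.uris", "thrift://metastore-host:9083")
    # Placeholder warehouse location; replace with your cluster's path
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate()
)
```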