Sunday, 9 July 2023

PySpark code to read data from a CSV file

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("CSV Read").getOrCreate()

# Read the CSV file
df = spark.read.csv("path/to/your/file.csv", header=True, inferSchema=True)

# Display the dataframe
df.show()

# Perform further transformations or analysis on the dataframe as needed

# Stop the SparkSession
spark.stop()

In the code above, we import SparkSession from pyspark.sql, create a SparkSession, and then use the read.csv() method to read the CSV file. The header=True parameter indicates that the first row of the CSV file contains the column names. The inferSchema=True parameter tells Spark to scan the data and infer each column's type; without it, every column is read as a string.

After reading the CSV file, you can perform additional transformations, filtering, or analysis on the DataFrame object df. Finally, you can stop the SparkSession using spark.stop().

Make sure to replace "path/to/your/file.csv" with the actual path to your CSV file.