Search

Sunday, 9 July 2023

How to connect to an Oracle database using PySpark

 from pyspark.sql import SparkSession

# Create a SparkSession

spark = SparkSession.builder \

    .appName("Oracle Connection") \

    .config("spark.jars.packages", "oracle:oracle-jdbc:8.2.0.0") \

    .getOrCreate()

# Define the connection properties

url = "jdbc:oracle:thin:@//hostname:port/service_name"

properties = {

    "user": "your_username",

    "password": "your_password"

}

# Read data from Oracle table

df = spark.read.jdbc(url=url, table="your_table_name", properties=properties)

# Display the dataframe

df.show()

# Perform further transformations or analysis on the dataframe as needed

# Stop the SparkSession

spark.stop()

In the code above, we create a SparkSession and provide the necessary Oracle JDBC driver dependency using the spark.jars.packages configuration. Make sure to replace the hostname, port, and service_name in the url variable with your actual Oracle database connection details.

Next, define the properties dictionary with your Oracle database credentials (user and password).

To read data from a specific table in the Oracle database, use the read.jdbc() method, passing the url, table, and properties as arguments.

After reading the data, you can perform further transformations or analysis on the resulting DataFrame object df. Finally, stop the SparkSession using spark.stop().

Ensure that you have the Oracle JDBC driver accessible to your PySpark environment by either providing the driver JAR file explicitly or using a package management system like Maven or Gradle.