from pyspark.sql import SparkSession

# Create a SparkSession, pulling the Oracle JDBC driver from Maven Central
spark = SparkSession.builder \
    .appName("Oracle Connection") \
    .config("spark.jars.packages", "com.oracle.database.jdbc:ojdbc8:21.9.0.0") \
    .getOrCreate()

# Define the connection properties
url = "jdbc:oracle:thin:@//hostname:port/service_name"
properties = {
    "user": "your_username",
    "password": "your_password",
    "driver": "oracle.jdbc.OracleDriver"
}

# Read data from an Oracle table
df = spark.read.jdbc(url=url, table="your_table_name", properties=properties)

# Display the DataFrame
df.show()

# Perform further transformations or analysis on the DataFrame as needed

# Stop the SparkSession
spark.stop()
In the code above, we create a SparkSession and pull in the Oracle JDBC driver via the spark.jars.packages configuration, which takes Maven coordinates in group:artifact:version form so that Spark resolves and downloads the driver at startup. Make sure to replace the hostname, port, and service_name in the url variable with your actual Oracle database connection details.
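As a concrete illustration, the thin-driver URL can be assembled from its pieces; the host, port, and service name below are placeholders, not real connection details:

```python
# Placeholder connection details -- replace with your own.
host = "dbhost.example.com"
port = 1521
service_name = "ORCLPDB1"

# Thin-driver URL format: jdbc:oracle:thin:@//host:port/service_name
url = f"jdbc:oracle:thin:@//{host}:{port}/{service_name}"
print(url)  # jdbc:oracle:thin:@//dbhost.example.com:1521/ORCLPDB1
```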
Next, define the properties dictionary with your Oracle database credentials (user and password). It is also good practice to set the driver class explicitly ("driver": "oracle.jdbc.OracleDriver") in the same dictionary, which avoids ambiguity when several JDBC drivers are on the classpath.
To read data from a specific table in the Oracle database, use the read.jdbc() method, passing the url, table, and properties as arguments.
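For larger tables, the same JDBC source can read in parallel by splitting the table on a column. The options below are a sketch: the partition column name and the bounds are assumptions and must correspond to a numeric, date, or timestamp column in your actual table.

```python
# Hypothetical partitioned-read options; "id" and the bounds are assumptions.
jdbc_options = {
    "url": "jdbc:oracle:thin:@//hostname:port/service_name",
    "dbtable": "your_table_name",
    "user": "your_username",
    "password": "your_password",
    "driver": "oracle.jdbc.OracleDriver",
    "partitionColumn": "id",   # column Spark splits the read on
    "lowerBound": "1",
    "upperBound": "1000000",
    "numPartitions": "8",      # number of parallel JDBC connections
}

# With a live Oracle connection you would then read:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```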
After reading the data, you can perform further transformations or analysis on the resulting DataFrame object df. Finally, stop the SparkSession using spark.stop().
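One common refinement is to push filtering down to Oracle rather than pulling the whole table into Spark: a parenthesized subquery with an alias can be passed where the table name goes. The column names and threshold below are assumptions for illustration only.

```python
# Hypothetical pushdown: only matching rows leave the database.
# Column names and the threshold are placeholders.
pushdown_query = "(SELECT id, amount FROM your_table_name WHERE amount > 100) t"

# With a live connection you would then read the filtered result:
# df = spark.read.jdbc(url=url, table=pushdown_query, properties=properties)
```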
Ensure that the Oracle JDBC driver is available to your PySpark environment, either by letting Spark resolve it from Maven coordinates (the spark.jars.packages configuration shown above) or by supplying a driver JAR explicitly via spark.jars or spark-submit --jars.
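If you prefer to ship a locally downloaded JAR instead of relying on Maven resolution, the submit-time equivalent looks roughly like this; the JAR path and script name are placeholders:

```shell
# Placeholder paths -- adjust to your environment.
spark-submit \
  --jars /path/to/ojdbc8.jar \
  your_script.py
```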