Wednesday 19 July 2023

How to modularize PySpark code that includes SQL queries based on createOrReplaceTempView

To modularize PySpark code that includes SQL queries based on createOrReplaceTempView, follow these steps:

Create separate Python modules for different components of your code.

Define functions in each module to encapsulate the logic related to specific SQL queries and transformations.

Import the necessary modules and call the functions in your main code.

In the example below, the create_temp_views function in queries.py encapsulates the logic to read CSV files, register temporary views with createOrReplaceTempView, and perform a join. Keeping the SQL-related code in its own module makes the queries easier to reuse and maintain.

Note that you need to replace the file paths and SQL queries in the example with your specific requirements. Additionally, you might need to adjust the imports and customize the functions according to your actual codebase.

Create a module named queries.py to define functions for your SQL queries:

# queries.py
from pyspark.sql import SparkSession


def create_temp_views(spark):
    # Register temp views using createOrReplaceTempView
    df1 = spark.read.csv("data/file1.csv", header=True)
    df1.createOrReplaceTempView("table1")

    df2 = spark.read.csv("data/file2.csv", header=True)
    df2.createOrReplaceTempView("table2")

    df3 = spark.sql("SELECT * FROM table1 JOIN table2 ON table1.id = table2.id")
    df3.createOrReplaceTempView("table3")

    # Return the SparkSession for further use, if needed
    return spark


Create your main script and import the queries module to call the functions:

# main.py
from pyspark.sql import SparkSession

import queries

# Create a SparkSession
spark = SparkSession.builder.appName("Modularized PySpark").getOrCreate()

# Call the create_temp_views function from the queries module
spark = queries.create_temp_views(spark)

# Use the registered temp views for further processing
result = spark.sql("SELECT * FROM table3")
result.show()

# Stop the SparkSession
spark.stop()

Best practices for organizing PySpark code with SQL usage

The example below illustrates best practices for organizing PySpark code with extensive SQL usage.

In this example, the SQL code is stored in separate SQL files within the sql/ directory. The data_preprocessing.sql and feature_engineering.sql files encapsulate the respective SQL operations.

The feature_engineering.sql file relies on common functions defined in common_functions.sql within the utils/ subdirectory. This promotes code reuse and keeps shared functions separate from the main SQL code. Note that spark.sql() does not understand CLI directives such as :load, so the main script must execute common_functions.sql before feature_engineering.sql (a helper for running multi-statement SQL files is sketched after the main.py listing below).

The main.py script reads the SQL files and executes the SQL code using spark.sql(). By following this approach, you maintain a clean separation between the PySpark code and the SQL logic, making the project easier to manage and maintain.

Directory Structure:

project/
├── main.py
└── sql/
    ├── data_preprocessing.sql
    ├── feature_engineering.sql
    └── utils/
        ├── common_functions.sql
        └── feature_functions.sql


data_preprocessing.sql:

-- data_preprocessing.sql
-- Define the SQL operations for data preprocessing
CREATE OR REPLACE TEMPORARY VIEW cleaned_data AS
SELECT
    column1,
    column2
    -- Add data preprocessing expressions (casts, cleansing, derived columns) here
FROM
    input_data;


feature_engineering.sql:

-- feature_engineering.sql
-- Common functions are defined in sql/utils/common_functions.sql.
-- spark.sql() cannot process CLI directives such as :load, so main.py must
-- execute common_functions.sql before running this file.

-- Define the SQL operations for feature engineering
CREATE OR REPLACE TEMPORARY VIEW transformed_data AS
SELECT
    column1,
    column2,
    -- Perform feature engineering operations here
    -- Example using a custom feature function
    feature_function(column1) AS new_feature
FROM
    cleaned_data;


common_functions.sql (in sql/utils/):

-- common_functions.sql
-- Define common SQL functions used across multiple SQL files.
-- Note: SQL scalar functions of this form require a Spark/Databricks runtime that
-- supports CREATE FUNCTION ... RETURN <expression>; on runtimes without this,
-- register a Python or Scala UDF instead.
CREATE OR REPLACE TEMPORARY FUNCTION feature_function(input_value DOUBLE)
RETURNS DOUBLE
RETURN input_value * 2;  -- placeholder logic; replace with your feature calculation


main.py:

from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.getOrCreate()

# Load input data
input_data = spark.read.csv("data.csv", header=True, inferSchema=True)
input_data.createOrReplaceTempView("input_data")

# Load and execute SQL scripts (each file contains a single statement)
with open("sql/utils/common_functions.sql") as f:
    common_functions_sql = f.read()
spark.sql(common_functions_sql)

with open("sql/data_preprocessing.sql") as f:
    data_preprocessing_sql = f.read()
spark.sql(data_preprocessing_sql)

with open("sql/feature_engineering.sql") as f:
    feature_engineering_sql = f.read()
spark.sql(feature_engineering_sql)

# Retrieve the transformed data
transformed_data = spark.table("transformed_data")

# Perform further operations on transformed_data or train models, etc.
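spark.sql() executes exactly one statement per call, so the pattern above only works while each .sql file contains a single statement (comments are fine). If a file grows to hold several statements, a small helper can split and run them. Below is a minimal sketch; the helper name run_sql_file is hypothetical, and the split on ';' is naive and assumes no semicolons appear inside string literals.

def run_sql_file(spark, path):
    # Execute every statement in a .sql file with spark.sql().
    # Naive split on ';' -- assumes no semicolons inside string literals.
    with open(path) as f:
        script = f.read()
    for statement in script.split(";"):
        lines = [line for line in statement.splitlines() if line.strip()]
        # Skip fragments that are empty or contain only '--' comments
        if not lines or all(line.strip().startswith("--") for line in lines):
            continue
        spark.sql(statement)

# Usage:
# run_sql_file(spark, "sql/utils/common_functions.sql")
# run_sql_file(spark, "sql/feature_engineering.sql")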

How to modularize your PySpark code

The following code example shows how to modularize your PySpark code:

Let's assume we have a PySpark project for text classification. We'll modularize the code into three modules: data_preprocessing.py, feature_engineering.py, and model_training.py.

In this example, each module (data_preprocessing.py, feature_engineering.py, and model_training.py) encapsulates a specific functionality and contains related functions. The main script imports these modules and calls the functions to execute the desired tasks. This modular approach makes the code more organized, reusable, and easier to maintain.

data_preprocessing.py module:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower, regexp_replace


def load_data(spark, input_path):
    # Load data from the input path and return a DataFrame
    data = spark.read.csv(input_path, header=True, inferSchema=True)
    return data


def clean_text(data, text_column):
    # Remove non-alphabetic characters from the specified text column
    cleaned_data = data.withColumn(
        "cleaned_text", regexp_replace(col(text_column), "[^a-zA-Z\\s]", "")
    )
    return cleaned_data


feature_engineering.py module:

from pyspark.ml.feature import Tokenizer, CountVectorizer, StringIndexer
from pyspark.ml import Pipeline


def create_features(data, text_column):
    # Tokenize the text, then build features with CountVectorizer and index the label.
    # Note: CountVectorizer expects an array of tokens, so a Tokenizer stage is required.
    tokenizer = Tokenizer(inputCol=text_column, outputCol="tokens")
    vectorizer = CountVectorizer(inputCol="tokens", outputCol="features")
    indexer = StringIndexer(inputCol="label", outputCol="label_index")
    pipeline = Pipeline(stages=[tokenizer, vectorizer, indexer])
    transformed_data = pipeline.fit(data).transform(data)
    return transformed_data


model_training.py module:

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


def train_model(data):
    # Train a random forest classifier on the input data
    rf = RandomForestClassifier(labelCol="label_index", featuresCol="features")
    model = rf.fit(data)
    return model


def evaluate_model(model, data):
    # Evaluate the trained model on the input data
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label_index", predictionCol="prediction", metricName="accuracy"
    )
    accuracy = evaluator.evaluate(model.transform(data))
    return accuracy


Main script:

from pyspark.sql import SparkSession

from data_preprocessing import load_data, clean_text
from feature_engineering import create_features
from model_training import train_model, evaluate_model

# Create SparkSession
spark = SparkSession.builder.getOrCreate()

# Load data
data = load_data(spark, "data.csv")

# Clean text
cleaned_data = clean_text(data, "text_column")

# Create features
transformed_data = create_features(cleaned_data, "cleaned_text")

# Train model
model = train_model(transformed_data)

# Evaluate model
accuracy = evaluate_model(model, transformed_data)
print("Accuracy:", accuracy)

How to modularize PySpark code that involves SQL operations using SQL files

Here is an example of how to modularize PySpark code that involves SQL operations using SQL files.

Create a directory structure for your project:

project/
├── main.py
└── sql/
    ├── data_preprocessing.sql
    └── feature_engineering.sql


data_preprocessing.sql:

-- data_preprocessing.sql
-- Define the SQL operations for data preprocessing
CREATE OR REPLACE TEMPORARY VIEW cleaned_data AS
SELECT
    column1,
    column2
    -- Add data preprocessing expressions here
FROM
    input_data;


feature_engineering.sql:

-- feature_engineering.sql
-- Define the SQL operations for feature engineering
CREATE OR REPLACE TEMPORARY VIEW transformed_data AS
SELECT
    column1,
    column2
    -- Add feature engineering expressions here
FROM
    cleaned_data;


main.py:

from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.getOrCreate()

# Load input data
input_data = spark.read.csv("data.csv", header=True, inferSchema=True)
input_data.createOrReplaceTempView("input_data")

# Load and execute SQL scripts (each file contains a single statement)
with open("sql/data_preprocessing.sql") as f:
    data_preprocessing_sql = f.read()
spark.sql(data_preprocessing_sql)

with open("sql/feature_engineering.sql") as f:
    feature_engineering_sql = f.read()
spark.sql(feature_engineering_sql)

# Retrieve the transformed data
transformed_data = spark.table("transformed_data")


In this example, the SQL code for data preprocessing and feature engineering is stored in separate SQL files (data_preprocessing.sql and feature_engineering.sql). The main.py script reads these SQL files, executes the SQL code using spark.sql(), and creates temporary views for intermediate data.

You can organize and structure your SQL files based on the operations you want to perform, and then include them in your PySpark code as needed. This approach separates the SQL logic from the Python code, making it easier to manage and maintain complex SQL operations within your PySpark project.
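One common refinement is to parameterize the SQL files so the same file can run against different views or date ranges. The sketch below assumes your SQL files use Python str.format-style placeholders such as {input_view}; these are plain Python string formatting, not a Spark feature.

def run_templated_sql(spark, path, **params):
    # Read a SQL template, substitute named placeholders, then execute it.
    with open(path) as f:
        template = f.read()
    return spark.sql(template.format(**params))

# Usage (hypothetical template containing "... FROM {input_view}"):
# run_templated_sql(spark, "sql/data_preprocessing.sql", input_view="input_data")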

Sunday 9 July 2023

How to read data from an Oracle database table using PySpark

from pyspark.sql import SparkSession

# Create a SparkSession
# Note: the Maven coordinate below assumes the Oracle JDBC driver published as
# com.oracle.database.jdbc:ojdbc8; adjust the artifact and version to your environment.
spark = SparkSession.builder \
    .appName("Oracle Connection") \
    .config("spark.jars.packages", "com.oracle.database.jdbc:ojdbc8:19.3.0.0") \
    .getOrCreate()

# Define the connection properties
url = "jdbc:oracle:thin:@//hostname:port/service_name"
properties = {
    "user": "your_username",
    "password": "your_password"
}

# Read data from Oracle table
df = spark.read.jdbc(url=url, table="your_table_name", properties=properties)

# Display the dataframe
df.show()

# Perform further transformations or analysis on the dataframe as needed

# Stop the SparkSession
spark.stop()


In the code above, we create a SparkSession and provide the necessary Oracle JDBC driver dependency using the spark.jars.packages configuration. Make sure to replace the hostname, port, and service_name in the url variable with your actual Oracle database connection details.

Next, define the properties dictionary with your Oracle database credentials (user and password).

To read data from a specific table in the Oracle database, use the read.jdbc() method, passing the url, table, and properties as arguments.

After reading the data, you can perform further transformations or analysis on the resulting DataFrame object df. Finally, stop the SparkSession using spark.stop().

Ensure that the Oracle JDBC driver is accessible to your PySpark environment, either by supplying the driver JAR explicitly (for example via spark.jars or --jars) or by letting Spark resolve it from a Maven repository through spark.jars.packages.
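For large tables, a single JDBC connection can become a bottleneck. Spark's standard JDBC options partitionColumn, lowerBound, upperBound, and numPartitions allow a parallel read; in the sketch below, "your_numeric_column" and the bounds are placeholders you must replace with a numeric column and its actual value range.

# Read the table in parallel across 8 partitions
df = (
    spark.read.format("jdbc")
    .option("url", url)
    .option("dbtable", "your_table_name")
    .option("user", properties["user"])
    .option("password", properties["password"])
    .option("partitionColumn", "your_numeric_column")
    .option("lowerBound", 1)
    .option("upperBound", 1000000)
    .option("numPartitions", 8)
    .load()
)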

You can apply filters while reading data from an Oracle table using PySpark. The read.jdbc() method in PySpark supports specifying a query as the table parameter, allowing you to apply filters directly in the SQL query. Here's an example:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Oracle Connection") \
    .config("spark.jars.packages", "oracle:oracle-jdbc:8.2.0.0") \
    .getOrCreate()

# Define the connection properties
url = "jdbc:oracle:thin:@//hostname:port/service_name"
properties = {
    "user": "your_username",
    "password": "your_password"
}

# Define the filter condition
filter_condition = "column_name = 'value'"

# Read data from Oracle table with filter
query = f"(SELECT * FROM your_table_name WHERE {filter_condition}) AS filtered_table"
df = spark.read.jdbc(url=url, table=query, properties=properties)

# Display the dataframe
df.show()

# Perform further transformations or analysis on the dataframe as needed

# Stop the SparkSession
spark.stop()


In the code above, we define a filter condition as filter_condition using the desired column name and value. We then construct a SQL query by incorporating the filter condition within a subquery.

The subquery is defined as (SELECT * FROM your_table_name WHERE {filter_condition}) AS filtered_table, where your_table_name is the name of the Oracle table you want to read from.

Pass the constructed query as the table parameter in the read.jdbc() method. This will read only the filtered data from the Oracle table into the DataFrame df.

After reading the filtered data, you can perform further transformations or analysis on the resulting DataFrame object df.

Remember to replace the hostname, port, service_name, your_username, your_password, column_name, and value with the appropriate values for your Oracle database and filtering requirements.
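An alternative worth knowing, sketched below: read the table name directly and apply .filter() on the resulting DataFrame. For simple comparisons, Spark's JDBC source pushes the predicate down to the database, so only matching rows are transferred.

from pyspark.sql.functions import col

df = spark.read.jdbc(url=url, table="your_table_name", properties=properties)

# Simple equality predicates like this are pushed down to the database
filtered_df = df.filter(col("column_name") == "value")
filtered_df.show()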

How to write data from a DataFrame to a CSV file using PySpark

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Write to CSV") \
    .getOrCreate()

# Assume you have a DataFrame named "df" containing the data

# Write the DataFrame to a CSV file
df.write.csv("path/to/your/file.csv", header=True, mode="overwrite")

# Stop the SparkSession
spark.stop()


In the code above, after creating a SparkSession, we assume you have a DataFrame named df that contains the data you want to write to a CSV file.

To write the DataFrame to a CSV file, use the write.csv() method and specify the desired path to the output file. Set header=True to include the column headers in the CSV file.

You can also specify the mode parameter to determine how the file is written. In the example, we use mode="overwrite" to overwrite the file if it already exists. Other options for mode include "append" and "ignore".

After the write completes, you will find the output at the specified path ("path/to/your/file.csv" in the example). Note that Spark writes a directory containing one part file per partition, rather than a single CSV file with that name.

Finally, stop the SparkSession using spark.stop().

Make sure to replace "path/to/your/file.csv" with the actual path and file name where you want to save the CSV file.
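If you need a single output file rather than a directory of part files, one common approach is sketched below. Note that coalesce(1) funnels all data through a single task, so use it only for small results.

# Collapse to a single partition so Spark writes exactly one part file
df.coalesce(1).write.csv("path/to/your/output_dir", header=True, mode="overwrite")

# The output directory will contain one part-*.csv file, which you can rename afterwards.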

List of SQL functions available in PySpark SQL with examples

Below is a selection of commonly used SQL functions in PySpark SQL, along with an example for each:

Mathematical Functions:

abs(): Absolute value of a number.

SELECT abs(-5) AS result
-- Output: 5

ceil(): Ceiling of a number.

SELECT ceil(4.2) AS result
-- Output: 5

floor(): Floor of a number.

SELECT floor(4.8) AS result
-- Output: 4

round(): Rounds a number to the specified number of decimal places.

SELECT round(3.14159, 2) AS result
-- Output: 3.14

exp(): Exponential value of a number.

SELECT exp(1) AS result
-- Output: 2.718281828459045

log(): Natural logarithm of a number.

SELECT log(10) AS result
-- Output: 2.302585092994046

sqrt(): Square root of a number.

SELECT sqrt(25) AS result
-- Output: 5.0

power(): Raises a number to the specified power.

SELECT power(2, 3) AS result
-- Output: 8.0


String Functions:

concat(): Concatenates multiple strings.

SELECT concat('Hello', ' ', 'World') AS result
-- Output: Hello World

substring(): Extracts a substring from a string.

SELECT substring('Hello World', 7, 5) AS result
-- Output: World

length(): Length of a string.

SELECT length('Hello') AS result
-- Output: 5

trim(): Removes leading and trailing spaces from a string.

SELECT trim('  Hello  ') AS result
-- Output: Hello

upper(): Converts a string to uppercase.

SELECT upper('hello') AS result
-- Output: HELLO

lower(): Converts a string to lowercase.

SELECT lower('HELLO') AS result
-- Output: hello

replace(): Replaces occurrences of a substring in a string.

SELECT replace('Hello, Hello, Hello', 'Hello', 'Hi') AS result
-- Output: Hi, Hi, Hi

regexp_extract(): Extracts substrings using a regular expression.

SELECT regexp_extract('Hello 123 World', '\\d+', 0) AS result
-- Output: 123


Date and Time Functions:

current_date(): Returns the current date.

SELECT current_date() AS result
-- Output (example): 2023-07-09

current_timestamp(): Returns the current timestamp.

SELECT current_timestamp() AS result
-- Output (example): 2023-07-09 12:34:56

date_format(): Formats a date or timestamp.

SELECT date_format(current_timestamp(), 'yyyy-MM-dd HH:mm:ss') AS result
-- Output (example): 2023-07-09 12:34:56

year(), month(), day(): Extract the year, month, and day from a date or timestamp.

SELECT year('2023-07-09') AS year, month('2023-07-09') AS month, day('2023-07-09') AS day
-- Output: year=2023, month=7, day=9

datediff(): Calculates the number of days between two dates.

SELECT datediff('2023-07-31', '2023-07-01') AS result
-- Output: 30

date_add(), date_sub(): Add or subtract a specified number of days to/from a date.

SELECT date_add('2023-07-01', 7) AS added_date, date_sub('2023-07-31', 7) AS subtracted_date
-- Output: added_date=2023-07-08, subtracted_date=2023-07-24


Aggregate Functions:

count(): Count of rows or non-null values in a column.

SELECT count(*) AS result
-- Output: Total number of rows in the table

sum(): Sum of values in a column.

SELECT sum(salary) AS result
-- Output: Sum of salaries in the column

avg(): Average of values in a column.

SELECT avg(age) AS result
-- Output: Average age in the column

min(), max(): Minimum and maximum values in a column.

SELECT min(price) AS min_price, max(price) AS max_price
-- Output: Minimum and maximum price values

first(), last(): First and last values in a group.

SELECT first(name) AS first_name, last(name) AS last_name
-- Output: First and last names in the group

concat_ws() with collect_list(): Concatenates values from multiple rows into a single string. Spark SQL has no group_concat(); this combination is the equivalent.

SELECT concat_ws(',', collect_list(name)) AS result
-- Output: Comma-separated string of names in the group
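These functions are run from PySpark with spark.sql() against a registered view, or with selectExpr() directly on a DataFrame. A minimal sketch follows; the people DataFrame and its name, age, and salary columns are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Small example DataFrame registered as a view
people = spark.createDataFrame(
    [("Alice", 34, 55000.0), ("Bob", 45, 62000.0)],
    ["name", "age", "salary"],
)
people.createOrReplaceTempView("people")

# Same SQL functions, two equivalent invocation styles
spark.sql("SELECT upper(name) AS name, round(salary / 12, 2) AS monthly FROM people").show()
people.selectExpr("avg(age) AS avg_age", "concat_ws(',', collect_list(name)) AS names").show()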

How to use multiple DataFrames as tables and perform transformations using SQL in PySpark

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Multiple DataFrame SQL") \
    .getOrCreate()

# Assume we have three DataFrames: df1, df2, and df3

# Register each DataFrame as a temporary view
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
df3.createOrReplaceTempView("table3")

# Execute SQL queries on the registered views
result = spark.sql("""
    SELECT t1.col1, t2.col2, t3.col3
    FROM table1 t1
    JOIN table2 t2 ON t1.id = t2.id
    JOIN table3 t3 ON t2.id = t3.id
    WHERE t1.col1 > 10
""")

# Display the result
result.show()

# Perform further transformations or analysis on the result as needed

# Stop the SparkSession
spark.stop()


In the code above, we assume you have three DataFrames: df1, df2, and df3. Each DataFrame is registered as a temporary table using the createOrReplaceTempView() method, assigning them unique table names: table1, table2, and table3.

Next, you can execute SQL queries against these registered tables using the spark.sql() method. In the example, we perform a JOIN operation on the tables and apply a filter condition.

After executing the SQL query, you can work with the result as a new DataFrame (result) and perform further transformations or analysis as needed.

Finally, stop the SparkSession using spark.stop().

Ensure that you have created the DataFrames df1, df2, and df3 before registering them as tables and executing SQL queries against them.
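The same result can be produced with the DataFrame API, without registering views. The sketch below assumes the same df1, df2, and df3 and the same column names as the SQL example above.

from pyspark.sql.functions import col

result = (
    df1.alias("t1")
    .join(df2.alias("t2"), col("t1.id") == col("t2.id"))
    .join(df3.alias("t3"), col("t2.id") == col("t3.id"))
    .where(col("t1.col1") > 10)
    .select(col("t1.col1"), col("t2.col2"), col("t3.col3"))
)
result.show()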

To read data from a Hive table using PySpark

from pyspark.sql import SparkSession

# Create a SparkSession with Hive support
spark = SparkSession.builder \
    .appName("Hive Read") \
    .enableHiveSupport() \
    .getOrCreate()

# Read data from a Hive table
df = spark.sql("SELECT * FROM your_hive_table")

# Display the dataframe
df.show()

# Perform further transformations or analysis on the dataframe as needed

# Stop the SparkSession
spark.stop()


In the code above, we create a SparkSession with Hive support by calling .enableHiveSupport() during the session creation.

To read data from a Hive table, you can use the spark.sql() method and pass a SQL query to select the desired data from the table. Replace "your_hive_table" in the SQL query with the actual name of your Hive table.

After reading the data, you can perform further transformations or analysis on the resulting DataFrame object df. Finally, stop the SparkSession using spark.stop().

Ensure that your Spark cluster is properly configured to work with Hive, and the necessary Hive metastore configuration is set up correctly.
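Writing back to Hive works through the DataFrameWriter. A minimal sketch follows; "your_database.your_output_table" is a placeholder table name.

# Save the DataFrame as a managed Hive table (creates or overwrites it)
df.write.mode("overwrite").saveAsTable("your_database.your_output_table")

# Alternatively, insert into an existing table with a matching schema
# df.write.insertInto("your_database.your_output_table", overwrite=False)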

How to connect to an Oracle database using PySpark

from pyspark.sql import SparkSession

# Create a SparkSession
# Note: the Maven coordinate below assumes the Oracle JDBC driver published as
# com.oracle.database.jdbc:ojdbc8; adjust the artifact and version to your environment.
spark = SparkSession.builder \
    .appName("Oracle Connection") \
    .config("spark.jars.packages", "com.oracle.database.jdbc:ojdbc8:19.3.0.0") \
    .getOrCreate()

# Define the connection properties
url = "jdbc:oracle:thin:@//hostname:port/service_name"
properties = {
    "user": "your_username",
    "password": "your_password"
}

# Read data from Oracle table
df = spark.read.jdbc(url=url, table="your_table_name", properties=properties)

# Display the dataframe
df.show()

# Perform further transformations or analysis on the dataframe as needed

# Stop the SparkSession
spark.stop()

In the code above, we create a SparkSession and provide the necessary Oracle JDBC driver dependency using the spark.jars.packages configuration. Make sure to replace the hostname, port, and service_name in the url variable with your actual Oracle database connection details.

Next, define the properties dictionary with your Oracle database credentials (user and password).

To read data from a specific table in the Oracle database, use the read.jdbc() method, passing the url, table, and properties as arguments.

After reading the data, you can perform further transformations or analysis on the resulting DataFrame object df. Finally, stop the SparkSession using spark.stop().

Ensure that the Oracle JDBC driver is accessible to your PySpark environment, either by supplying the driver JAR explicitly (for example via spark.jars or --jars) or by letting Spark resolve it from a Maven repository through spark.jars.packages.

PySpark code to read data from a CSV file

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("CSV Read").getOrCreate()

# Read the CSV file
df = spark.read.csv("path/to/your/file.csv", header=True, inferSchema=True)

# Display the dataframe
df.show()

# Perform further transformations or analysis on the dataframe as needed

# Stop the SparkSession
spark.stop()

In the code above, we import the necessary modules from pyspark.sql, create a SparkSession, and then use the read.csv() method to read the CSV file. The header=True parameter indicates that the first row of the CSV file contains the column names. The inferSchema=True parameter tells Spark to infer the data types of the columns.

After reading the CSV file, you can perform additional transformations, filtering, or analysis on the DataFrame object df. Finally, you can stop the SparkSession using spark.stop().

Make sure to replace "path/to/your/file.csv" with the actual path to your CSV file.
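Because inferSchema requires an extra pass over the file, on large datasets it is often better to supply an explicit schema. A sketch with hypothetical column names id, name, and amount:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Explicit schema: no inference pass, and column types are guaranteed
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("amount", DoubleType(), True),
])

df = spark.read.csv("path/to/your/file.csv", header=True, schema=schema)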