The Github Experiment -By Ajay Gupta: Simple Chat App with PDF documents using OpenAIEmbeddings, FAISS index, langchain and streamlit

import streamlit as st
from dotenv import load_dotenv, find_dotenv
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
# LangChain's LLM wrapper around an OpenAI language model
from langchain.llms import OpenAI

# Load environment variables
load_dotenv(find_dotenv())

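# Create the LangChain OpenAI LLM wrapper (reads OPENAI_API_KEY from the environment)
llm = OpenAI()

# Create an instance of the OpenAI embeddings
embeddings = OpenAIEmbeddings()

# Create a text splitter: split on newlines into chunks of up to 1000 characters with a 200-character overlap
text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200, length_function=len)

# Load the question answering chain; "stuff" places all retrieved chunks into a single prompt
chain = load_qa_chain(llm, chain_type="stuff")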

# Streamlit app
st.title('Question Answering App')

# Upload the PDF file
uploaded_file = st.file_uploader("Choose a PDF file", type="pdf")

if uploaded_file is not None:
    # Read the PDF file
    doc_reader = PdfReader(uploaded_file)

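    # Extract the text from each page
    raw_text = ''
    for page in doc_reader.pages:
        text = page.extract_text()
        if text:
            raw_text += text + '\n'

    # Split the text into chunks
    texts = text_splitter.split_text(raw_text)

    # Stop early if no text could be extracted (for example, a scanned PDF)
    if not texts:
        st.error('No extractable text was found in this PDF.')
        st.stop()

    # Create a FAISS index from the texts
    docsearch = FAISS.from_texts(texts, embeddings)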

    # Input the question
    query = st.text_input('Enter your question:')

    if st.button('Answer'):
        if not query.strip():
            st.warning('Please enter a question first.')
        else:
            # Perform a similarity search
            docs = docsearch.similarity_search(query)

            # Run the chain and get the answer
            answer = chain.run(input_documents=docs, question=query)

            # Display the answer
            st.write(answer)

This Python script is a Streamlit application that uses OpenAI’s language model to answer questions about the content of a PDF file. Here’s a breakdown of what each part of the code does:

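Import necessary libraries: The script imports its dependencies, including streamlit for the user interface, dotenv for configuration, PyPDF2 for reading PDFs, and several langchain modules for embeddings, text splitting, vector search, and question answering.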

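Load environment variables: The load_dotenv(find_dotenv()) line looks for a .env file, searching upward from the directory of the script, and loads its entries as environment variables. This keeps sensitive information like API keys out of the source code. The LangChain OpenAI classes used here read the key from the OPENAI_API_KEY variable.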
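If the key is missing, the app only fails later, when the first OpenAI call is made. A minimal sketch of an early check is shown below. The helper name get_openai_key is hypothetical and not part of the original script; it only assumes the key lives in OPENAI_API_KEY.

```python
import os


def get_openai_key(env=None):
    """Return the OpenAI API key, or raise a clear error if it is missing."""
    # `env` is injectable so the check can be exercised without touching os.environ
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
    return key
```

In the app, this could be called once, right after load_dotenv(find_dotenv()), so that a missing key is reported before any PDF is processed.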

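Create instances of OpenAI and OpenAIEmbeddings: OpenAI() is the LangChain wrapper around an OpenAI language model, which the question answering chain uses to generate the answer. OpenAIEmbeddings() creates embeddings (vector representations of text) for the PDF chunks and for the question.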

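Create a text splitter: CharacterTextSplitter splits the text from the PDF on newlines and merges the pieces into chunks of up to about 1,000 characters, with a 200-character overlap between neighbouring chunks so that text near a chunk boundary keeps some of its context.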
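To see what chunk_size and chunk_overlap mean in practice, here is a simplified sketch that cuts text into fixed-size windows. It is not how CharacterTextSplitter works internally (that class splits on the separator first and then merges the pieces); the function split_with_overlap is a hypothetical illustration only.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
```

Each chunk starts with the last four characters of the previous one, which is the same idea as the 200-character overlap in the app, just at a smaller scale.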

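Load the question answering chain: A chain is a sequence of operations that takes a question and a set of documents as input and returns an answer. The "stuff" chain type places all of the supplied documents into a single prompt for the language model.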
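The idea behind the "stuff" chain type can be sketched in a few lines. The prompt wording below is an assumption made for illustration and is not LangChain's actual template; build_stuff_prompt is a hypothetical helper.

```python
def build_stuff_prompt(chunks, question):
    """Join every retrieved chunk into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    ["Paris is the capital of France.", "France is in Europe."],
    "What is the capital of France?",
)
```

Because everything goes into one prompt, this approach is simple but only works while the retrieved chunks fit within the context window of the model.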

Streamlit app setup: The st.title('Question Answering App') line sets the title of the Streamlit app.

Upload the PDF file: The st.file_uploader("Choose a PDF file", type="pdf") line creates a file uploader in the Streamlit app that accepts PDF files.

Read the PDF file and split the text into chunks: If a PDF file is uploaded, the script reads the file, extracts the text from each page, and splits the text into chunks using the text splitter created earlier.
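The extraction loop can also be written as a small helper, which makes it easy to try without a real PDF. This is a sketch: join_page_texts and FakePage are hypothetical names, and FakePage only imitates the extract_text() method of a PyPDF2 page.

```python
def join_page_texts(pages):
    """Concatenate the text of each page, skipping pages with no extractable text."""
    parts = []
    for page in pages:
        text = page.extract_text()
        if text:  # image-only pages typically yield no text
            parts.append(text)
    return "\n".join(parts)


class FakePage:
    """Stand-in for a PyPDF2 page object, for illustration only."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


raw_text = join_page_texts([FakePage("Page one."), FakePage(None), FakePage("Page three.")])
```

Joining pages with a newline gives the splitter a separator at every page boundary, so text from two pages is not glued into one long line.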

Create a FAISS index from the texts: FAISS is a library for efficient similarity search and clustering of dense vectors. The script creates a FAISS index from the chunks of text, which allows it to quickly find chunks of text that are similar to a given query.
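Conceptually, a similarity search compares the query vector with every stored vector and keeps the closest ones. Below is a brute-force sketch with toy two-dimensional vectors. Euclidean distance is assumed here for illustration; real embeddings have many more dimensions, and FAISS does this work far more efficiently. The function top_k_nearest is hypothetical.

```python
import math


def top_k_nearest(query_vec, vectors, k=2):
    """Return the indices of the k vectors closest to query_vec by Euclidean distance."""
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(vectors)]
    distances.sort()  # smallest distance first
    return [idx for _, idx in distances[:k]]


# Toy "embeddings" for three chunks and one query
chunk_vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
nearest = top_k_nearest([1.0, 0.05], chunk_vectors, k=2)
```

Here the query lies closest to the second and third vectors, so those two chunks would be passed on to the question answering chain.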

Input the question: The st.text_input('Enter your question:') line creates a text input in the Streamlit app where you can enter your question.

Answer the question: If the ‘Answer’ button is clicked, the script performs a similarity search to find the chunks of text most similar to the question (LangChain returns four by default), runs the question answering chain on these chunks, and displays the answer in the Streamlit app.
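One thing to keep in mind is that Streamlit reruns the whole script on every interaction, including the button click, so the FAISS index is rebuilt (and the embeddings recomputed) each time. A possible way around this is to cache the index by the content of the uploaded file. The sketch below is an assumption, not part of the original app: get_or_build_index is a hypothetical helper, and the cache is passed in explicitly so that st.session_state (which behaves like a dictionary and survives reruns) can be used in the app while a plain dict is used here.

```python
import hashlib


def get_or_build_index(cache, file_bytes, build_index):
    """Build an index once per distinct file content and reuse it afterwards."""
    key = hashlib.sha256(file_bytes).hexdigest()  # identifies the file by its content
    if key not in cache:
        cache[key] = build_index(file_bytes)
    return cache[key]


# Demonstration with a plain dict and a stand-in for the expensive build step
calls = []


def fake_build(data):
    calls.append(len(data))
    return {"size": len(data)}


cache = {}
first = get_or_build_index(cache, b"same pdf bytes", fake_build)
second = get_or_build_index(cache, b"same pdf bytes", fake_build)
```

In the app, build_index would wrap the extract, split, and FAISS.from_texts steps, and the file content could be read with uploaded_file.getvalue(). Streamlit's st.cache_resource decorator is another option for the same purpose.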


Thursday 30 November 2023
