Thursday 30 November 2023

Simple Chat App with PDF documents using OpenAIEmbeddings, FAISS index, langchain and streamlit

import streamlit as st
from dotenv import load_dotenv, find_dotenv
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI  # LangChain's LLM wrapper, used by the chain

# Load environment variables
load_dotenv(find_dotenv())

# Create the LangChain LLM wrapper used by the question answering chain
llm = OpenAI()

# Create an instance of the OpenAI embeddings
embeddings = OpenAIEmbeddings()

# Create a text splitter
text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200, length_function=len)

# Load the question answering chain
chain = load_qa_chain(llm, chain_type="stuff")

# Streamlit app
st.title('Question Answering App')

# Upload the PDF file
uploaded_file = st.file_uploader("Choose a PDF file", type="pdf")

if uploaded_file is not None:
    # Read the PDF file
    doc_reader = PdfReader(uploaded_file)

    # Extract the text from each page
    raw_text = ''
    for page in doc_reader.pages:
        text = page.extract_text()
        if text:
            raw_text += text

    # Split the text into chunks
    texts = text_splitter.split_text(raw_text)

    # Create a FAISS index from the texts
    docsearch = FAISS.from_texts(texts, embeddings)

    # Input the question
    query = st.text_input('Enter your question:')

    # Only answer when the button is clicked and a question was entered
    if st.button('Answer') and query:
        # Perform a similarity search
        docs = docsearch.similarity_search(query)

        # Run the chain and get the answer
        answer = chain.run(input_documents=docs, question=query)

        # Display the answer
        st.write(answer)

This Python script is a Streamlit application that uses OpenAI’s language model to answer questions about the content of a PDF file. Here’s a breakdown of what each part of the code does:

Import necessary libraries: The script starts by importing streamlit, dotenv, PyPDF2, and several modules from langchain: the OpenAI embeddings, the text splitter, the FAISS vector store, the question answering chain, and the OpenAI LLM wrapper.

Load environment variables: find_dotenv() looks for a .env file, starting in the script's directory and moving up through its parent directories, and load_dotenv() loads the variables it defines into the process environment. This keeps sensitive values such as the OpenAI API key (OPENAI_API_KEY) out of the source code.
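If the key is missing, the failure only shows up later, when the first OpenAI call is made. A minimal sketch of an early check follows; `require_env` is a hypothetical helper, not part of dotenv or langchain, and it accepts any mapping so it can be tried without a real key.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable, or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Usage with an explicit mapping, so the example runs without a real key.
api_key = require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-example"})
```

In the app, calling `require_env("OPENAI_API_KEY")` right after `load_dotenv(find_dotenv())` would stop the script with a readable message instead of an API error.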

Create instances of OpenAI and OpenAIEmbeddings: OpenAI here is LangChain's LLM wrapper (imported from langchain.llms), which the question answering chain uses to call the OpenAI API. OpenAIEmbeddings creates embeddings (vector representations of text).

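Create a text splitter: CharacterTextSplitter splits the PDF text on newlines and merges the pieces into chunks of up to 1000 characters, with up to 200 characters of overlap between consecutive chunks so that context is not lost at chunk boundaries.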
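To make the roles of chunk_size and chunk_overlap concrete, here is a simplified pure-Python sketch of the idea. It is not LangChain's implementation; `split_with_overlap` is a hypothetical function written only for illustration.

```python
def split_with_overlap(text, separator="\n", chunk_size=1000, chunk_overlap=200):
    """Pack separator-delimited pieces into chunks, repeating trailing pieces
    of one chunk at the start of the next as overlap.

    A single piece longer than chunk_size is kept whole rather than cut.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    def length(parts):
        return len(separator.join(parts))

    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the remainder fits the overlap
            # budget and still leaves room for the new piece.
            while current and (length(current) > chunk_overlap
                               or length(current + [piece]) > chunk_size):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Tiny sizes make the overlap visible: each chunk repeats the previous line.
demo = split_with_overlap("aaaa\nbbbb\ncccc\ndddd", chunk_size=10, chunk_overlap=5)
# demo == ["aaaa\nbbbb", "bbbb\ncccc", "cccc\ndddd"]
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of embedding some text twice.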

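Load the question answering chain: The chain takes a question and a set of documents as input and returns an answer. With chain_type="stuff", all of the documents are inserted ("stuffed") into a single prompt, so together they must fit within the model's context window.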
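The "stuff" strategy can be pictured as plain string assembly. The sketch below is only an illustration: `build_stuff_prompt` is a hypothetical function, and the prompt wording is an assumption, not the template LangChain actually uses.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate all chunks into one context block followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    ["FAISS stores dense vectors.", "Streamlit builds web apps in Python."],
    "What does FAISS store?",
)
```

Because every retrieved chunk ends up in the same prompt, the chunk size and the number of retrieved chunks together decide how much of the context window is used.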

Streamlit app setup: The st.title('Question Answering App') line sets the title of the Streamlit app.

Upload the PDF file: The st.file_uploader("Choose a PDF file", type="pdf") line creates a file uploader in the Streamlit app that accepts PDF files.

Read the PDF file and split the text into chunks: If a PDF file is uploaded, the script reads the file, extracts the text from each page, and splits the text into chunks using the text splitter created earlier.
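The extraction step can be isolated into a small function that works on any objects exposing extract_text(), as the PDF pages in the script do. This is a sketch under assumptions: `join_page_text` and `FakePage` are hypothetical names, and joining pages with a newline is a choice made here so that the last line of one page does not fuse with the first line of the next.

```python
def join_page_text(pages):
    """Concatenate the text of page-like objects, skipping pages without text."""
    parts = []
    for page in pages:
        text = page.extract_text()
        if text:  # pages with no extractable text (e.g. scanned images) are skipped
            parts.append(text)
    return "\n".join(parts)


class FakePage:
    """Stand-in for a PDF page so the sketch runs without a PDF file."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


raw_text = join_page_text([FakePage("first page"), FakePage(""), FakePage("second page")])
# raw_text == "first page\nsecond page"
```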

Create a FAISS index from the texts: FAISS is a library for efficient similarity search and clustering of dense vectors. FAISS.from_texts embeds each chunk with OpenAIEmbeddings and stores the resulting vectors in an index, which lets the script quickly find the chunks most similar to a given query.
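Conceptually, a flat vector index compares the query vector against every stored vector and returns the closest ones. The sketch below shows that idea in plain Python with toy two-dimensional vectors, assuming Euclidean (L2) distance; `nearest_chunks` is a hypothetical function, and real embeddings have many more dimensions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_chunks(query_vec, chunk_vecs, chunks, k=4):
    """Return the k chunks whose vectors are closest to query_vec."""
    ranked = sorted(
        zip(chunk_vecs, chunks),
        key=lambda pair: l2_distance(query_vec, pair[0]),
    )
    return [chunk for _, chunk in ranked[:k]]


# Toy data: three chunks with made-up 2-D "embeddings".
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
chunks = ["chunk a", "chunk b", "chunk c"]
top = nearest_chunks([0.9, 1.0], vectors, chunks, k=2)
# top == ["chunk b", "chunk a"]
```

FAISS performs the same kind of lookup with optimized data structures, which is what keeps the search fast as the number of chunks grows.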

Input the question: The st.text_input('Enter your question:') line creates a text input in the Streamlit app where you can enter your question.

Answer the question: When the ‘Answer’ button is clicked, the script runs a similarity search to retrieve the chunks closest to the question (similarity_search returns four by default), passes them to the question answering chain together with the question, and displays the answer in the Streamlit app.
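One practical point: Streamlit reruns the whole script on every interaction, so the chunks are embedded and the index is rebuilt each time a question is typed or the button is clicked. A possible remedy is to cache the index per uploaded file. The sketch below shows the pattern with a plain dict; `get_or_build_index` is a hypothetical helper, and in the app the cache could be st.session_state (which supports dict-style access), the bytes could come from uploaded_file.getvalue(), and the builder would wrap the extraction, splitting, and FAISS.from_texts steps.

```python
import hashlib


def get_or_build_index(cache, file_bytes, build_index):
    """Build the index once per distinct file content and reuse it afterwards."""
    key = hashlib.sha256(file_bytes).hexdigest()
    if key not in cache:
        cache[key] = build_index(file_bytes)
    return cache[key]


# Demonstration with a stand-in builder that counts how often it runs.
calls = []


def fake_build(data):
    calls.append(len(data))
    return {"size": len(data)}


cache = {}
first = get_or_build_index(cache, b"pdf bytes", fake_build)
second = get_or_build_index(cache, b"pdf bytes", fake_build)
# The builder ran once; the second call reused the cached result.
```

Keying the cache on a hash of the file content means that uploading a different PDF triggers a rebuild, while asking several questions about the same PDF does not.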