In [ ]:
from openai import OpenAI  # requires openai >= 1.2.0
from dotenv import load_dotenv, find_dotenv
import pandas as pd
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

# Load OPENAI_API_KEY (and any other settings) from a local .env file
load_dotenv(find_dotenv())
client = OpenAI()
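The version comment above matters: the OpenAI() client interface only exists in the 1.x line of the openai package. A quick way to confirm what is installed:
In [ ]:
import openai
openai.__version__  # should be >= 1.2.0 for the OpenAI() client used here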
Reading in the PDF
In [ ]:
doc_reader = PdfReader('impromptu-rh.pdf')
In [ ]:
doc_reader
Out[ ]:
<PyPDF2._reader.PdfReader at 0x26b1132ab30>
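The reader object on its own only confirms the file was opened; a quick look at its pages attribute shows how many pages were parsed:
In [ ]:
len(doc_reader.pages)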
In [ ]:
# Read the text from every page and accumulate it in a variable called raw_text
raw_text = ''
for page in doc_reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text
In [ ]:
len(raw_text)
Out[ ]:
371090
In [ ]:
raw_text[:100]
Out[ ]:
'Impromptu\nAmplifying Our Humanity \nThrough AI\nBy Reid Hoffman \nwith GPT-4Impromptu: AmplIfyIng our '
Text Splitter
This takes the text and splits it into chunks. The chunk size is measured in characters, not tokens.
In [ ]:
# Splitting up the text into smaller chunks for indexing
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,  # stride over the text so neighbouring chunks share context
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
len(texts)
Out[ ]:
466
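Because chunk_size counts characters, a rough check on the chunk lengths confirms the splitter stayed near the 1,000-character target (CharacterTextSplitter can exceed it when a single newline-separated piece is longer than chunk_size, so treat this as a sanity check rather than a guarantee):
In [ ]:
# Longest and shortest chunk, in characters
max(len(t) for t in texts), min(len(t) for t in texts)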
In [ ]:
texts[20]
Out[ ]:
'Because, really, an AI book? When things are moving so \nquickly? Even with a helpful AI on hand to speed the process, \nany such book would be obsolete before we started to write it—\nthat’s how fast the industry is moving.\nSo I hemmed and hawed for a bit. And then I thought of a frame \nthat pushed me into action.\nThis didn’t have to be a comprehensive “book” book so much as \na travelog, an informal exercise in exploration and discovery, \nme (with GPT-4) choosing one path among many. A snapshot \nmemorializing—in a subjective and decidedly not definitive \nway—the AI future we were about to experience.\nWhat would we see? What would impress us most? What would \nwe learn about ourselves in the process? Well aware of the brief \nhalf-life of this travelog’s relevance, I decided to press ahead.\nA month later, at the end of November 2022, OpenAI released \nChatGPT, a “conversational agent,” aka chatbot, a modified \nversion of GPT-3.5 that they had fine-tuned through a process'
In [ ]:
texts[10]
Out[ ]:
'one. They’re just going to stare at you with that smug \nlook on their face and say, “You know this is a serious \nsafety hazard. You could have a fire, you could have a \nlawsuit, you could have a bad Yelp review. You better get \non that, pal.” And then they leave, and you’re left with a \ndark corner of your dining room, where no one wants to \nsit, where the food looks unappetizing, where the mood \nis gloomy. And you think to yourself, how many restau-\nrant inspectors does it take to change a lightbulb? None. \nThey don’t change lightbulbs. They just make you feel \nbad about yours.\nPretty good, right? Here’s another:\nReid: How many restaurant inspectors does it take to \nchange a lightbulb? Answer in the style of the philoso -\npher Ludwig Wittgenstein. \nGPT-4: The question of how many restaurant inspectors \nit takes to change a lightbulb cannot be answered with \na simple number. In fact, the answer to this question lies \nmore in the realm of language than it does in the realm'
Making the embeddings
In [ ]:
# OpenAIEmbeddings() wraps OpenAI's embeddings API: each text chunk is sent to the API and
# comes back as a vector that captures its meaning.
embeddings = OpenAIEmbeddings()
# FAISS.from_texts(texts, embeddings) embeds every chunk and builds a FAISS
# (Facebook AI Similarity Search) index over the vectors, which lets us run
# efficient similarity searches against the book.
docsearch = FAISS.from_texts(texts, embeddings)
docsearch.embedding_function
Out[ ]:
OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x0000026B34CE2C80>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x0000026B34D04F40>, model='text-embedding-ada-002', deployment='text-embedding-ada-002', openai_api_version='', openai_api_base=None, openai_api_type='', openai_proxy='', embedding_ctx_length=8191, openai_api_key='sk-xxxxxxxx', openai_organization=None, allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, http_client=None)
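To get a feel for what these embeddings are, you can embed a single string directly; with the default text-embedding-ada-002 model shown above, each vector has 1,536 dimensions:
In [ ]:
# Embed one query string and inspect the resulting vector
query_vector = embeddings.embed_query("how does GPT-4 change social media?")
len(query_vector)  # 1536 for text-embedding-ada-002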
In [ ]:
# docsearch.similarity_search(query) embeds the query and returns the chunks whose
# vectors are closest to it (4 by default).
query = "how does GPT-4 change social media?"
docs = docsearch.similarity_search(query)
len(docs)
Out[ ]:
4
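If you want to see how close each hit actually is, the FAISS store also has a scored variant, similarity_search_with_score, where a lower distance means a closer match. A minimal sketch:
In [ ]:
# Returns (Document, distance) pairs instead of bare Documents
docs_and_scores = docsearch.similarity_search_with_score(query, k=4)
for doc, score in docs_and_scores:
    print(round(score, 3), doc.page_content[:80])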
In [ ]:
docs[0]
Out[ ]:
Document(page_content='cian, GPT-4 and ChatGPT are not only able but also incredi-\nbly willing to focus on whatever you want to talk about.4 This \nsimple dynamic creates a highly personalized user experience. \nAs an exchange with GPT-4 progresses, you are continuously \nfine-tuning it to your specific preferences in that moment. \nWhile this high degree of personalization informs whatever \nyou’re using GPT-4 for, I believe it has special salience for the \nnews media industry.\nImagine a future where you go to a news website and use \nqueries like these to define your experience there:\n4 Provided it doesn’t violate the safety restrictions OpenAI has put on \nthem.93Journalism\n● Hey, Wall Street Journal, give me hundred-word summa-\nries of your three most-read tech stories today.\n● Hey, CNN, show me any climate change stories that hap-\npened today involving policy-making.\n● Hey, New York Times, can you create a counter-argument \nto today’s Paul Krugman op-ed, using only news articles')
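Re-embedding all 466 chunks costs API calls every time the notebook restarts, so it can be worth persisting the index; LangChain's FAISS wrapper supports saving to and loading from a local folder (the folder name below is just illustrative):
In [ ]:
# Save the index to disk so the book does not have to be re-embedded next time
docsearch.save_local("impromptu_faiss_index")
# Later, load it back; the same embeddings object is needed to embed new queries
docsearch = FAISS.load_local("impromptu_faiss_index", embeddings)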
Plain QA Chain
In [ ]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
In [ ]:
# load_qa_chain builds a "question answering chain": a component that answers a question
# from a set of documents. chain_type="stuff" means all of the retrieved documents are
# "stuffed" into a single prompt and sent to the LLM in one call.
chain = load_qa_chain(OpenAI(),
                      chain_type="stuff")  # stuff all the docs into one prompt
chain
Out[ ]:
StuffDocumentsChain(llm_chain=LLMChain(prompt=PromptTemplate(input_variables=['context', 'question'], template="Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"), llm=OpenAI(client=<openai.resources.completions.Completions object at 0x0000026B3B81D330>, async_client=<openai.resources.completions.AsyncCompletions object at 0x0000026B3B839540>, openai_api_key='sk-xxxxxx', openai_proxy='')), document_variable_name='context')
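Stuffing only works while the retrieved chunks fit in the model's context window. load_qa_chain also accepts other chain types, for example map_reduce, which runs the LLM over each document separately and then combines the partial answers; a one-line sketch for comparison:
In [ ]:
# Alternative chain type: more API calls, but handles more/longer documents
chain_mr = load_qa_chain(OpenAI(), chain_type="map_reduce")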
In [ ]:
# check the prompt
chain.llm_chain.prompt.template
Out[ ]:
"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"
In [ ]:
# chain.run(input_documents=docs, question=query) passes the retrieved chunks and the
# question to the LLM and returns its answer.
query = "who are the authors of the book?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)
Out[ ]:
' Reid Hoffman and Sam Altman.'
In [ ]:
query = "who is the author of the book?"
query_02 = "has it rained this week?"
docs = docsearch.similarity_search(query_02)
chain.run(input_documents=docs, question=query)
Out[ ]:
" I don't know."