Tutorial

Interactive Conversations with PDFs Using Langchain

Updated on December 24, 2024

Mohita Narang and jamesskelton


Gone are the days when interacting with PDFs was cumbersome and time-consuming. Users had to open documents manually in software like Adobe Reader and either read the entire document or rely on basic search functions to find specific information. Now, chatting with an AI assistant about a PDF is simple: users upload PDFs to a LangChain-enabled LLM application and receive accurate answers within seconds. Under the hood, the document's text is extracted, embedded, and retrieved alongside the question, a pipeline known as retrieval augmented generation (RAG); optical character recognition (OCR) is only needed when the PDF is a scanned image rather than digital text.

This benefits businesses that require customized interaction with company policies, documents, or reports. It can also help researchers and students identify the important parts of a book or research paper without reading the whole thing.

Introduction to PDF Analyser

1. Document Handling and Embeddings: Load the PDF document using a suitable loader like PyPDFLoader, then clean and structure the text data (removing headers/footers, handling special characters, and segmenting the text into paragraphs or sections). This step can also involve tokenization (breaking text into words), stemming/lemmatization (reducing words to their root form), or stop word removal (eliminating common words like "the" or "a"); a minimal chunking sketch follows this list.
2. Vector Embeddings and Retrieval: Create vector representations of the text chunks extracted from the PDFs. These vectors capture the semantic meaning and relationships between words. At query time, the chatbot generates a vector for the user's question and compares it with the stored document vectors; the documents with the closest vector match are considered the most relevant. Libraries like Gensim or FAISS can be used for this.
3. Language Generation Pipeline: Set up the language generation pipeline using AutoModelForSeq2SeqLM and AutoTokenizer.
4. Create a Retriever and Integrate with LLM: Develop a custom retriever and combine it with the LLM in a RetrievalQA chain to fetch relevant documents for each query.
5. Querying the Chain: Test the system by querying the chain with questions related to the PDF content.
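Here is a minimal sketch of the segmentation step, assuming raw_pdf_text is a placeholder for text already extracted from a PDF and the chunk sizes are tuning choices, not values from the original demo:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split extracted PDF text into overlapping chunks so each piece
# fits comfortably in the embedding model's input window.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters per chunk (a tuning assumption)
    chunk_overlap=100,  # overlap preserves context across chunk boundaries
)
chunks = splitter.split_text(raw_pdf_text)  # raw_pdf_text: placeholder string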

Prerequisites

  1. Python Environment: Install Python 3.7+ and set up a virtual environment.
  2. LangChain Library: Install LangChain (pip install langchain).
  3. Text Extractor: Install a library for PDF text extraction like PyPDF2 or pdfplumber.
  4. Vector Database: Install and configure a vector database (e.g., FAISS, Pinecone) for embeddings.
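The exact package set depends on your LangChain and Transformers versions, but a plausible install for the demo below looks like this (pypdf backs PyPDFLoader, sentence-transformers and InstructorEmbedding back the instructor embedding model, and faiss-cpu is only needed if you build a FAISS index):

pip install langchain langchain-community transformers torch pypdf
pip install sentence-transformers InstructorEmbedding faiss-cpu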

Interacting with PDFs Today

It is now very easy to understand the contents of a PDF: just upload it to the LLM application and ask questions about its content. It is the same as chatting with ChatGPT, except users can upload PDFs directly.

Customized Document Handling

Businesses can now customize the document handling system for more precise interactions with company documents, policies, or reports in PDF format. A vast repository of PDF documents can be prepared for employees, and LLMs can be trained on it. Users can simply upload a document and ask questions in plain language, such as “What are the company’s sick leave policies?”, or the sales team can quickly query up-to-date technical specifications or product information from PDF catalogs or manuals.

Dynamic Content Retrieval

RAG (Retrieval Augmented Generation) techniques incorporate external data in real time, which means businesses with LLM-powered applications can access the most current information from the company database. This ensures that generated responses are up to date and can support decision-making. Imagine a sales team asking about a product’s availability: to provide the latest stock status, the LLM not only retrieves information from the PDF manual but also accesses the company’s inventory database.

Secure and Efficient Information Extraction

Confidentiality is very important in sectors like financial services and legal services. LLMs can maintain privacy and security by providing information from sensitive PDF documents without exposing the entire context, ensuring only authorized information is accessed.

Application in Knowledge Bases

As new policies or procedures are uploaded as PDFs, an LLM can scan and extract information from them and update the knowledge base, refreshing FAQs accordingly. LangChain has built-in integrations with popular storage solutions like Redis; a brief indexing sketch follows.
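As a minimal sketch of such an integration, assuming a Redis instance at localhost and reusing the documents and instructor_embeddings objects created later in this tutorial's demo (the URL and index name are illustrative):

from langchain_community.vectorstores.redis import Redis

# Index policy PDFs in Redis so the knowledge base can be refreshed
# whenever new documents are uploaded
policy_store = Redis.from_documents(
    documents,                            # chunks loaded from the policy PDFs
    instructor_embeddings,                # the embedding model used in the demo
    redis_url="redis://localhost:6379",   # hypothetical connection URL
    index_name="company-policies",        # hypothetical index name
)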

Improved Customer Satisfaction

Customers can get personalized interaction and quick access to relevant information by integrating PDF interaction chatbots. For example, a chatbot can guide customers in assembling a piece of furniture from IKEA. It can provide step-by-step instructions by referring to the PDF user manual and ensure a smooth customer shopping experience.

We walk through a PDF interaction demo using LangChain below. But why use LangChain?

LangChain offers pre-built components like retrieval systems, document loaders, and LLM integration tools. These components have already been tested to work effectively with documents and LLMs, which improves the overall efficiency of the development process and reduces the risk of errors.

Demo Code

This demo uses the pre-trained Hugging Face model ‘flan-t5-large’. Other open-source models, like the FastChat-T5 3B and Falcon 7B models, can also be used. Start the Gradient Notebook by choosing a GPU and cloning the repository. The repository does not include a requirements.txt, so the dependencies were installed separately.

1. Model and Document Loading

Embedding_Model = "hkunlp/instructor-xl"   # model used for text embeddings
LLM_Model = "google/flan-t5-large"         # model used for language generation

from langchain_community.document_loaders import PyPDFLoader

# Path to the PDF document
loader = PyPDFLoader("/content/langchain/qna_from_pdf_file_with_citation_using_local_Model/pdf_files/TRANSFORMERS.pdf")
documents = loader.load_and_split()
  • Embedding_Model (‘hkunlp/instructor-xl’) and LLM_Model (‘google/flan-t5-large’) name the pre-trained models used for text embedding and language generation, respectively. Both are downloaded from Hugging Face.
  • PyPDFLoader loads the PDF file from the given path. Only one PDF document is loaded here; multiple PDFs can be placed in a folder and loaded together (see the sketch after this list).
  • The load_and_split method of the loader reads and splits the PDF content into individual sections or documents for processing.
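For the multi-PDF case, here is a brief sketch using DirectoryLoader (the folder path is illustrative, not from the original demo):

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

# Load every PDF in a folder; each file is parsed with PyPDFLoader
folder_loader = DirectoryLoader(
    "pdf_files/",            # hypothetical folder containing the PDFs
    glob="**/*.pdf",         # match all PDFs, including subfolders
    loader_cls=PyPDFLoader,
)
documents = folder_loader.load()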

2. Testing the Embeddings Mechanism

from langchain_community.embeddings import HuggingFaceInstructEmbeddings

# Create the embedding model and embed a sample string
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name=Embedding_Model)
text = "This is a test document."
query_result = instructor_embeddings.embed_query(text)

Testing the embedding generation process is common practice before integrating it into a larger system, such as a question-answering system that processes PDF documents. With the selected embedding model, an instance of HuggingFaceInstructEmbeddings is created.
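As a quick check (an addition, not part of the original demo), the embedding's dimensionality can be inspected; a fixed-length vector of floats should come back:

print(len(query_result))   # typically 768 for the INSTRUCTOR family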

3. Language Generation Pipeline

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain_community.llms import HuggingFacePipeline

# Load the tokenizer and the seq2seq model weights
tokenizer = AutoTokenizer.from_pretrained(LLM_Model)
model = AutoModelForSeq2SeqLM.from_pretrained(LLM_Model, torch_dtype=torch.float32)

# Wrap the model in a text-to-text generation pipeline
pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    temperature=0,
    top_p=0.95,
    repetition_penalty=1.15
)
llm = HuggingFacePipeline(pipeline=pipe)

  • AutoTokenizer.from_pretrained(LLM_Model): Loads the tokenizer, which converts text into a format the model can understand.
  • AutoModelForSeq2SeqLM.from_pretrained(LLM_Model, torch_dtype=torch.float32): Loads the pre-trained sequence-to-sequence model weights.
  • AutoModelForSeq2SeqLM: A model architecture designed for sequence-to-sequence learning tasks such as machine translation, summarization, and text generation.
  • .from_pretrained(LLM_Model): Downloads (or loads from the local cache) the pre-trained model from the Hugging Face model hub.
  • torch_dtype=torch.float32: Runs the model in 32-bit floating-point precision.
  • pipe = pipeline(...): Creates a text-to-text generation pipeline for generating text with the model.

Parameters for the pipeline:

  • model, tokenizer: Specify the model and tokenizer to use.
  • max_length: Limits the maximum length of the generated text to 512 tokens.
  • temperature (0): Controls randomness in generation (0 means deterministic).
  • top_p (0.95): Nucleus sampling; limits next-token choices to the smallest set whose cumulative probability reaches 0.95.
  • repetition_penalty (1.15): Discourages repetitive text generation.
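As an optional sanity check (not part of the original demo), the wrapped pipeline can be called directly before building the chain; invoke is the current LangChain call style, while older releases use llm("..."):

# Hypothetical test prompt; any short instruction works
print(llm.invoke("Summarize what a transformer does in one sentence."))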

4. Create a Retriever from the Index and Integrate It with the LLM

from langchain_core.retrievers import BaseRetriever
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from typing import List

# A minimal custom retriever; here it simply echoes the query back
# as a document, which is enough to wire up and test the chain
class CustomRetriever(BaseRetriever):
    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        return [Document(page_content=query)]

retriever = CustomRetriever()
retriever.get_relevant_documents("bar")

from langchain.chains import RetrievalQA

# Combine the retriever and the LLM into a question-answering chain
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=retriever,
                                 return_source_documents=True)
question = "explain Power Transformers?"

This code retrieves relevant documents for a query and generates answers to questions from those documents, integrating the retrieval component with a QA pipeline in the LangChain framework.
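Note that the CustomRetriever above simply echoes the query back as a document, which is enough to exercise the chain but does not actually search the PDF. Here is a minimal sketch of a retriever backed by a FAISS index over the loaded pages, reusing documents and instructor_embeddings from the earlier steps (requires faiss-cpu; the k value is a tuning assumption):

from langchain_community.vectorstores import FAISS

# Build a FAISS index over the PDF chunks and expose it as a retriever
db = FAISS.from_documents(documents, instructor_embeddings)
pdf_retriever = db.as_retriever(search_kwargs={"k": 3})  # return the 3 closest chunks

qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=pdf_retriever,
                                 return_source_documents=True)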

5. Query the Chain

question = "Ideal transformers can be characterized as?"
generated_text = qa(question)
generated_text

The qa(question) call runs the chain in real time through the LangChain framework and generates the output.
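Because the chain was created with return_source_documents=True, the call returns a dictionary holding both the answer and the chunks it was grounded on:

result = qa(question)
print(result["result"])                 # the generated answer
for doc in result["source_documents"]:
    print(doc.metadata)                 # e.g. source file and page number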

To integrate this chat feature into an application, training and fine-tuning are required first, and GPUs can fast-track that process. For real-time chats with PDF documents, a CPU alone will be insufficient, as it results in prohibitively long wait times for customer responses; a high-powered GPU is needed there as well.

Closing Thoughts

Integrating RAG techniques streamlines chatbot conversations and ensures secure, up-to-date information retrieval from PDF documents. Further enhancements could be made to the PDF analysis, for example, leveraging OCR technology to handle scanned PDFs or handwritten documents effectively, or surfacing source citations alongside answers. Dive into the world of advanced PDF analysis and chatbot interactions.

Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.
