Tutorial

Learn How to Build a RAG Application using GPU Droplets

Published on October 10, 2024

Introduction to Retrieval Augmented Generation (RAG) for Language Models

In this article, you will learn how to create a Retrieval-Augmented Generation (RAG) application that can work with your PDFs or other data sources. This type of application is useful for handling large amounts of text data, such as books or lecture notes, to help create a chatbot that can answer any query based on the provided data. The best part is that we will be using an open-source model, so there is no need to pay for API access.

RAG has gained immense popularity and is one of the most used AI frameworks for creating customized chatbots. It’s also a powerful tool for building knowledge-driven AI applications.

RAG can be thought of as an AI assistant that is well-versed in user data and human language. When asked a question, it utilizes a library of information to provide a detailed and accurate answer. It is a powerful combination of an information retrieval system and a robust LLM.

Retrieval-Augmented Generation (RAG) significantly improves accuracy by retrieving relevant information from external knowledge sources, enhancing the context and precision of generated responses. By grounding answers in factual data during the retrieval process, RAG can also reduce hallucinations, a common issue in large language models. Additionally, RAG enhances in-context learning by retrieving specific, up-to-date information, making it ideal for use cases like Q&A, document summarization, and interactive workflows.

For example, we used the YOLOv9 research paper as the data source and Llama-2-7b-chat as the LLM. We then asked a few questions about the paper, and here are the relevant parts of the replies.

response=query_engine.query("What is YOLOv9")
YOLOv9 is a new object detection algorithm that improves upon existing methods in several ways. First, it uses a new architecture that reduces the number of parameters and computations while maintaining the same level of accuracy. Second, it uses a new technique called PGI to accurately retain and extract the information needed to map the data to the target. Finally, it is shown to have significantly improved performance compared to existing methods in all aspects.
response_2=query_engine.query("what is GELAN architecture?")
GELAN architecture is a deep neural network designed for object detection tasks. It consists of multiple prediction branches, each of which is connected to a deep supervision branch. The deep supervision branch is responsible for providing auxiliary information to the main branch, which is used to improve the accuracy of object detection. The GELAN architecture also includes an integration network that combines returned gradients from different prediction heads, allowing the main branch to retain complete information about all target objects. This allows the method to alleviate the broken information problem in deep supervision and improve the overall performance of object detection tasks.
response_3=query_engine.query("What is new with YOLO v9?")
YOLO v9 introduces several new features, including:

1. Conventional convolution: YOLO v9 uses conventional convolution instead of depth-wise convolution, which leads to better parameter utilization.  
2. PGI: YOLO v9 uses a new technique called PGI (Progressive Gating and Integration) to accurately retain and extract information needed to map the data to the target.  
3. Large models: YOLO v9 shows huge advantages in using large models, requiring only 66% of the parameters while maintaining accuracy as RT DETR-X.  
4. Comparison with state-of-the-arts: YOLO v9 compares with other train-from-scratch real-time object detectors, showing significant improvement in all aspects.

Please let me know if you have any further questions or if there's anything else I can help you with.

We even tried the application with some personal data, and here is the result.

response=query_engine.query("Who is Shaoni")
Shaoni Mukherjee is a seasoned Technical Writer and AI Specialist with a deep passion for Generative AI and its transformative potential. With over four years of experience in data science and a strong foundation in AI/ML technologies, she specializes in creating in-depth, technical content that simplifies complex concepts. Currently contributing to DigitalOcean, Shaoni focuses on topics like GPU acceleration, deep learning, and large language models (LLMs), ensuring that developers and businesses alike can harness cutting-edge technology. Her expertise lies in breaking down technical innovations into digestible, actionable insights, making her a trusted voice in the world of AI.

Prerequisites

  • Machine Learning Fundamentals: Familiarity with concepts such as embeddings, retrieval systems, and transformers.
  • DigitalOcean Account: Set up an account with DigitalOcean to access GPU Droplets.
  • DigitalOcean GPU Droplets: Create and configure GPU Droplets that are optimized for ML workloads.
  • Transformers Library: Use the transformers library from Hugging Face for loading pre-trained models and fine-tuning them for RAG.
  • Code Editor/IDE: Set up an IDE like VS Code or Jupyter Notebook for code development.

How Does Retrieval-Augmented Generation (RAG) Work?

We all know that large language models (LLMs) are great at generating responses, but if you ask a question about your company's financial status, they will fail and start giving inaccurate information. This happens because LLMs lack access to our private, up-to-date data. By incorporating retrieval-augmented generation (RAG) into a foundation model, we can supply the LLM with that data. This allows us to ask any financial query of the LLM application, and it will answer based on the accurate information we provide as the data source. When we add retrieval-augmented features to a large language model (LLM), it changes how the model finds answers: instead of relying only on what it already knows, the LLM now has access to more accurate, external information.

Here’s how it works:

  1. User Input: A user asks a question.
  2. Retrieval Step: The LLM first checks the data store to find relevant information about the user’s question.
  3. Response Generation: After retrieving this information, the LLM combines it with its knowledge to provide a more accurate and informed answer.


This approach allows the model to improve its responses by incorporating additional information, your own data, rather than relying solely on its existing knowledge. RAG (Retrieval-Augmented Generation) avoids the need to retrain the model on new data. Instead, we can simply keep our existing data store up to date. For instance, if new insights or data are discovered, we can add this new information to our existing resources. As a result, when a user asks a question, the model can access this updated content without going through the entire training process again. This ensures that the model is always capable of providing the most current and relevant answers based on the latest data.

Implementing this approach reduces the likelihood of the model generating incorrect information. It also enables the model to acknowledge when it doesn’t have an answer, if it can’t find a sufficient response within the data store. However, if the retriever doesn’t provide the foundation model with high-quality information, the model might miss answering a question it could have otherwise addressed.

1. User Input (Query)

A user asks a question or provides an input, which can be a statement, query, or task; this input is what will later be augmented with retrieved context.

2. Query Encoding

The user’s input is first converted into a machine-readable format using an embedding model. Embeddings represent the meaning of the query in vector (numeric) form, making it easier to match the query against relevant information. This numerical representation is then compared against the document embeddings stored in a vector database.
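As a rough sketch of this step (using the sentence-transformers library directly; the query text is just an example, and in the full tutorial this encoding is handled for us by LlamaIndex):

from sentence_transformers import SentenceTransformer

# Load the same embedding model we use later in the tutorial
encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Convert the user's question into a 768-dimensional vector
query_vector = encoder.encode("what is GELAN architecture?")
print(query_vector.shape)  # (768,)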

3. Retriever

  • Search for Relevant Data: The encoded query is passed to a retrieval system that searches the vector database. The retriever looks for chunks of text, documents, or data most relevant to the query.
    • The data source can be knowledge bases, articles, or company-specific documentation.
    • Prompts in RAG help bridge the gap between retrieval systems and generative models, ensuring that the model produces accurate and relevant answers.
  • Return Results: The retriever returns the top-ranked documents or information that matches the user’s query. These pieces of information are often referred to as “documents” or “passages.”
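Here is a minimal, self-contained sketch of that search using cosine similarity over a few in-memory text chunks (the chunk texts and the top_k value are purely illustrative; in the tutorial, LlamaIndex's vector store index performs this lookup for us):

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# A toy "vector database": embeddings of a few document chunks
chunks = [
    "GELAN is a network architecture proposed in the YOLOv9 paper.",
    "PGI helps retain gradient information through deep supervision.",
    "Our company revenue grew 12% in Q3.",
]
chunk_vectors = encoder.encode(chunks)

# Encode the query and rank chunks by cosine similarity
query_vector = encoder.encode("what is GELAN architecture?")
scores = util.cos_sim(query_vector, chunk_vectors)[0]
top_k = 2
for i in scores.argsort(descending=True)[:top_k]:
    print(round(float(scores[i]), 3), chunks[int(i)])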

4. Combination of Retrieval and Model Knowledge

  • The retrieved data is fed into a generative language model (like GPT or other LLM). This model combines the information it retrieved with its pre-existing knowledge to generate a response.
    • Grounding the Response: The key difference here is that the model doesn’t only rely on its internal knowledge (learned during training). Instead, it uses the fresh, external data retrieved to provide a more informed, accurate answer.
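Conceptually, the grounding step just places the retrieved passages in the prompt before generation. A minimal, framework-agnostic sketch (the retrieved chunks and the final llm call are placeholders; in this tutorial, LlamaIndex's query engine assembles the prompt for us):

def build_grounded_prompt(question, retrieved_chunks):
    # Put the retrieved passages ahead of the question so the LLM answers
    # from the supplied context rather than from memory alone.
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# prompt = build_grounded_prompt("what is GELAN architecture?", top_chunks)
# response = llm.complete(prompt)  # llm stands for whichever LLM client you use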

Code Demo and Explanation

We recommend going through the tutorial to set up the GPU Droplet and run the code. We have added a link in the references section that will guide you through creating a GPU Droplet and configuring it using VS Code. To begin, you will need a PDF, Markdown, or other documentation files. Make sure to create a separate folder to store them.
Start by installing the necessary packages. The code below installs everything required as a first step.

!pip install pypdf  
!pip install -U bitsandbytes  
!pip install langchain  
!pip install -U langchain-community  
!pip install sentence_transformers  
!pip install llama_index  
!pip install llama-index-llms-huggingface  
!pip install llama-index-llms-huggingface-api  
!pip install llama-index-embeddings-langchain

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, PromptTemplate
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.prompts.prompts import SimpleInputPrompt
from llama_index.llms.huggingface import HuggingFaceLLM
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

The following section contains the complete code for building the RAG application. Each step is explained throughout the article as you read along.

import torch
documents=SimpleDirectoryReader("your/pdf/location/data").load_data()

# print(documents)

system_prompt="""  
You are a Q&A assistant. Your goal is to answer questions as  
accurately as possible based on the instructions and context provided.  
"""  
## Default format supportable by LLama2  
query_wrapper_prompt=SimpleInputPrompt("<|USER|>{query_str}<|ASSISTANT|>")

!huggingface-cli login

llm = HuggingFaceLLM(  
   context_window=4096,  
   max_new_tokens=256,  
   generate_kwargs={"temperature": 0.0, "do_sample": False},  
   system_prompt=system_prompt,  
   query_wrapper_prompt=query_wrapper_prompt,  
   tokenizer_name="meta-llama/Llama-2-7b-chat-hf",  
   model_name="meta-llama/Llama-2-7b-chat-hf",  
   device_map="auto",  
   model_kwargs={"torch_dtype": torch.float16 , "load_in_8bit":True}  
)

embed_model = HuggingFaceEmbeddings(  
   model_name="sentence-transformers/all-mpnet-base-v2"  
)
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)  
Settings.num_output = 512  
Settings.context_window = 3900  
# a vector store index only needs an embed model  
index = VectorStoreIndex.from_documents(  
   documents, embed_model=embed_model
)  
# create a query engine
query_engine = index.as_query_engine(llm=llm)
response=query_engine.query("what is GELAN architecture?")
print(response)

Once we store the data, it needs to be split into chunks. The code below loads the data and splits it into chunks.

# load the data
documents=SimpleDirectoryReader("//your repo path/data").load_data()

# split the data into chunks
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)

Each document will contain the content or text along with its metadata. Since a document can be quite long, we need to split it into smaller chunks. This is part of the preprocessing step for preparing the data for RAG. These smaller, focused pieces of information help the system find and retrieve the relevant context and details more accurately. By breaking documents into clear sections, it becomes easier to locate domain-specific information in passages or facts, which improves the RAG application’s performance. We could also use RecursiveCharacterTextSplitter from langchain.text_splitter; in our case, we are using SentenceSplitter from llama_index.core.node_parser.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=100,
    length_function=len,
    add_start_index=True,
)
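The snippet above only constructs the splitter. To actually produce chunks, you would call its split_documents method; a small sketch, assuming docs is a list of LangChain Document objects you have already loaded (this is separate from the LlamaIndex pipeline used in this tutorial):

# docs is assumed to be a list of langchain Document objects,
# for example loaded from a PDF with a LangChain document loader
chunks = text_splitter.split_documents(docs)
print(f"Split {len(docs)} documents into {len(chunks)} chunks")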

For more information on RecursiveCharacterTextSplitter, please visit the link in the references section.

Now, we will learn about the embeddings!

Embeddings are numerical representations of text data that help capture the data’s underlying meaning. They convert data into vectors, essentially arrays of numbers, making them easier for machine learning models to understand and work with.

In the case of text embeddings (e.g., word or sentence embeddings), vectors are designed so that words or phrases with similar meanings are close to each other in the vector space. For instance, “king” and “queen” would have close vectors, while “king” and “apple” would be far apart. Further, the distance between these vectors can be calculated by cosine similarity or Euclidean distance.
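To see this in practice, here is a small sketch using the same embedding model we adopt below (the exact similarity scores will vary, but the ordering should hold):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
king, queen, apple = model.encode(["king", "queen", "apple"])

# Cosine similarity: semantically related words score higher
print(util.cos_sim(king, queen))  # king vs queen -> relatively high
print(util.cos_sim(king, apple))  # king vs apple -> noticeably lower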

For example, here we will use the “sentence-transformers/all-mpnet-base-v2” model via HuggingFaceEmbeddings.

from langchain.embeddings.huggingface import HuggingFaceEmbeddings  
embed_model = HuggingFaceEmbeddings(  
   model_name="sentence-transformers/all-mpnet-base-v2"  
)

This step involves selecting a pre-trained model, in this case “sentence-transformers/all-mpnet-base-v2”, to generate the embeddings; it is chosen for its compact size and strong performance. We can pick any model from the Sentence Transformers library; this one maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
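As a quick sanity check (embed_query is LangChain's method for embedding a single string; the length below reflects the model's 768-dimensional output):

vector = embed_model.embed_query("What is GELAN architecture?")
print(len(vector))  # 768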

# a vector store index only needs an embed model  
from llama_index.core import VectorStoreIndex  
index = VectorStoreIndex.from_documents(  
   documents, embed_model=embed_model  
)

The same embedding model is then used to create the embeddings for the documents during index construction and to embed any queries sent to the query engine.

# create a query engine  
query_engine = index.as_query_engine(llm=llm)

response=query_engine.query("Who is Shaoni")  
print(response)

Now, let us talk about our LLM. Here we are using the fine-tuned Llama 2 7B chat model for our example. Meta has developed and released the Llama 2 family of large language models (LLMs), which includes a range of pre-trained and fine-tuned generative text models with sizes from 7 billion to 70 billion parameters. These models outperform many open-source chat models and are comparable to popular closed-source models like ChatGPT and PaLM.

Key Details

  • Model Developers: Meta
  • Variations: Llama 2 is available in sizes 7B, 13B, and 70B, with both pre-trained and fine-tuned options.
  • Input/Output: The models take in text and generate text as output.
  • Architecture: Llama 2 uses an auto-regressive transformer architecture, with tuned versions employing supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to better align with human preferences for helpfulness and safety.

However, please feel free to use any other model. Many open-source models from Hugging Face need a short introduction before each prompt, called a system_prompt. The queries might also require an extra wrapper around the query_str. Here we will use both the system_prompt and the query_wrapper_prompt.
system_prompt="""
You are a Q&A assistant. Your goal is to answer questions as
accurately as possible based on the instructions and context provided.
"""  
## Default format supportable by LLama2  
query_wrapper_prompt=SimpleInputPrompt("<|USER|>{query_str}<|ASSISTANT|>")

Now we can put the LLM, the embedding model, and the documents together and ask questions about our data using the lines of code below.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents=SimpleDirectoryReader("//your repo path/data").load_data()  
index = VectorStoreIndex.from_documents(  
   documents, embed_model=embed_model  
)  
query_engine = index.as_query_engine(llm=llm)  
response=query_engine.query("what are the drabacks discussed in yolo v9?")  
print(response)
YOLOv9 has several drawbacks discussed in the paper, including:

1. Computational complexity: While YOLOv9 is Pareto optimal in terms of accuracy and computation complexity among all models with different scales, it still has a relatively high computation complexity compared to other state-of-the-art methods.
2. Parameter utilization: YOLOv9 using conventional convolution has lower parameter utilization than YOLO MS using depth-wise convolution, and even worse, large models of YOLOv9 have lower parameter utilization than RT DETR using ImageNet pretrained model.
3. Training time: YOLOv9 requires a longer training time compared to other state-of-the-art methods, which can be a limitation for real-time object detection applications.

Please let me know if you have any further questions or if there's anything else I can help you with.

Why use GPU Droplets to build next-gen AI-powered applications?

Though this tutorial does not require a high-end GPU, standard CPUs will not be sufficient to handle the computation efficiently: more complex operations, such as generating vector embeddings or running large language models, will be much slower and may lead to performance issues. For optimal performance and faster results, it is recommended to use a capable GPU, especially when working with a large number of documents or a more advanced LLM like Falcon 180B. Using DigitalOcean’s GPU Droplets to create a Retrieval-Augmented Generation (RAG) application offers several benefits:

  1. Speed: GPU Droplets are designed to handle complex calculations quickly, essential for processing large amounts of data. This means it can generate embeddings for a large dataset in a shorter time.
  2. Efficiency with Large Models: RAG applications, as we saw in our tutorial, use large language models (LLMs) to generate responses based on retrieved information. The H100 GPUs can efficiently run these models, enabling them to handle tasks like understanding context and generating human-like text. For instance, if you want to create an intelligent chatbot that answers questions based on a knowledge base and you have a library of documents, using a GPU Droplet will help process the data and answer user queries quickly.
  3. Better Performance: With the H100’s advanced architecture, users can expect higher performance when working with vector embeddings and LLMs. This means your RAG application will be able to retrieve relevant information and generate more accurate and contextually appropriate responses.
  4. Scalability: If the application grows and needs to handle more users or data, H100 GPU Droplets can easily scale to meet those demands. This means fewer worries about performance issues as the application becomes more popular.

Concluding Thoughts

In conclusion, Retrieval-Augmented Generation (RAG) is an important AI framework that significantly enhances the capabilities of large language models (LLMs) to create AI applications. By effectively combining the strengths of information retrieval with the power of large language models, RAG systems can deliver accurate, contextually relevant, and informative responses. This integration improves the quality of interactions across various domains—such as customer support, content creation, and personalized recommendations—and allows organizations to leverage vast amounts of data efficiently. As the demand for intelligent, responsive applications grows, RAG will stand out as a powerful framework that helps developers build more intelligent systems that better serve users’ needs. Its adaptability and effectiveness make it a key player in the future of AI-driven solutions.

Additional References
