Retrieval-Augmented Generation (RAG) applications have fundamentally changed how we access information. By combining information retrieval with generative AI, RAG models deliver precise and contextually relevant outputs. However, the success of a RAG application hinges on one crucial factor: the quality of its dataset.
Not all data is created equal, and the distinction between “good” and “bad” data can make or break your RAG model. In this article, we’ll explore what sets good data apart, why bad data can derail your efforts, and how to gather the right kind of data to power your RAG application. By the end of this article, you will have a clear understanding of each of these points, making it an excellent primer for curating a dataset for an AI Agent built with the DigitalOcean GenAI Platform.
To fully benefit from this article, it’s helpful to have some prior knowledge of or experience with the basics of generative AI and large language models, Python, and general data handling such as cleaning and preprocessing.
If these concepts are new to you, consider exploring introductory resources or tutorials before diving deeper into dataset creation for RAG applications.
RAG combines a retriever that fetches relevant information from a dataset with a generator that uses this data to craft insightful responses. This dual approach makes RAG applications incredibly versatile, with use cases ranging from customer support bots to medical diagnostics.
The dataset forms the backbone of this process, acting as the knowledge source for retrieval and generation. High-quality data ensures the retriever fetches accurate and relevant content while the generator produces coherent, contextually appropriate outputs. There is an old saying in the RAG space… “garbage in, garbage out”. As simple as the saying is, it captures the core challenge well: a dataset polluted with irrelevant or noisy data undermines everything built on top of it.
The retriever is responsible for identifying and fetching the most relevant information from a dataset. It typically uses techniques such as vector search, BM25, or semantic search powered by dense embeddings to locate content that matches the user’s query. The retriever’s ability to identify contextually appropriate data relies heavily on the quality and structure of the dataset. For example, a corpus of clean, well-organized documents makes it far easier to surface the right passage, while noisy or duplicated content leads to poor matches.
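To make the retrieval step concrete, here is a minimal sketch of dense-embedding retrieval using the sentence-transformers library. The library choice, model name, and tiny in-memory document list are illustrative assumptions, not requirements of any particular RAG stack:
from sentence_transformers import SentenceTransformer, util
# A tiny illustrative corpus; in practice these vectors would live in a vector database
documents = [
    "Kubernetes Pods are the smallest deployable units of computing.",
    "A Deployment provides declarative updates for Pods and ReplicaSets.",
    "Our cafeteria menu changes every Tuesday.",
]
model = SentenceTransformer('all-MiniLM-L6-v2')
doc_embeddings = model.encode(documents, convert_to_tensor=True)
# Embed the user query and rank documents by cosine similarity
query = "How do I update my Pods declaratively?"
query_embedding = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
# The highest-scoring document is what the retriever hands to the generator
print(documents[int(scores.argmax())])
Swapping in BM25 or a managed vector database changes the mechanics but not the principle: retrieval quality is bounded by dataset quality.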
Once the retriever fetches the relevant data, the generator takes over. Using generative AI models like Meta Llama, Falcon, or other transformers, the generator synthesizes this information into a coherent and contextually relevant response. The interaction between the generator and the retriever is critical: the generator can only be as good as the context it is handed.
The interplay between the retriever and generator can be likened to a relay race. The retriever passes the baton—in the form of retrieved information—to the generator, which then delivers the final output. A breakdown in this handoff, such as irrelevant retrievals or poorly structured context, can significantly degrade the final response.
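To illustrate the handoff, here is a minimal sketch of packing retrieved passages into a prompt for the generator. The generate() call is hypothetical and stands in for whichever model (Meta Llama, Falcon, or another transformer) your application uses:
# Retrieved passages from the previous step; illustrative content only
retrieved_passages = [
    "A Deployment provides declarative updates for Pods and ReplicaSets.",
    "Use kubectl apply -f deployment.yaml to apply a Deployment manifest.",
]
question = "How do I update my Pods declaratively?"
# Stitch the retrieved context into the prompt so the generator reasons only over vetted data
context = "\n\n".join(retrieved_passages)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}\n"
    "Answer:"
)
# response = generate(prompt)  # hypothetical call into your chosen LLM
print(prompt)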
What separates good data from bad? Let’s break it down:
Relevance: Your data should align with your application’s domain. For example, a legal RAG tool must prioritize legal documents over unrelated articles.
Accuracy: Data should be factual and verified. Incorrect information can lead to erroneous outputs.
Diversity: Incorporate varied perspectives and examples to prevent narrow responses.
Balance: Avoid over-representing specific topics, helping to ensure fair and unbiased outputs; a quick way to check for skew is sketched after this list.
Structure: Well-organized data allows efficient retrieval and generation.
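To illustrate the Balance point above, here is a minimal sketch of checking topical skew with pandas. It assumes your dataset carries a topic column and uses an illustrative 40% threshold; adjust both to your own data:
import pandas as pd
# Assumes a 'topic' column; a heavily skewed distribution hints at over-representation
df = pd.read_csv('data.csv')
topic_share = df['topic'].value_counts(normalize=True)
print(topic_share)
# Flag topics that make up more than 40% of the dataset (illustrative threshold)
print(topic_share[topic_share > 0.40])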
To build a winning dataset:
Define Clear Objectives: Understand your RAG application’s purpose and audience.
Source Reliably: Use trustworthy, domain-specific sources like scholarly articles or curated databases.
Filter and Clean: Use preprocessing tools to remove noise, duplicates, and irrelevant content.
Example Cleaning Text: Use NLTK for text normalization (the tokenizer models and stopword list must be downloaded once):
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Download the tokenizer models and stopword list (only needed once)
nltk.download('punkt')
nltk.download('stopwords')
text = "Sample text for cleaning."
tokens = word_tokenize(text)
# Drop common English stopwords to reduce noise
filtered = [word for word in tokens if word.lower() not in stopwords.words('english')]
Example Cleaning Data: Use Python with pandas:
import pandas as pd
# Load dataset
df = pd.read_csv('data.csv')
# Remove duplicates
df = df.drop_duplicates()
# Filter out irrelevant rows based on criteria
df = df[df['relevance_score'] > 0.8]
df.to_csv('cleaned_data.csv', index=False)
Annotate Data: Label data to highlight context, relevance, or priority.
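Example Annotating Data: a minimal sketch with pandas, assuming the cleaned file from above plus a hypothetical text column; the labeling rules here are purely illustrative:
import pandas as pd
# Load the cleaned dataset produced in the previous step
df = pd.read_csv('cleaned_data.csv')
# Tag each row with a coarse context label (assumes a 'text' column; the rule is illustrative)
df['context'] = df['text'].str.contains('deployment', case=False, na=False).map({True: 'workloads', False: 'general'})
# Flag high-relevance rows as priority (reuses the 'relevance_score' column from earlier)
df['priority'] = (df['relevance_score'] > 0.9).astype(int)
df.to_csv('annotated_data.csv', index=False)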
APIs for Specialized Data: Leverage APIs for domain-specific datasets.
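For instance, here is a minimal sketch of pulling records from a hypothetical JSON API with the requests library; the endpoint and parameters below are placeholders, not a real service:
import requests
# Hypothetical endpoint and query parameters; substitute the real API for your domain
url = 'https://api.example.com/v1/articles'
params = {'topic': 'kubernetes', 'limit': 100}
response = requests.get(url, params=params, timeout=30)
response.raise_for_status()
# Assumes the API returns a JSON list of records
records = response.json()
print(f"Fetched {len(records)} records")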
Update Regularly: Keep your dataset fresh to reflect evolving knowledge.
This section will consolidate what we’ve learned and explore a practical example. Suppose you are creating a dataset for a Kubernetes RAG-based chatbot and need to identify effective data sources. A natural starting point might be the Kubernetes Documentation. Documentation is often a valuable dataset foundation, but it can be challenging to extract relevant content while avoiding unnecessary or extraneous data. Remember, the quality of your dataset determines the quality of your results: garbage in, garbage out.
A common approach to extracting content from documentation websites is web scraping (please note: some sites’ terms may prohibit this activity, so review the terms before you scrape). Since most of this content is stored as HTML, tools like BeautifulSoup can help isolate user-visible text from other elements like JavaScript, styling, or comments meant for web designers.
Here’s how you can use BeautifulSoup to extract text data from a webpage:
First, install the necessary Python libraries:
pip install beautifulsoup4 requests
Use the following Python script to fetch and parse the webpage:
from bs4 import BeautifulSoup
import requests
# Define the URL of the target webpage
url = "https://example.com"
# Fetch the webpage content
response = requests.get(url, timeout=30)
# Fail fast on HTTP errors (4xx/5xx)
response.raise_for_status()
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
# Extract and clean data (e.g., all text in paragraph tags)
data = [item.text for item in soup.find_all('p')]
# Print the extracted data
for line in data:
    print(line)
While web scraping can be effective, it often requires significant post-processing to filter out irrelevant elements. Instead of scraping the rendered documentation, consider obtaining the raw source files directly.
For the Kubernetes Documentation, the underlying Markdown files are stored in the Kubernetes website GitHub repository. Markdown files typically provide cleaner, structured content that requires less preprocessing.
To access the Markdown files, clone the GitHub repository to your local machine:
git clone https://github.com/kubernetes/website.git
Once cloned, you can isolate the Markdown files with a couple of Bash commands. For example:
# change directory into the cloned repo
cd ./website
# delete everything but the Markdown files
find . -type f ! -name "*.md" -delete
# delete all the empty directories for completeness
find . -type d -empty -delete
Accessing the source Markdown files offers several advantages: the content is already structured, it is free of HTML, JavaScript, and styling noise, and it requires far less preprocessing before it can be added to your dataset.
By considering the structure and origin of your data sources, you can reduce preprocessing efforts and build a higher-quality dataset. For Kubernetes-related projects, starting with the repository’s Markdown files ensures you’re working with well-organized and more accurate content.
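From here, a minimal sketch of loading those Markdown files into memory for further cleaning and chunking might look like the following; it assumes the repository was cloned into ./website and trimmed with the find commands above:
from pathlib import Path
# Walk the trimmed repository and read every remaining Markdown file
docs = []
for path in Path('./website').rglob('*.md'):
    text = path.read_text(encoding='utf-8', errors='ignore')
    docs.append({'source': str(path), 'text': text})
print(f"Loaded {len(docs)} Markdown documents")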
The quality of your dataset is the foundation of a successful RAG application. By focusing on relevance, accuracy, diversity, balance, and structure, you can help ensure your model performs reliably and meets user expectations. Before you include the data in your dataset, take a step back and think about the different sources to obtain your data and the process you will need to clean that data.
A good analogy to keep in mind is drinking water. If you start with a poor source like the ocean, you may spend a significant amount of time purifying it so that the consumer won’t get ill from drinking it. Conversely, if you research where naturally purified sources exist, like spring water, you may save yourself the labor-intensive task of cleaning the water.
Always remember that building datasets is an iterative process, so don’t hesitate to refine and enhance your data over time. After all, great datasets power great RAG models. Ready to make the leap? Curate your perfect dataset and create your first AI Agent with the GenAI Platform today.
The contents of this article are provided for information purposes only.
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.