Retrieval-Augmented Generation (RAG), an approach to supplementing the knowledge available to a large language model (LLM), is gaining significant momentum in AI engineering. By enabling LLMs to access and incorporate external knowledge sources, RAG produces more accurate results and helps mitigate common issues like AI hallucinations. While multiple third-party services now offer RAG solutions to accelerate LLM application development, and new startups are rapidly entering the space, these services may not meet everyone’s requirements around data parsing, retrieval, and privacy.
In this article, I’ll walk you through building a custom RAG pipeline using LlamaIndex, Llama 3.2, and LlamaParse. We’ll start with a simple example and then explore ways to scale and optimize the setup.
All the code for this article is available in our GitHub repository.
Note: You’ll need to provide your own API keys for the following services (all of which offer generous free plans, sufficient to follow along with this article):
- LlamaCloud (for LlamaParse)
- Groq (for Llama 3.2)
Data:
For this project, we are using the 2025 Hyundai Tucson Hybrid user manual as our source PDF.
First, we need to process and load our data. We’ll use LlamaParse, a document parsing service from LlamaCloud. LlamaParse can handle complex documents, including PDFs, Word documents, and more. It also offers multimodal parsing capabilities, and its output plugs into LlamaIndex vector store integrations such as MongoDB Atlas Vector Search, making it easier to move to a more robust, production-ready implementation later.
from llama_parse import LlamaParse

documents = LlamaParse(result_type="markdown").load_data(
    "./data/2025_Tucson_Hybrid_user_manual.pdf"
)
As you can see, parsing a document is as simple as a single call. The documents variable contains the parsed chunks of the PDF file we loaded.
To make the data usable for similarity search at query time, we need to convert the text into embeddings. The actual embedding happens later, when we build the index; for now, we only load our custom embedding model.
Embeddings are numerical representations of text that capture semantic meaning. They allow the model to understand the context and relevance of different text chunks.
We’ll use the sentence-transformers/all-MiniLM-L6-v2 model from Hugging Face for this purpose:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
By generating embeddings, we enable the system to perform similarity searches between user prompts and the document content.
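To get a feel for what the model produces, you can embed a sample string directly. This is a quick sanity check rather than part of the pipeline; the sentence below is arbitrary, and all-MiniLM-L6-v2 returns 384-dimensional vectors:

# Embed an arbitrary sentence to inspect the resulting vector (illustration only)
sample_embedding = embed_model.get_text_embedding(
    "How do I turn on navigation-based smart cruise control?"
)
print(len(sample_embedding))  # 384 for all-MiniLM-L6-v2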
Next, we’ll choose the LLM that will generate responses based on the retrieved information. We’ll use llama-3.2, accessed via the Groq API:
import os

from llama_index.llms.groq import Groq

# Read the Groq API key from an environment variable
GROQ_API_KEY = os.getenv("GROQ_API_KEY")

llm = Groq(
    model="llama-3.2-3b-preview",
    api_key=GROQ_API_KEY
)
We need to inform LlamaIndex about the LLM and embedding models we’re using:
from llama_index.core import Settings
Settings.llm = llm
Settings.embed_model = embed_model
Note: By default, LlamaIndex uses OpenAI models for LLM and embeddings, so specifying custom models is required if you are using other providers.
With our data parsed, we’re ready to create a vector store index. A vector store index in LlamaIndex organizes the document into embeddings and handles retrieval of relevant information during queries.
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(
    documents,
    show_progress=True
)
query_engine = vector_index.as_query_engine()
By converting the index into a query engine, we can now ask questions that tap into our custom data source:
from IPython.display import Markdown, display

res = query_engine.query(
    "Under what circumstances may the navigation-based smart cruise control not "
    "operate properly for the 2025 Tucson Hybrid?"
)
display(Markdown(res.response))
Without RAG, Llama 3.2 can provide a general answer, but it is limited to what it learned during training. The response might seem relevant, yet it won’t reflect the specifics of this manual. With our custom RAG pipeline, however, the LLM delivers:
The navigation-based Smart Cruise Control may not operate properly
for the 2025 Tucson Hybrid under the following circumstances:
- The navigation is not working properly
- Map information is not transmitted due to infotainment
system's abnormal operation
- Speed limit and road information in the navigation is
not updated
- The map information and the actual road is different
because of real-time GPS data or map information error
- The navigation searches for a route while driving
- GPS signals are blocked in areas such as a tunnel
- A road that divides into two or more roads and joins
again
- The driver goes off course from the route set in
the navigation
- The route to the destination is changed or cancelled by
resetting the navigation
- The vehicle enters a service station or rest area
- Android Auto or Car Play is operating
- The navigation cannot detect the current vehicle position
(for example, elevated roads including overpass adjacent
to general roads or nearby roads exist in a parallel way)
- The navigation is being updated while driving
- The navigation is being restarted while driving
- The speed limit of some sections changes according to
the road situations
- Driving on a road under construction
- Driving in lane-restricted driving situations
- There is bad weather, such as heavy rain, heavy snow,
etc.
- Driving on a road that is sharply curved
This answer comes straight from the manual and is far more accurate than what the base model can produce on its own.
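To see the difference for yourself, send the same question to the model without any retrieved context. Here is a minimal sketch using LlamaIndex’s generic completion API; the ungrounded answer will vary from run to run and from model to model:

# Query the LLM directly, with no retrieved context, for comparison
raw_res = llm.complete(
    "Under what circumstances may the navigation-based smart cruise control not "
    "operate properly for the 2025 Tucson Hybrid?"
)
display(Markdown(raw_res.text))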
So far, our pipeline stores data in volatile memory. For a more robust and efficient approach, it’s better to persist the vector index to disk so we can load it later without re-indexing the documents. This is especially useful for larger datasets or when the application needs to be restarted.
import os

from llama_index.core import (
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_parse import LlamaParse

PERSIST_DIR = "./storage"

if not os.path.exists(PERSIST_DIR):
    # Parse the PDF and build the vector index from scratch
    documents = LlamaParse(
        result_type="markdown"
    ).load_data("./data/2025_Tucson_Hybrid_user_manual.pdf")
    vector_index = VectorStoreIndex.from_documents(documents)
    # Store the vector index on disk
    vector_index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Load the existing vector index from disk
    storage_context = StorageContext.from_defaults(
        persist_dir=PERSIST_DIR
    )
    vector_index = load_index_from_storage(storage_context)

query_engine = vector_index.as_query_engine()
res = query_engine.query(
    "Under what circumstances may the navigation-based smart cruise control not "
    "operate properly for the 2025 Tucson Hybrid?"
)
display(Markdown(res.response))
While this example demonstrates how to create a custom RAG pipeline, a production-ready application requires additional enhancements to improve performance, scalability, and robustness.
Storing embeddings in a dedicated vector store allows for efficient similarity searches and better scalability. Vector databases are optimized for handling high-dimensional data and can manage large volumes of embeddings. Fortunately, LlamaIndex integrates with third-party vector stores such as MongoDB Atlas Vector Search (see the sketch below).
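Here is a minimal sketch of backing the index with MongoDB Atlas. It assumes the llama-index-vector-stores-mongodb package, an Atlas cluster whose connection string is in the MONGODB_URI environment variable, and illustrative database and collection names; exact constructor parameters can differ between versions:

import os

import pymongo
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch

# Connect to the Atlas cluster (MONGODB_URI is assumed to be set)
mongo_client = pymongo.MongoClient(os.getenv("MONGODB_URI"))

# Hypothetical database and collection names for this demo
vector_store = MongoDBAtlasVectorSearch(
    mongo_client,
    db_name="rag_demo",
    collection_name="tucson_manual"
)

# Build the index on top of the external vector store instead of in-memory storage
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)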
Connect a file storage system (e.g., AWS S3) to automate data ingestion. A service can trigger ingestion, parse with LlamaParse, generate embeddings, and update the vector store whenever new files are added, keeping data current without manual steps.
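As a rough illustration of that flow, the sketch below downloads a newly uploaded file from S3, parses it, and inserts the resulting documents into the existing index. The bucket name, object key, and trigger are placeholders; in practice this would run inside a Lambda function or a queue consumer reacting to S3 event notifications:

import boto3
from llama_parse import LlamaParse

def ingest_new_file(bucket: str, key: str) -> None:
    """Download a newly added file from S3, parse it, and add it to the index."""
    local_path = f"/tmp/{key.split('/')[-1]}"
    boto3.client("s3").download_file(bucket, key, local_path)

    # Parse the new file and insert its chunks into the existing vector index
    new_documents = LlamaParse(result_type="markdown").load_data(local_path)
    for doc in new_documents:
        vector_index.insert(doc)

    # Persist the updated index (or let an external vector store handle it)
    vector_index.storage_context.persist(persist_dir=PERSIST_DIR)

# Example trigger with placeholder values:
# ingest_new_file("my-manuals-bucket", "manuals/2026_Tucson_user_manual.pdf")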
We didn't cover how to configure chunk size and node parsing here. However, as explained in LlamaIndex's guide to Building Performant RAG Applications for Production, you can enhance retrieval accuracy by using smaller chunks for embedding and retrieval and larger chunks, or the full context, for synthesis (see the sketch after this list):
Retrieval Chunks: Use smaller, more focused chunks or summaries for embedding. This improves the likelihood of retrieving the most relevant pieces of information.
Synthesis Chunks: Once relevant documents are retrieved, use larger chunks or the full context for the LLM to generate a comprehensive response.
This approach addresses the issue where the optimal chunk size for retrieval may not be the same as that for synthesis, leading to better overall performance.
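One way LlamaIndex supports this decoupling is the sentence-window pattern: embed individual sentences for retrieval, then hand the LLM a wider window of surrounding text for synthesis. A minimal sketch, with an illustrative window size:

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Embed single sentences, but store a window of surrounding sentences as metadata
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # illustrative value
    window_metadata_key="window",
    original_text_metadata_key="original_text"
)
nodes = node_parser.get_nodes_from_documents(documents)
sentence_index = VectorStoreIndex(nodes)

# At query time, swap each retrieved sentence for its larger window before synthesis
query_engine = sentence_index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ]
)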
As the dataset grows, simple vector similarity searches might not be enough. Implementing structured retrieval techniques can improve accuracy:
Metadata Filtering: Tag documents with metadata (e.g., date, author, category) and use these tags to filter search results (see the sketch after this list).
Hierarchical Retrieval: Use summaries of documents or sections to perform initial retrieval before drilling down to specific chunks.
These techniques help narrow the search space and retrieve relevant information more efficiently.
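Here is a minimal sketch of metadata filtering with LlamaIndex, assuming documents were tagged with a hypothetical chapter field at ingestion time (both the key and the value below are placeholders):

from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Restrict retrieval to nodes whose metadata matches the filter
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="chapter", value="driver-assistance")]
)
filtered_query_engine = vector_index.as_query_engine(filters=filters)

res = filtered_query_engine.query(
    "When does navigation-based smart cruise control stop working?"
)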
In this article, we explored how to create a custom RAG pipeline using LlamaIndex, Llama 3.2, and LlamaParse. By incorporating external data sources, we improved our LLM’s ability to deliver precise answers while helping to mitigate common issues like AI hallucinations, all without the need for fine-tuning.
While our example is a good starting point, moving to a production-ready application will require additional steps, such as using vector stores and optimizing retrieval strategies.
Can we help you apply these ideas on your project? Send us a message! You'll get to talk with our awesome delivery team on your very first call.