Do Larger Context Windows Remove the Need for RAG?


Stanley Jovel

Full Stack Developer interested in making lives better through software

Updated Jul 2, 2024

In the rapidly evolving field of AI, the Retrieval-Augmented Generation (RAG) design pattern has emerged as a way to enhance the accuracy and relevance of responses generated by Large Language Models (LLMs). However, newer models like GPT-4o and Gemini 1.5 Pro can handle much larger amounts of information in a single prompt. This raises the question: can we skip RAG by simply adding all the needed context to these larger prompts? And if so, what are the implications for performance and cost?

The code for this article can be found here: https://github.com/generalui/openai-api-benchmark/

Objective and Scope

In this article, I evaluate two approaches to grounding generated content using OpenAI's GPT-4o model. I use company documents to provide context, aiming to turn the model into an expert on the internal processes of GenUI. The primary goal is to ensure that the model's answers are more contextually relevant and factually accurate when prompted about GenUI specifics.

Methodology

Question Selection

A set of 40 questions relevant to GenUI's internal processes was selected from the new employee handbook and onboarding deck.

Method 1: Inline Context

This approach involves appending all relevant context directly into the model's prompt. The steps, sketched in code after the list, are:

  • Parse the PDF files into text and append all content directly into the prompt.
  • Generate answers from this single extensive prompt.
  • Record the generated answers to each of the 40 questions.
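
To make this concrete, here is a minimal sketch of the inline-context method. It assumes the pypdf package for PDF parsing and the official openai Python client; the file names and prompt wording are illustrative, not the exact code from the benchmark repository.

```python
# Minimal sketch of the inline-context method.
# Assumes pypdf and the official openai client; file names are illustrative.
from pypdf import PdfReader
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def pdf_to_text(path: str) -> str:
    """Extract plain text from every page of a PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Concatenate all company documents into a single context block.
context = "\n\n".join(
    pdf_to_text(p) for p in ["employee_handbook.pdf", "onboarding_deck.pdf"]
)

def answer_inline(question: str) -> str:
    """Answer a question by passing the full context in one prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```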

Method 2: RAG (Vector Store Retrieval)

This method uses OpenAI's vector stores to upload custom files and create embeddings of them, embodying the principles of RAG by integrating a retrieval mechanism into the generation process. The steps, sketched in code after the list, are:

  • Upload the handbook and onboarding deck to OpenAI's vector stores and create embeddings.
  • Query the vector stores for each question to retrieve relevant information.
  • Generate answers using the retrieved passages.
  • Record the generated answers to each of the 40 questions.
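
A minimal sketch of this method follows, using the Assistants API file_search tool as it existed in the openai Python SDK at the time of writing; the beta.* paths may differ between SDK versions, and file names are again illustrative.

```python
# Minimal sketch of the vector-store (RAG) method via the Assistants API.
# SDK paths (client.beta.*) reflect openai-python at the time of writing.
from openai import OpenAI

client = OpenAI()

# Upload documents into a vector store; OpenAI chunks and embeds them.
vector_store = client.beta.vector_stores.create(name="genui-docs")
client.beta.vector_stores.file_batches.upload_and_poll(
    vector_store_id=vector_store.id,
    files=[open("employee_handbook.pdf", "rb"),
           open("onboarding_deck.pdf", "rb")],
)

# An assistant with the file_search tool retrieves relevant chunks per query.
assistant = client.beta.assistants.create(
    model="gpt-4o",
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)

def answer_rag(question: str) -> str:
    """Answer a question using retrieved context instead of the full documents."""
    thread = client.beta.threads.create(
        messages=[{"role": "user", "content": question}]
    )
    client.beta.threads.runs.create_and_poll(
        thread_id=thread.id, assistant_id=assistant.id
    )
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    return messages.data[0].content[0].text.value  # newest message first
```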

Data Analysis

To compare the effectiveness of the inline context and vector DB approaches, BLEU and ROUGE scores are used as evaluation metrics. These metrics quantify the accuracy and relevance of responses by comparing generated answers to reference answers.

  • The BLEU score measures precision, indicating how many words in a generated answer also appear in the reference answer.
  • ROUGE scores measure recall, indicating how well a generated answer covers the reference answer.

Together, these metrics quantify how closely each method's answers track the reference answers; a sketch of the scoring code is shown below.
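
As one way to compute these scores, assuming the nltk and rouge-score packages (the repository's notebooks may use different tooling):

```python
# Score a generated answer against a reference answer.
# Assumes the nltk and rouge-score packages.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def score_answer(generated: str, reference: str) -> dict:
    smooth = SmoothingFunction().method1  # avoids zero scores on short texts
    bleu = sentence_bleu(
        [reference.split()], generated.split(), smoothing_function=smooth
    )
    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL"], use_stemmer=True
    )
    rouge = scorer.score(reference, generated)
    # Reporting F1 (fmeasure) here; precision and recall are also available
    # on each Score object.
    return {
        "bleu": bleu,
        "rouge1": rouge["rouge1"].fmeasure,
        "rouge2": rouge["rouge2"].fmeasure,
        "rougeL": rouge["rougeL"].fmeasure,
    }
```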

Results

For an in-depth look at how the results were obtained, you can review the Jupyter notebooks hosted in our research repository.

Performance comparison

Table 1. Performance comparison

Metric    Inline Context  Vector DB
BLEU      0.265685        0.227175
ROUGE-1   0.45494         0.425944
ROUGE-2   0.304204        0.274646
ROUGE-L   0.396724        0.334743

As seen in Table 1, the inline context method outperformed the vector DB method across all metrics, suggesting it produces more accurate and relevant answers. The difference is slight, however, so the two methods are close in overall performance and either could be viable depending on the use case and requirements. Interestingly, while both methods produce correct answers, the vector DB method tends to give more verbose answers, which can depress its BLEU and ROUGE scores. Take question 15, “What are the paid holiday benefits provided by GenUI?”, as an example: both methods focused on the holidays rather than the benefits, with the vector DB method producing the longer answer. A verbose answer is not necessarily a wrong answer (the acceptable level of verbosity is dictated by your business needs), but it is worth noting because it affects the scores.

Figure 1. BLEU score comparison

Figure 2. ROUGE score comparison

Cost comparison

Table 2. Cost comparison

Metric                  Inline Context  Vector DB
Avg. Prompt Tokens      20,274          16,884
Avg. Input Cost         $0.101          $0.084
Avg. Completion Tokens  146             218
Avg. Output Cost        $0.0022         $0.0033
Avg. Total Tokens       20,420          17,102
Avg. Total Cost         $0.103          $0.088

Costs calculated using OpenAI's published API pricing.
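
As a sanity check, Table 2's per-prompt costs can be approximately reproduced from the token counts, assuming the GPT-4o prices in effect at the time of writing ($5 per 1M input tokens, $15 per 1M output tokens):

```python
# Reproduce Table 2's average costs from its token counts, assuming
# GPT-4o pricing at the time of writing: $5/1M input, $15/1M output.
INPUT_PRICE = 5.00 / 1_000_000
OUTPUT_PRICE = 15.00 / 1_000_000

def prompt_cost(prompt_tokens: int, completion_tokens: int) -> float:
    return prompt_tokens * INPUT_PRICE + completion_tokens * OUTPUT_PRICE

print(round(prompt_cost(20274, 146), 3))  # ≈ 0.104, inline context
print(round(prompt_cost(16884, 218), 3))  # ≈ 0.088, vector DB
# The small gap vs. Table 2's $0.103 likely comes from averaging
# per-question costs before rounding.
```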

The cost comparison shows that the inline context method incurs a higher cost because of the large prompt needed to include all relevant context. Interestingly, though, the vector DB method uses far more input tokens than one might expect: the inline method averages only about 20% more. This suggests that at this scale (prompts on the order of 20K tokens), OpenAI's vector stores are still adding about 80% of the total context to each prompt, which does not sound like a compelling benefit if that trend were to hold at larger input token counts.

Conclusion

For use cases where the context is small (within roughly 30K tokens), if the priority is accurate responses, it seems better to append the relevant context directly in a single prompt, though this approach comes with increased API usage costs.

However, the cost efficiency of the vector DB method is still significant, especially for applications where cost is a critical factor: as Table 1 shows, it delivers results nearly as good as the inline context method, at roughly 15% lower total API cost per Table 2.

Future Exploration

This experiment does not show a compelling benefit for OpenAI's vector stores over inline context at this scale, which opens up the possibility of exploring much larger input token thresholds.

If we were to increase the context size to 50K, 100K, or even 150K tokens, would OpenAI's vector stores still retrieve and append 80% of the total context into our prompts, or would retrieval plateau, finally demonstrating the benefit of a vector store in the first place?

In the next step of this research, we will increase the token counts to find out whether a RAG solution built on OpenAI's vector stores pulls ahead, or whether sending all context in a single prompt remains a valid way to get GPT-4o to generate company-specific answers.

How can we help?

Can we help you apply these ideas to your project? Send us a message! You'll get to talk with our awesome delivery team on your very first call.