LLM Word of the Week: RAG
RAG — when LLMs don’t just “know,” they also go look up info to give better answers.
RAG (Retrieval‑Augmented Generation) is a technique where an LLM can fetch relevant external information (from a database, document store, or knowledge source) and use that to generate more accurate, up-to-date responses.
What RAG does (TL;DR)
Instead of relying solely on its internal “memory,” a model with RAG retrieves external documents or facts, then generates a response grounded in that retrieval. This combination reduces hallucination, allows updates without full retraining, and supports domain-specific systems.
Why RAG changes the game
Plain LLMs are limited by their training cutoff and internal weights. With RAG:
- Errors from out-of-date knowledge drop dramatically
- You can plug in your own documents, internal knowledge base, or domain-specific sources
- It scales—you’re not retraining the entire model every time your data changes
Key components & design choices
- Retriever / index: The system that finds relevant docs (vector search, sparse search, hybrid)
- Prompt / generator: The LLM that uses retrieved documents + your question to build an answer
- Re-ranking or filtering: Often you filter or rerank retrieved results so the LLM only sees high-quality evidence
- When to retrieve: Some architectures decide “should I fetch documents or just trust the model?” (adaptive / conditional retrieval)
Trade-offs & challenges
- Blending retrieval and generation introduces latency and complexity
- Retrieval errors or missing docs can still lead to hallucinated output
- If the retriever is weak, the generator may “hallucinate filler”
- Maintaining and indexing large document stores is resource-intensive
Putting it into practice (tips)
- Start with a solid vector search system (FAISS, Pinecone, Weaviate)
- Use prompt engineering to condition the model to cite or incorporate retrieved context
- Monitor hallucination by comparing generated claims against retrieved evidence
- Experiment with adaptive retrieval — only fetch when necessary
- When adding new data, update indexes frequently rather than retraining model
Final thought
RAG bridges memory and reasoning: it lets models say, “I don’t know, but I’ll find the answer.” In doing so, it significantly reduces the burden on models to internalize everything, while giving them grounded context to reason from.
See this notebook for an example implementation of RAG: https://www.kaggle.com/code/princedemo/using-rag-to-improve-prompt-response