Retrieval-Augmented Generation (RAG) is a strategic AI technique that addresses fundamental limitations of Large Language Models (LLMs): their frozen knowledge, their tendency to “hallucinate” (confidently state falsehoods), and their inability to access private company data. RAG is poised to be a significant market, projected to exceed $40 billion by 2035, and is reportedly already used by roughly 80% of enterprises. Its core promise is to give LLMs a reliable external memory and sharply reduce hallucinations, effectively turning an LLM into a “real-time research assistant”.

Here’s how RAG works, from its core mechanism to enterprise-scale deployment:

The Core Mechanism of RAG

RAG operates by enabling an LLM to act like it’s taking an “open-book exam” rather than a “closed-book exam,” allowing it to generate answers grounded in real data. The “magic” of RAG can be broken down into three main steps:

  • Retrieval: This involves searching a designated knowledge base for relevant information.
  • Augmentation: The retrieved facts are then combined with the original user query.
  • Generation: Finally, an LLM uses this combined, augmented information to create an answer that is grounded in the real, retrieved data.
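The three steps above can be sketched end-to-end in a few lines of Python. This is a toy illustration, not a real implementation: `embed` is a stand-in for a real embedding model, the knowledge base is two hard-coded documents, and the generation step just assembles the prompt a real LLM would receive.

```python
import math

def embed(text):
    # Toy embedding: bag-of-words counts over a tiny vocabulary.
    # A real system would call an embedding model (e.g. 1,536-dim vectors).
    vocab = ["refund", "money", "back", "return", "policy", "shipping"]
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine_similarity(a, b):
    # Similar meanings cluster together: high cosine = close in vector space.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

knowledge_base = [
    "refund policy: customers can get money back within 30 days",
    "shipping info: orders ship within 2 business days",
]

def rag_answer(query, top_k=1):
    q_vec = embed(query)
    # 1. Retrieval: nearest neighbors by meaning, not keyword matching.
    ranked = sorted(knowledge_base,
                    key=lambda doc: cosine_similarity(q_vec, embed(doc)),
                    reverse=True)
    context = "\n".join(ranked[:top_k])
    # 2. Augmentation: combine the retrieved facts with the original query.
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    # 3. Generation: a real system would send `prompt` to an LLM here.
    return prompt

print(rag_answer("how do I get my money back"))
```

Note that the query “how do I get my money back” shares no keywords with “refund policy”, yet the refund document is retrieved first, because the vectors overlap on “money” and “back” while the shipping document shares nothing.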

Key Components and Processes

  1. Embeddings:
    • Text is encoded as numbers (vectors) in high-dimensional space.
    • The key insight is that similar meanings will cluster together mathematically in this vector space. For instance, “refund policy” would be embedded as a series of numbers. A common practice is to use 1,536 dimensions for these embeddings.
    • It’s crucial to understand that RAG systems do not perform keyword matching; instead, they look for meaning (cosine similarity) and find the nearest neighbors in vector space. For example, a query like “how do I get my money back” would find “refund processing” with high similarity (e.g., 0.95) and “return policy” with slightly lower but still high similarity (e.g., 0.93), while “shipping info” might have low similarity (e.g., 0.38) and not be retrieved.
  2. Chunking:
    • Large blocks of text are broken down into smaller pieces, or “chunks,” in a way that helps the LLM understand relationships and semantic meaning.
    • Bad chunking can ruin RAG projects.
    • There are four main chunking strategies:
      • Fixed size chunks: These can be dangerous as they might cut off mid-sentence.
      • Sentence-based chunks: These respect sentence boundaries.
      • Semantic chunks: These group text by topic.
      • Recursive chunks: These group by hierarchical structure.
    • It’s important to plan for overlap between chunks to maximize the chances of the AI finding what it needs in complex situations. The chunking strategy should be driven by the desired business outcome.
  3. Data Preparation (The 10 Steps):
    • Good, clean, digital text is essential for a successful RAG system.
    • The process for preparing documents involves these steps:
      1. Convert to text using an appropriate parser.
      2. Split into sections.
      3. Remove boilerplate (e.g., headers, footers). PDFs, for example, often have “terrible header and footer pollution”.
      4. Normalize all whitespace.
      5. Extract section titles.
      6. Add metadata (e.g., source, section, date) to each chunk, as this can dramatically improve retrieval accuracy, especially for recency-based queries.
      7. Chunk with overlap.
      8. Embed the chunks.
      9. Verify samples.
      10. Iterate.
    • Challenges exist with PDFs, scanned documents (due to OCR accuracy issues), and tables (which require special handling to encode spatial relationships).
  4. Advanced Retrieval and Enhancements:
    • Re-ranking: After initial retrieval, the candidate results can be re-scored with a second, more precise pass, significantly boosting accuracy. This is an advanced technique, typically applied only to the top handful of results.
    • Hybrid Search (Level 2): This combines both keyword matching and semantic meaning matching, offering better accuracy and sometimes faster results by directly handling keywords, especially for edge cases.
    • Multi-Modal RAG (Level 3): For more complex scenarios, RAG can be extended to search across text, images, video, and audio. This requires substantial work on data preparation and chunking strategies for different modalities (e.g., using tools like CLIP for image embeddings) and a unified index across them. An example given is Vimeo’s video search with timestamps.
    • Graph RAG: This preserves entity relationships as it encodes information, leading to significantly better retrieval (e.g., LinkedIn’s use of knowledge graphs).
    • Search Deep Dive: A hybrid approach that combines vector-space search with exact matching (e.g., on error codes), potentially using a rank-choice voting method to merge the ranked lists into a single best retrieval answer.
  5. Memory Management:
    • One of the fundamental problems RAG addresses is the AI’s limited memory or context window.
    • RAG systems can effectively function as an advanced memory manager. They can compress and summarize old parts of a conversation, retrieve previous conversation context with a RAG query on the conversation itself, and maintain multiple abstraction levels to ensure key facts from long-running conversations are not forgotten. This can make an AI seem like it has a larger context window even when the underlying model’s working memory is limited.
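The chunking-with-overlap idea from the components above can be sketched as a fixed-size splitter, the simplest of the four strategies (sentence-based, semantic, and recursive chunkers would replace the slicing logic; the sizes here are illustrative):

```python
def chunk_with_overlap(text, chunk_size=200, overlap=50):
    """Split text into fixed-size chunks; each chunk repeats the last
    `overlap` characters of the previous one, so a fact that straddles
    a boundary appears intact in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# 450 characters with step 150 -> chunks starting at offsets 0, 150, 300.
doc = "".join(str(i % 10) for i in range(450))
parts = chunk_with_overlap(doc, chunk_size=200, overlap=50)
print(len(parts))  # → 3
```

This also shows why naive fixed-size chunking is dangerous: the cut points fall at arbitrary character offsets, mid-sentence; the overlap only mitigates that risk, which is why sentence-aware or semantic strategies are usually preferable.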

Building and Scaling a RAG System

  • Simple RAG: It’s relatively easy and inexpensive to get started with a simple RAG system for basic Q&A, especially for internal FAQs or manuals. Tools like LlamaIndex (optimized for RAG) and LangChain (a “Swiss Army knife”) can be used, along with vector databases such as Pinecone, Chroma, and Qdrant. A basic Q&A system (Level 1) can be set up in about a week.
  • Enterprise Scaling: For large-scale enterprise production systems (e.g., handling millions of queries), additional complexities arise:
    • Security and Compliance: This involves deep dives into access control, filtering, PII scrubbing, audit trails, and compliance standards like HIPAA, GDPR, and SOC 2.
    • Performance and Load Handling: Expectations for speed (often sub-second responses) and handling high query loads necessitate sharding vector databases, replicating data, and caching popular queries.
    • Cost Optimization: This includes strategies like cascading models (routing queries to models of different sizes based on complexity) and always using the smallest model that meets quality requirements, potentially saving millions of dollars.
    • Update Pipelines: Essential for ensuring data remains fresh and does not become stale, which would render the RAG system useless.
    • Tracking Embedding Versions: To prevent “gobbledygook” caused by mismatches between the embeddings used for indexing and those used for querying.
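The cascading-models idea can be sketched as a router that sends cheap queries to a small model and escalates the rest. Everything here is illustrative: the model-tier names are placeholders, and the complexity heuristic is a crude stand-in for what would more likely be a trained classifier:

```python
def estimate_complexity(query: str) -> float:
    # Crude heuristic: longer queries with more clauses are "complex".
    # A production system might use a small classifier model instead.
    words = query.split()
    clauses = query.count(",") + query.count(" and ") + 1
    return len(words) / 20 + clauses / 3

def route_query(query: str) -> str:
    """Pick the cheapest model tier that can plausibly handle the query."""
    score = estimate_complexity(query)
    if score < 0.5:
        return "small-model"   # cheapest tier: FAQs, simple lookups
    elif score < 1.5:
        return "medium-model"  # moderate reasoning
    return "large-model"       # multi-step reasoning, highest cost

print(route_query("refund policy?"))  # → small-model
```

Since the bulk of enterprise traffic tends to be simple lookups, routing those to the cheapest tier is where the cost savings come from.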

When RAG Goes Wrong (Illustrating “How RAG Works” by its Failures)

Bad implementations of RAG can actually worsen existing problems or introduce new ones:

  • Memory Problems Worsened: A poorly set up RAG can make it harder for the LLM to retrieve information, causing it to get “lost in the middle”.
  • Hallucinations: These can still occur with RAG, especially with poorly labeled context.
  • Incorrect Vector DB Setup: Can be very expensive.
  • Stale Data: If there’s no update pipeline, data becomes outdated and useless.
  • Security Leaks: Poor implementation can lead to PII exposure and compliance failures.
  • Mismatching Embeddings: Using different embedding models for indexing versus querying can produce “complete gobbledygook”.
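One cheap guard against the embedding-mismatch failure is to record which embedding model (and dimension) built the index and refuse queries made with anything else. A minimal sketch, with illustrative metadata fields and model names:

```python
class VectorIndex:
    """Minimal index that remembers which embedding model built it."""

    def __init__(self, model_name: str, dimension: int):
        self.metadata = {"embedding_model": model_name, "dimension": dimension}
        self.vectors = []

    def add(self, vector, model_name: str):
        self._check(model_name, len(vector))
        self.vectors.append(vector)

    def query(self, vector, model_name: str):
        # Querying with a different embedding model would compare points
        # from incompatible vector spaces and return nonsense.
        self._check(model_name, len(vector))
        return self.vectors  # a real index would do a similarity search here

    def _check(self, model_name: str, dim: int):
        if model_name != self.metadata["embedding_model"]:
            raise ValueError(
                f"index built with {self.metadata['embedding_model']!r}, "
                f"got {model_name!r}")
        if dim != self.metadata["dimension"]:
            raise ValueError("embedding dimension mismatch")

index = VectorIndex("embed-v1", 3)
index.add([0.1, 0.2, 0.3], "embed-v1")  # accepted: same model and dimension
```

Failing loudly at query time is far cheaper than silently serving irrelevant chunks and debugging the hallucinations they cause downstream.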

In essence, RAG addresses the “jagged” nature of LLMs—their knowledge cutoff dates, tendency to hallucinate, and inability to access proprietary data—by providing a structured way to retrieve, augment, and generate responses based on up-to-date, accurate, and relevant information. It allows companies to bridge their internal data with AI models, enabling AI to drive workflows forward.
