By providing it more context – Adding Contextual Retrieval to your RAG Models
Generative AI has reached a level where it excels at many fundamental tasks better than humans.However, as AI continues to evolve, a key question remains: However there are still some limitations in LLMs including a tendency toward hallucinations, challenges with arithmetic, and limited interpretability. So how do we go about improving this further for our own personal as well as enterprise use-cases? One promising approach to address these limitations is Retrieval-Augmented Generation (RAG), particularly multimodal RAG, which enables AI models to interact with external knowledge in various formats, enhancing both the factuality and rationality of the generated content.
Multimodal RAG extends beyond traditional text-only retrieval methods by allowing models to incorporate images, videos, and other data types. Recently, models such as Meta’s Llama 3.2 have demonstrated the ability to effectively combine multimodal inputs for tasks like image-text generation, open-domain question answering, and even code summarization. This article explores these advancements and explores new techniques such as Contextual Retrieval by Anthropic and graph-based methods in this evolving field.
RAG vs. Fine-Tuning: Which One to Choose?
When enhancing the capabilities of language models for specific tasks, two common approaches are Retrieval-Augmented Generation (RAG) and fine-tuning. Both methods have their benefits and limitations, and understanding when to use each is crucial for optimizing AI performance.
Retrieval-Augmented Generation (RAG)
RAG leverages external knowledge sources to enrich the model’s responses in real time. By using retrieval mechanisms that fetch relevant information from a pre-existing knowledge base (such as documents, databases, or multimedia files), RAG enables the model to generate more accurate and contextually relevant outputs without modifying its core architecture. This is especially useful in scenarios where:
- Knowledge changes frequently: RAG allows dynamic updating of the knowledge base, making it ideal for applications like customer support, where information about products or services evolves rapidly.
- Handling multimodal data: Advanced RAG models can process a variety of inputs, including text, images, and even audio, to generate richer, more context-aware responses.
- Avoiding costly retraining: Since RAG works by augmenting the model’s outputs with retrieved data, it avoids the computationally intensive process of fine-tuning, allowing for real-time updates without retraining the model.
Fine-Tuning
Fine-tuning, on the other hand, involves training a pre-existing model on a specific dataset to tailor its responses to a particular domain or use case. During this process, the model’s parameters are adjusted based on the new data, embedding the specialized knowledge directly into the model. This approach is best suited when:
- Highly specialized knowledge is required: Fine-tuning a model on a niche dataset, such as legal documents or medical records, can help the model internalize domain-specific language and nuances.
- Consistency in responses: Since fine-tuning embeds information into the model’s weights, the model consistently generates responses that align with the fine-tuned data.
- Limited need for real-time updates: Fine-tuning is a more static approach, as updating the model’s knowledge base requires retraining. Therefore, it’s more effective in domains where information doesn’t change frequently.
When to Use Each Method
- For dynamic environments: RAG is more suitable due to its ability to pull in fresh, external data. For example, Spotify’s Customer Service AI Assistant uses RAG to handle changing customer queries by accessing an updated knowledge base.
- For static, domain-specific tasks: Fine-tuning is the way to go, especially for specialized applications like legal research or medical diagnostics, where the model benefits from being trained on a carefully curated dataset.
In some cases, a hybrid approach can be beneficial. For instance, an initial fine-tuning phase can give the model a strong domain-specific foundation, while RAG can handle the dynamic, real-time retrieval of new information. This combined strategy can maximize the model’s accuracy and flexibility across a range of applications.
Level One: Enhancing AI with Basic Retrieval, Prompt Caching, and Benchmarking
Before exploring complex retrieval techniques, it’s essential to understand the foundational methods for improving AI responses.
1. Database Prompts: Adding Knowledge Directly into Prompts
The simplest way to augment an AI model’s knowledge is by incorporating a structured database directly into the prompts. This method allows the model to access relevant information embedded within the prompt itself. For smaller knowledge bases—less than 200,000 tokens or about 500 pages of material—directly including the database in the prompt can be sufficient. This approach eliminates the need for complex retrieval mechanisms, enabling quick and accurate responses.
2. Prompt Caching: Optimizing Response Time and Costs
When dealing with frequently used queries, prompt caching becomes an essential optimization technique. This approach involves storing the outputs of commonly used prompts, which significantly reduces both latency and computational costs. Claude recently implemented prompt caching, reducing latency by over 2x and costs by up to 90%. This allows developers to store and reuse processed prompts efficiently, especially beneficial for use cases requiring real-time responses.
3. Benchmarking: Establishing Performance Baselines
Benchmarking is vital for assessing the effectiveness of retrieval and generation in AI models. Before deploying advanced retrieval methods, it’s crucial to establish baselines using datasets like SQuAD, WikiQA, or newer evaluation sets like WebQA and MultimodalQA. Use benchmarks to evaluate how well your AI model performs with direct database prompts and caching.
Level Two: Traditional RAG, Graph-Based Retrieval, and Contextual Retrieval
RAG is commonly used to handle large-scale knowledge bases by preprocessing documents, converting them into vector embeddings, and storing them in a vector database. However, the Contextual Retrieval method introduced by Anthropic enhances traditional RAG by preserving the context of each chunk, mitigating the issue of context loss.
Link to full post – https://www.anthropic.com/news/contextual-retrieval
Contextual Retrieval
Contextual Retrieval improves the retrieval accuracy by appending concise, chunk-specific context to each piece of information before embedding it. This method leverages Contextual Embeddings and Contextual BM25 to reduce retrieval failure rates. The addition of reranking ensures that only the most relevant content is passed to the model, refining the retrieval process.
Technical Implementation:
- Contextual Embeddings: Enhances each text chunk by prepending an explanatory context that situates it within the overall document, improving the retrieval of specific information.
- Contextual BM25: Uses lexical matching to identify exact phrases or technical terms, refining the retrieval process further.
- Combination and Reranking: Combines results from embeddings and BM25, reranking them to maximize retrieval accuracy.
When combined, these techniques significantly improve retrieval accuracy, as Anthropic’s experiments showed a 67% reduction in retrieval failure rates. This method supports use cases like customer support, technical documentation, and complex Q&A systems, highlighting its versatility.
Graph-Based Retrieval in RAG
Graph-Based Retrieval in RAG uses knowledge graphs to capture relationships between entities, facilitating complex information retrieval. This approach incorporates Graph Neural Networks (GNNs) to model dependencies within the graph, providing richer, contextually informed results.
Example: Klarna employs graph-based RAG for automating customer support. By maintaining a knowledge graph that encompasses products, transactions, and interactions, Klarna’s system can retrieve contextually appropriate responses, such as navigating customer queries about a specific product’s return policy.
Level Three: Multimodal Retrieval-Augmented Generation
Multimodal RAG extends traditional retrieval-augmented generation methods by integrating multiple forms of data, such as text, images, audio, and video, to produce more contextually rich and accurate responses. This capability is especially vital for applications like visual question answering, document analysis, and complex customer support interactions. While early models like MuRAG helped establish the foundations of multimodal RAG, newer models such as Llama 3.2, Qwen-VL, and Gemini Pro Vision have significantly advanced the field by setting new benchmarks in processing diverse input formats.
1. Llama 3.2: Benchmark Leader in Multimodal RAG
Llama 3.2, developed by Meta, has become one of the most versatile models for handling both text and visual inputs efficiently. It comes in various sizes, with the larger variants (11B and 90B parameters) optimized for multimodal capabilities.
- Highlights: Llama 3.2 integrates text and visual data to understand complex queries and provide detailed answers. Its advanced vision encoder works seamlessly with text inputs, making it highly effective for tasks such as visual question answering (e.g., interpreting charts, images) and document analysis (extracting and processing information from scanned documents).e.
2. Qwen-VL: A Multimodal Powerhouse
Qwen-VL from Alibaba is another top performer in multimodal RAG, known for its exceptional handling of text and image data in varied contexts.
- Highlights: Qwen-VL employs an adaptive cross-modal attention mechanism, which allows it to dynamically focus on the most relevant parts of the input, whether they are textual or visual. This adaptability makes it highly efficient at analyzing product images, technical manuals, and other complex visual data.It is widely used in e-commerce for product image analysis, customer support for retrieving image-based guides, and education for generating visual content based on text queries. Its scalable retrieval system enables it to efficiently search large databases and identify relevant image-text pairs, enhancing the quality of responses in real-world applications.
Real-World Applications of RAG
Retrieval-Augmented Generation (RAG) has found its way into various industries, transforming how organizations manage knowledge, support customers, and conduct research.
1. Corporate Knowledge Management – Goldman Sachs Developer Assistant
Goldman Sachs has implemented a sophisticated RAG system called the Goldman Sachs Developer Assistant (GS DA) to help developers efficiently navigate over 500 million lines of internal code and documentation. By utilizing a combination of ElasticSearch and dense retrievers, this multimodal RAG system provides developers with examples and guides, cutting down the time required for code understanding and debugging by 50%.
- Technical Implementation: Goldman Sachs employs a pipeline that preprocesses and indexes code snippets and related documentation. They use a mix of sparse (BM25) and dense retrievers to ensure both semantic and lexical matches.
- Key Results: The system has reduced developer troubleshooting time and enhanced overall productivity. According to their 2023 earnings call, Goldman Sachs attributes a notable increase in developer efficiency to this RAG-based assistant.
2. Customer Support Enhancement – Spotify’s Customer Service AI Assistant
Spotify has deployed a Customer Service AI Assistant using RAG to handle over 100,000 customer inquiries daily. This system enhances the support process by retrieving relevant information from an extensive knowledge base using semantic search, which allows it to understand and address the context of customer queries, even when there aren’t direct keyword matches.
- Technical Insights: Spotify’s RAG implementation processes vast amounts of customer data using multimodal inputs like text and images. The system integrates both dense retrieval models for semantic search and reranking techniques to ensure the accuracy of retrieved information.
- Performance Metrics: Spotify has reported a 37% reduction in average response time and a 25% improvement in first-contact resolution rate. Details of this implementation can be found on Spotify’s engineering blog.
3. Healthcare Applications – Mayo Clinic’s AI-Assisted Research
Mayo Clinic has integrated RAG into their medical research process, indexing over 30 million documents to help physicians access relevant research quickly. The system, which also integrates with the Epic electronic health record system, allows physicians to query both text and visual data such as diagnostic images.
- Technical Approach: Mayo Clinic uses a multimodal RAG system that combines text-based research papers and visual medical imaging. Future developments include the incorporation of medical imaging for a more comprehensive retrieval system.
- Results: The RAG system reduced the time physicians spend searching for research by 40% and increased their confidence in research-backed decisions by 28%.
4. Legal Research – Thomson Reuters Westlaw Edge
Thomson Reuters’ Westlaw Edge platform uses RAG to assist lawyers in finding relevant cases, statutes, and legal precedents more efficiently. By indexing over a century’s worth of case law, the system processes over a million legal queries daily and is used by 91 of the top 100 US law firms.
- Technical Details: The platform employs a hybrid retrieval mechanism, using both sparse (e.g., BM25) and dense retrieval models to handle the complexity of legal documents.
- Results: A Stanford Law School study found that attorneys using RAG-enhanced tools were 30% faster in finding relevant precedents and 45% more confident in the comprehensiveness of their research.
5. E-Commerce Personalization – Klarna
Klarna utilizes a graph-based RAG system to manage complex customer interactions, leveraging a knowledge graph that includes products, customer behaviors, and transaction details. This multimodal approach allows the system to recommend products, provide personalized customer support, and even handle returns and refunds.
- Technical Highlights: Klarna’s RAG system integrates image data (product photos) and textual information (product descriptions, reviews) using dense retrieval mechanisms. Knowledge graphs provide a structured context, enhancing the AI’s ability to deliver personalized customer experiences in real time.
- Impact: The system has improved customer satisfaction by tailoring interactions to individual preferences, boosting engagement and conversion rates.
6. Document Management – Dropbox
Dropbox has implemented a multimodal RAG system to manage various file types, such as text documents, images, and PDFs. By converting these files into a searchable structure, Dropbox has streamlined document retrieval, making it easier for users to find the information they need across different formats.
- Implementation Details: Dropbox’s system uses a combination of visual and text retrieval models to index document contents. It can interpret graphical elements within documents, such as tables and charts, making it especially useful for professionals managing large datasets.
- Outcome: Enhanced document search capabilities have significantly improved user productivity by allowing for fast, accurate retrieval of mixed-format data.
Conclusion
In conclusion, Retrieval-Augmented Generation (RAG), particularly its multimodal forms, has become a transformative force in AI, significantly enhancing the accuracy and contextual relevance of generated content. By incorporating methods like contextual retrieval and leveraging advanced models such as Llama 3.2 and Qwen-VL, RAG is now effectively addressing limitations of traditional language models, including hallucinations and context loss. Real-world implementations across industries demonstrate RAG’s versatility and impact. As RAG technologies continue to evolve, integrating more diverse data types and refined retrieval mechanisms, they will further expand AI’s capabilities, paving the way for even more intuitive, context-aware, and intelligent applications.