RAG (updating)
This post records our team's experience with RAG (Retrieval-Augmented Generation) and some optimization ideas; it is still an early preview. Let’s start by discussing why we chose RAG.
Why RAG?
As large models became popular, many companies found that an off-the-shelf LLM could not answer questions about company-specific content. So they built knowledge bases for their large models, using internal corporate background knowledge to give the model more personalized, grounded answers.
However, sending internal data to an external model raises information-leakage concerns. As a result, most companies choose privately deployed models to solve this problem: an enterprise knowledge base on top of a privately deployed LLM.
Without RAG, the LLM itself is the only knowledge source:
| Without RAG | With RAG |
| --- | --- |
| Hallucination | Source data reference |
| Outdated info | Latest info |
| Knowledge blind spot | Covers all search engine information |
| Doesn’t cover my data | Covers my / public data |
RAG Paradigm
Retrieval-Augmented Generation for Large Language Models: A Survey (click me to view the paper)
Using Naive RAG
AKA Retrieve-then-read Architecture
The classic RAG pipeline has three steps: Establish index -> Retrieve -> Combine with the LLM to generate an answer.
- Establish Index
- Generally, this step is performed offline. The document data is cleaned and divided into chunks, and an embedding model is used to generate vectors to create an index (which is then stored in a database)
- Retrieve
- Once the KB is established, the user submits a query; we embed it with the same embedding model and perform similarity matching against the KB, selecting the top-k most similar results as augmentation for the current question.
- Combine LLM
- Based on the above information, generate a prompt and combine it with the LLM to answer the user’s query.
In more detail, the flow looks like this:

```
Local Documents ---> Unstructured Loader ---> Text ---> Text Splitter ---> Text Chunks
```
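To make the three steps concrete, here is a minimal retrieve-then-read sketch. It assumes a local sentence-transformers embedding model; `call_llm` is a hypothetical stand-in for whatever chat-model client you use, not part of any specific library.

```python
# Minimal retrieve-then-read sketch. Assumes sentence-transformers is installed;
# `call_llm` is a hypothetical helper that sends a prompt to your LLM of choice.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Establish index (normally done offline): chunk documents and embed them.
chunks = ["...chunk 1...", "...chunk 2...", "...chunk 3..."]
index = embedder.encode(chunks, normalize_embeddings=True)   # shape (n_chunks, dim)

def retrieve(query: str, k: int = 2) -> list[str]:
    # 2. Retrieve: embed the query and take the top-k most similar chunks.
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = index @ q                       # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

def answer(query: str) -> str:
    # 3. Combine with the LLM: stuff the retrieved chunks into the prompt.
    context = "\n\n".join(retrieve(query))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)                  # hypothetical LLM call
```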
However, such a structure never quite meets expectations and is prone to “garbage in, garbage out.” What’s wrong?
- Queries fail to match the KB effectively. Possible reasons:
- Low-quality queries: whether intentionally or not, users may submit queries full of irrelevant information or fail to highlight their core concern, so the topic drifts.
- Poor chunking or indexing, so the core knowledge cannot be retrieved at query time.
- LLM’s well-known problems:
- When no relevant knowledge (or only low-quality knowledge) is retrieved, the model hallucinates: it generates text on its own, which is useless in a RAG setting.
- The LLM can only restate KB content; it cannot add more insightful or comprehensive information.
Build Advanced RAG
Preprocessing
Storage
PostgreSQL + pgvector
One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors, and then at query time to embed the unstructured query and retrieve the embedding vectors that are ‘most similar’ to the embedded query. A vector store takes care of storing embedded data and performing vector search for you.
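A sketch of what that looks like with PostgreSQL + pgvector, using the `pgvector` Python adapter for psycopg2. The table name, the 1536-dimension size, and the connection string are illustrative assumptions, not our actual schema.

```python
# Sketch of storing and querying embeddings with PostgreSQL + pgvector.
# Assumes the pgvector extension is available in the database and the
# `pgvector` Python package is installed; schema details are illustrative.
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=rag")        # placeholder connection string
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()
register_vector(conn)                        # teach psycopg2 about the vector type

cur.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(1536)
    )
""")

def insert_chunk(content: str, embedding: np.ndarray) -> None:
    cur.execute("INSERT INTO chunks (content, embedding) VALUES (%s, %s)",
                (content, embedding))

def search(query_embedding: np.ndarray, k: int = 5) -> list[str]:
    # `<=>` is pgvector's cosine-distance operator; smaller means more similar.
    cur.execute("SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT %s",
                (query_embedding, k))
    return [row[0] for row in cur.fetchall()]
```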
Parser Augmentation
The success of RAG implementation hinges on whether the document interpretation meets expectations. Only by thoroughly extracting and interpreting the information within documents can a high-performance knowledge base be built to support LLMs. The first step, and arguably the most crucial one, is to optimize the document parsing process.
Let’s take a typical PDF file as an example and compare the strengths and weaknesses of several mainstream PDF parsing approaches:
Direct Text Extraction: PyMuPDF, PyPDF, PDFMiner, etc.
- These common Python PDF loaders are relatively mature, have low latency, and are well suited to fast text extraction from PDFs.
- However, they do not handle complex document structure well (tables, images, layout), so the output may fall short of what RAG document parsing needs.
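A bare-bones example of this first approach with PyMuPDF (the file name is just a placeholder); it is fast, but tables and multi-column layouts come out as flat text:

```python
# Plain text extraction with PyMuPDF: fast and simple, but layout-unaware.
import fitz  # PyMuPDF

with fitz.open("report.pdf") as doc:         # placeholder file name
    text = "\n".join(page.get_text() for page in doc)
print(text[:500])
```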
OCR Conversion and Text Extraction: Tesseract, Textract, etc.
- These tools are relatively mature and can capture document structure and layout.
- However, they have high latency, and the performance varies depending on specific use cases.
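A rough sketch of the OCR path with Tesseract via pytesseract, assuming the tesseract binary and poppler (required by pdf2image) are installed; the file name and DPI are illustrative:

```python
# OCR path: render each PDF page to an image, then run Tesseract on it.
import pytesseract
from pdf2image import convert_from_path

pages = convert_from_path("report.pdf", dpi=300)   # placeholder file name
text = "\n".join(pytesseract.image_to_string(page) for page in pages)
```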
Intelligent Document Parsing: LlamaParse, Unstructured.io, Azure Document Intelligence, etc.
- These tools can parse complex document structures and convert them into Markdown format (or JSON, etc.).
- However, they are relatively immature, have limited scalability, and high latency.
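For instance, with the open-source `unstructured` package (LlamaParse and Azure Document Intelligence have their own SDKs and hosted APIs):

```python
# Intelligent parsing with Unstructured: each element keeps its type
# (Title, NarrativeText, Table, ...), which helps downstream chunking.
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("report.pdf")             # placeholder file name
for el in elements:
    print(el.category, el.text[:80])
```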
Query Translation
The problem is that the chunk containing the answer might not be similar to what the user is asking. The question can be badly written, or expressed differently from what we expect. And if our RAG app can’t find the information needed to answer the question, it won’t answer correctly.
In fact, embedding-based semantic search is hard to get right. Embedding long documents is a challenge. User queries are a challenge too: if a user provides an ambiguous query, they’ll get ambiguous matches.
The LLM just follows whatever is in the context and may hallucinate answers as a result.
Decomposition: If a question contains multiple independent sub-questions, split it and answer each one independently (a minimal sketch follows this list).
- IRCoT (Interleaving Retrieval with Chain-of-Thought)
- Least-to-Most (click me to view the paper): decompose a complex question into simpler sub-questions and solve them in order, feeding earlier answers into later ones
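A minimal decomposition sketch, reusing the hypothetical `call_llm` and `retrieve` helpers from the naive RAG example above (prompt wording is illustrative):

```python
# Decompose a compound question into sub-questions, answer each against the KB,
# then synthesize a final answer. `call_llm` and `retrieve` are hypothetical helpers.
def decompose(question: str) -> list[str]:
    prompt = ("Split the following question into independent sub-questions, "
              f"one per line:\n{question}")
    return [q.strip() for q in call_llm(prompt).splitlines() if q.strip()]

def answer_by_decomposition(question: str) -> str:
    sub_answers = []
    for sub_q in decompose(question):
        context = "\n".join(retrieve(sub_q))
        sub_answers.append(call_llm(f"Context:\n{context}\n\nQuestion: {sub_q}"))
    # Final synthesis over the individual sub-answers.
    return call_llm(f"Question: {question}\nSub-answers:\n" + "\n".join(sub_answers))
```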
Enhancing Queries: If the retrieval method is sensitive to query content, generate multiple versions of the query to retrieve more relevant content (a HyDE sketch follows this list).
- HyDE (click me to view the paper)
- HyDE circumvents the aforementioned learning problem by performing search in a document-only embedding space that captures document-document similarity.
- Step-back prompting
- Few-shot prompting is used to produce a more abstract “step-back” question
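A minimal HyDE sketch under the same assumptions (hypothetical `call_llm`, plus the `embedder`, `chunks`, and `index` from the naive RAG example): instead of embedding the raw query, we embed a hypothetical answer document and search with that.

```python
# HyDE sketch: let the LLM write a hypothetical answer passage, then search
# with the embedding of that passage instead of the raw query.
import numpy as np

def hyde_retrieve(query: str, k: int = 3) -> list[str]:
    hypothetical_doc = call_llm(f"Write a short passage that answers: {query}")
    q = embedder.encode([hypothetical_doc], normalize_embeddings=True)[0]
    scores = index @ q
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]
```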
We can also use a Generative AI model to rewrite the question. This model could be a large model, like (or the same as) the one we use to answer the question in the final step. Or it can also be a smaller model, specially trained to perform this task.
Rewriting Queries:
- Query2doc: Query Expansion with Large Language Models (click me to view the paper): uses LLMs to expand queries
- RAG-Fusion: generates multiple query variants, retrieves for each, and fuses the ranked results (see the sketch after this list)
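A sketch of the RAG-Fusion idea with Reciprocal Rank Fusion, again reusing the hypothetical `call_llm` and `retrieve` helpers (the constant 60 in the RRF score is the common convention):

```python
# RAG-Fusion-style sketch: generate several rewrites of the query, retrieve for
# each, then merge the ranked lists with Reciprocal Rank Fusion (RRF).
from collections import defaultdict

def rag_fusion_retrieve(query: str, n_variants: int = 4, k: int = 5) -> list[str]:
    prompt = (f"Generate {n_variants} different rephrasings of this search query, "
              f"one per line:\n{query}")
    variants = [query] + [v.strip() for v in call_llm(prompt).splitlines() if v.strip()]

    scores: dict[str, float] = defaultdict(float)
    for variant in variants:
        for rank, doc in enumerate(retrieve(variant, k=k)):
            scores[doc] += 1.0 / (60 + rank)   # RRF with the usual k=60 constant
    return sorted(scores, key=scores.get, reverse=True)[:k]
```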
Trainable Rewriter:
- We can fine-tune a pre-trained model to perform the query rewriting task. Instead of relying on examples, we can teach it how query rewriting should be done to achieve the best results in context retrieving. Also, we can further train it using Reinforcement Learning so it can learn to recognize problematic queries and avoid toxic and harmful phrases. Or we can also use an open-source model that has already been trained by somebody else on the task of query rewriting.
(Diagram: query-translation techniques arranged along an abstraction axis, with step-back questions at the “more abstraction” end.)
Indexing
RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval)
Stanford, 31 Jan 2024 (click me to view the paper)
We show state-of-the-art results; for example, by coupling RAPTOR retrieval with the use of GPT-4, we can improve the best performance on the QuALITY benchmark by 20% in absolute accuracy
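A rough sketch of the RAPTOR idea, not the authors’ implementation: cluster chunk embeddings, summarize each cluster with the LLM, recurse, and index every level. The paper uses GMM-based soft clustering with UMAP; plain KMeans here is a simplification, and `call_llm` / `embedder` are the hypothetical helpers from earlier.

```python
# Rough RAPTOR-style tree building: cluster chunk embeddings, summarize each
# cluster with the LLM, and recurse; every node (chunk or summary) gets indexed.
from sklearn.cluster import KMeans

def build_raptor_levels(texts: list[str], max_levels: int = 3) -> list[str]:
    all_nodes = list(texts)
    level = texts
    for _ in range(max_levels):
        if len(level) <= 1:
            break
        embeddings = embedder.encode(level, normalize_embeddings=True)
        n_clusters = max(1, len(level) // 5)          # roughly 5 nodes per cluster
        labels = KMeans(n_clusters=n_clusters).fit_predict(embeddings)
        summaries = []
        for c in range(n_clusters):
            members = [t for t, lbl in zip(level, labels) if lbl == c]
            summaries.append(call_llm("Summarize:\n" + "\n\n".join(members)))
        all_nodes.extend(summaries)
        level = summaries
    return all_nodes   # index all levels for "collapsed tree" retrieval
```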