Efficient and supercharged RAG for mixed context texts with Indexify's framework, Gemini's 1M context & Arctic's embeddings¶
Introduction¶
Retrieval-augmented generation (RAG) systems have emerged as a groundbreaking approach in natural language processing, enabling the generation of accurate and contextually relevant responses by leveraging external knowledge. These systems have the potential to revolutionize various applications, from question answering and content generation to dialogue systems and beyond. However, despite their immense promise, modern RAG systems face a significant challenge when it comes to efficiently processing large mixed context texts.
In this blog post, we delve into the intricacies of this problem and explore how Indexify, an Open Source data framework, has developed an innovative pipeline to overcome these limitations. We will discuss the challenges posed by mixed context texts, the shortcomings of existing chunking algorithms, and how Indexify's solution combines advanced data extraction, intelligent text restructuring, enhanced chunking, and state-of-the-art embedding creation to deliver highly efficient and accurate RAG systems.
The Complexity of Mixed Context Texts¶
Mixed context texts, such as research papers, technical documents, or even web pages, often contain a diverse range of information spanning multiple domains. For instance, a single document might include random sentences from physics, chemistry, biology, and computer science, scattered throughout its content. This heterogeneous nature of the text poses a significant challenge for RAG systems, which rely on identifying and retrieving relevant information to generate accurate responses.
When a user asks a question related to a specific topic, such as the various systems of the human body, RAG systems need to efficiently locate and extract the relevant information from the mixed context text. However, popular chunking algorithms, like LangChain's RecursiveCharacterTextSplitter, struggle to handle such texts effectively.
These algorithms typically create chunks based on a fixed number of characters or tokens, without considering the semantic coherence of the sentences within each chunk. As a result, the generated chunks often contain a lot of unnecessary information from unrelated domains, as sentences from different topics are placed together haphazardly. This leads to a waste of precious tokens when these chunks are passed to subsequent API calls to Large Language Models (LLMs), which are often limited in their context length.
Moreover, if the mixed context text mentions four different systems of the human body at different places, a similarity search with a top-k value of 2 would fetch at most two relevant chunks. This limitation sacrifices the quality of the produced output when these chunks are passed as context to an LLM in subsequent API calls, as the model may not have access to all the necessary information to generate a comprehensive response.
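The topic-mixing problem can be seen with a minimal sketch. The hypothetical four-sentence text below interleaves physics and biology; a fixed-size splitter cuts purely on character count, so unrelated domains land in the same chunk. (Real splitters like RecursiveCharacterTextSplitter also respect separators, but the mixing problem is the same when the source text itself is interleaved.)

```python
# Hypothetical mixed context text interleaving two unrelated domains.
text = (
    "Newton's laws describe motion. "
    "The digestive system breaks down food. "
    "Entropy always increases in a closed system. "
    "The nervous system transmits electrical signals."
)

def naive_chunks(text: str, chunk_size: int):
    """Cut the text every chunk_size characters, ignoring topic boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

for chunk in naive_chunks(text, 60):
    print(repr(chunk))
```

The very first chunk ends up containing both the physics sentence and half of a biology sentence, wasting tokens on whichever domain the query is not about.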
LLM Assisted Restructuring for RAG (LLMARRAG) Pipeline¶
At Indexify, we recognized the need for a more efficient and accurate approach to processing mixed context texts in RAG systems. Our team has developed an innovative pipeline that combines cutting-edge technologies and techniques to overcome the limitations of existing solutions.
Step 1: Data Extraction with Robust Extractors¶
The first step in Indexify's pipeline is to extract data, such as text, from various sources like PDF files and other documents. We understand that unstructured data poses a significant challenge, which is why we have developed a fast real-time extraction engine and a collection of robust pre-built extractors.
One notable integration in our pipeline is Vik Paruchuri's Marker, a powerful tool for extracting structured data from unstructured sources. By leveraging Marker, we ensure that we can comprehensively extract text data from a wide range of documents, providing a solid foundation for the subsequent steps in our pipeline.
Step 2: Intelligent Text Restructuring with Gemini 1.5 Flash LLM¶
Once the text data is extracted, the next crucial step is to restructure it in a way that facilitates efficient processing and retrieval. This is where Indexify's pipeline truly shines, as we leverage Google's state-of-the-art Gemini 1.5 Flash LLM, which was recently unveiled at Google I/O 2024.
The Gemini 1.5 Flash LLM boasts an impressive 1M-token context window, making it exceptionally well-suited for processing large mixed context texts. By harnessing the power of this advanced language model, we can intelligently restructure the entire text of a PDF or other document, grouping sentences from similar topics together.
This semantic restructuring is a game-changer, as it ensures that related information is placed in close proximity, creating topic-coherent segments within the text. By bringing together sentences that discuss the same subject matter, we lay the groundwork for more accurate and efficient chunking in the subsequent steps of our pipeline.
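To make the goal of this step concrete, here is a toy stand-in for the restructuring. The real pipeline delegates this to Gemini 1.5 Flash with a natural-language prompt; the keyword matching below is only an illustration of the desired output shape, with topic keywords made up for this example.

```python
# Toy illustration of the restructuring step's goal: sentences from the
# same topic end up adjacent. The pipeline uses Gemini 1.5 Flash for this;
# keyword matching here is just a stand-in.
sentences = [
    "The nervous system transmits signals between body parts.",
    "Entropy tends to increase in an isolated system.",
    "The digestive system breaks food down into nutrients.",
    "Newton's laws describe the motion of objects.",
]

# Hypothetical topic keywords, chosen purely for this example.
TOPICS = {
    "biology": ("nervous", "digestive"),
    "physics": ("entropy", "newton"),
}

def restructure(sentences):
    """Regroup sentences so that same-topic sentences sit next to each other."""
    groups = {topic: [] for topic in TOPICS}
    for sentence in sentences:
        for topic, keywords in TOPICS.items():
            if any(k in sentence.lower() for k in keywords):
                groups[topic].append(sentence)
                break
    return [s for topic in groups for s in groups[topic]]

for sentence in restructure(sentences):
    print(sentence)
```

After restructuring, the two biology sentences are adjacent and the two physics sentences are adjacent, so a downstream character-based splitter naturally produces topic-coherent chunks.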
Step 3: Enhanced Chunking with RecursiveCharacterTextSplitter¶
With the text restructured into topic-coherent segments, Indexify's pipeline proceeds to perform chunking using the RecursiveCharacterTextSplitter algorithm. This splitter works down a hierarchy of separators (paragraphs, then sentences, then words) to create meaningful chunks that stay under a specified maximum chunk size.
Thanks to the intelligent restructuring performed in the previous step, the RecursiveCharacterTextSplitter can now generate chunks that are more information-dense and focused on specific domains. This enhanced chunking process greatly improves the efficiency of RAG systems by providing chunks that are highly relevant to the question at hand.
By eliminating the inclusion of unnecessary information from unrelated domains, our pipeline saves valuable tokens in subsequent API calls to LLMs. This optimization ensures that the LLMs receive only the most pertinent information, enabling them to generate accurate and contextually relevant responses without wasting computational resources on irrelevant data.
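The two parameters that govern this step, chunk_size and overlap (set to 1000 and 100 in the extraction graph later in this post), are easiest to see in a simplified sliding-window form. Note that RecursiveCharacterTextSplitter additionally prefers splitting at separator boundaries; this sketch ignores separators for brevity.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 100):
    """Return windows of at most chunk_size characters, each sharing
    `overlap` characters with the previous window so that context is
    not cut off mid-thought at chunk boundaries."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_with_overlap("x" * 2500)
print([len(c) for c in chunks])  # → [1000, 1000, 700]
```

Each chunk after the first repeats the last 100 characters of its predecessor, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.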
Step 4: Embedding Creation with Snowflake's Arctic Model¶
The final step in Indexify's pipeline is the creation of embeddings using Snowflake's Arctic embedding model. Embeddings are critical for enabling efficient similarity search and retrieval of relevant information from the chunked text.
Snowflake's Arctic model is a state-of-the-art embedding model that captures the semantic meaning of text chunks with remarkable accuracy. By representing each chunk as a high-dimensional vector, the Arctic model allows for fast and precise similarity comparisons between the query and the available chunks.
Indexify's pipeline leverages the power of the Arctic model to create high-quality embeddings for each chunk generated in the previous step. These embeddings serve as the basis for retrieving the most relevant chunks when a user poses a question to the RAG system.
By utilizing Snowflake's Arctic model, Indexify ensures that the RAG system can effectively identify and retrieve the chunks that are most pertinent to the given query. This enhances the accuracy of the generated responses and greatly improves the overall performance of the RAG system.
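Retrieval over these embeddings boils down to ranking chunks by vector similarity to the embedded query. The sketch below illustrates this with cosine similarity over made-up 3-dimensional vectors; Arctic embeddings are of course much higher-dimensional, and Indexify handles this search internally via its vector index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for Arctic embeddings of a query and two chunks.
query = [0.9, 0.1, 0.0]
chunk_vectors = {
    "organ systems chunk": [0.8, 0.2, 0.1],
    "physics chunk": [0.1, 0.1, 0.9],
}

# Rank chunks by similarity to the query (top-k retrieval takes the first k).
ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query, chunk_vectors[name]),
    reverse=True,
)
print(ranked[0])  # → organ systems chunk
```

Because the restructured chunks are topic-coherent, their embeddings cluster cleanly by subject, which is exactly what makes this ranking step reliable.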
Creating LLMARRAG Pipeline is Simple with Indexify¶
Install Indexify, Start the Server & Download the Extractors¶
!pip install -q -U indexify indexify-extractor-sdk
Note: you may need to restart the kernel to use updated packages.
curl https://getindexify.ai | sh
./indexify server -d
!indexify-extractor download hub://pdf/marker
!indexify-extractor download hub://text/llm
!indexify-extractor download hub://text/chunking
!indexify-extractor download hub://embedding/arctic
!indexify-extractor join-server
Create a Client, Define Extraction Graph & Ingest Contents¶
from indexify import IndexifyClient
client = IndexifyClient()
from indexify import ExtractionGraph
extraction_graph_spec = """
name: 'llmarrag'
extraction_policies:
  - extractor: 'tensorlake/marker'
    name: 'mdextractor'
  - extractor: 'tensorlake/llm'
    name: 'txtprocessor'
    input_params:
      service: 'gemini'
      prompt: 'Rearrange and rewrite the following text by grouping similar topics together while preserving the original sentences.'
    content_source: 'mdextractor'
  - extractor: 'tensorlake/chunk-extractor'
    name: 'chunker'
    input_params:
      chunk_size: 1000
      overlap: 100
    content_source: 'txtprocessor'
  - extractor: 'tensorlake/arctic'
    name: 'embedder'
    content_source: 'chunker'
"""
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
client.upload_file("llmarrag", "random_topics.pdf")
'29e347f7f00d02ad'
The Impact of LLMARRAG Pipeline¶
Indexify's innovative pipeline has the potential to revolutionize the way RAG systems process and utilize mixed context texts. By addressing the limitations of existing approaches and leveraging cutting-edge technologies, our solution offers several key benefits:
Efficient Processing: By intelligently restructuring the text and performing enhanced chunking, Indexify's pipeline enables RAG systems to process mixed context texts efficiently. The topic-coherent segments and information-dense chunks minimize the inclusion of irrelevant information, saving valuable computational resources and reducing token waste in subsequent API calls to LLMs.
Improved Accuracy: The combination of intelligent text restructuring, enhanced chunking, and state-of-the-art embedding creation using Snowflake's Arctic model significantly improves the accuracy of RAG systems. By retrieving the most relevant chunks for a given query, the LLMs can generate responses that are more contextually appropriate and precise, enhancing the overall quality of the generated output.
Scalability: Indexify's pipeline is designed to handle large mixed context texts effectively. With the Gemini 1.5 Flash LLM's 1M-token context length and the efficient chunking algorithm, our solution can scale to process extensive documents and datasets, making it suitable for a wide range of applications and domains.
Flexibility and Customization: As an Open Source data framework, Indexify provides users with the flexibility to customize and extend the pipeline according to their specific needs. Researchers and developers can leverage our robust extractors, integrate their own models and algorithms, and adapt the pipeline to suit their particular use cases, fostering innovation and collaboration within the community.
Benchmarking¶
| Metrics | Scores |
|---|---|
| Number of chunks required by traditional RAG to include all relevant information | 5 |
| Number of chunks required by LLMARRAG to include all relevant information | 2 |
| Number of tokens required by traditional RAG to include all relevant information | 651 |
| Number of tokens required by LLMARRAG to include all relevant information | 255 |
| Number of irrelevant tokens in the most relevant chunk by traditional RAG | 100 |
| Number of irrelevant tokens in the most relevant chunk by LLMARRAG | 6 |
| Difference in similarity score between the least relevant and irrelevant chunks by traditional RAG | 5.59% |
| Difference in similarity score between the least relevant and irrelevant chunks by LLMARRAG | 12.81% |
Performing RAG with LLMARRAG and OpenAI¶
def get_context(question: str, index: str, top_k=2):
    results = client.search_index(name=index, query=question, top_k=top_k)
    context = ""
    for result in results:
        context += f"content id: {result['content_id']} \n\n passage: {result['text']}\n"
    return context
question = "Write an essay on the various organ systems like nervous system, digestive system."
context = get_context(question, "llmarrag.embedder.embedding")
context
'content id: 1af5ba1bb5955335 \n\n passage: **Organ Systems:**\n\n* **The endocrine system** is a network of glands that secrete hormones directly into the bloodstream, regulating various physiological processes such as growth, metabolism, and reproduction.\n* **The immune system** is a complex network of cells, tissues, and organs that work together to defend the body against infectious agents and other harmful substances.\n* **The nervous system** is a highly organized network of cells that transmits signals between different parts of the body, enabling communication, coordination, and response to stimuli.\n* **The circulatory system** is a network of blood vessels and the heart that transports oxygen, nutrients, and other essential substances throughout the body, while also removing waste products.\n* **The digestive system** is a series of organs that work together to break down food into nutrients that can be absorbed and used by the body for growth, repair, and energy production. \n\n## Computer Science and Related Fields:\ncontent id: fb5215bceaf1bb2a \n\n passage: **Genetics and Evolution:**\n\n* **DNA (deoxyribonucleic acid)** is the genetic material that carries the instructions for the development, functioning, and reproduction of all known living organisms.\n* **Genetics** is the study of heredity and the variation of inherited characteristics, based on the principles of Mendelian inheritance and the molecular structure of DNA.\n* **The theory of evolution by natural selection**, proposed by Charles Darwin, explains how species change over time through the inheritance of beneficial traits that increase their fitness in a given environment.\n\n**Organ Systems:**\n'
def create_prompt(question, context):
    return f"Answer the question, based on the context.\n question: {question} \n context: {context}"
prompt = create_prompt(question, context)
from openai import OpenAI
client_openai = OpenAI()
chat_completion = client_openai.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-3.5-turbo",
)
print(chat_completion.choices[0].message.content)
Based on the context provided, the essay on various organ systems like the nervous system and the digestive system would include information about the endocrine system, the immune system, the nervous system, the circulatory system, and the digestive system. The nervous system is described as a network of cells that transmit signals between different parts of the body, enabling communication and response to stimuli. The digestive system is highlighted as a series of organs that work together to break down food into nutrients for growth, repair, and energy production.
Conclusion¶
Indexify's innovative pipeline presents a comprehensive solution to the challenges faced by modern RAG systems when processing mixed context texts. By combining advanced data extraction, intelligent text restructuring using Google's Gemini 1.5 Flash LLM, enhanced chunking with RecursiveCharacterTextSplitter, and embedding creation using Snowflake's Arctic model, our approach enables efficient and accurate retrieval-augmented generation.
The pipeline's ability to handle large mixed context texts, reduce token wastage, and improve the accuracy of generated responses makes it a valuable tool for organizations and researchers seeking to unlock the full potential of RAG systems. With Indexify's Open Source data framework and robust extractors, users can easily integrate our solution into their existing workflows and benefit from its powerful capabilities.
As the field of natural language processing continues to evolve, Indexify remains committed to driving innovation and pushing the boundaries of what is possible with RAG systems. We believe that our pipeline represents a significant step forward in enabling the efficient processing of complex, unstructured data and generating high-quality outputs that meet the diverse needs of users.
We invite researchers, developers, and organizations to explore Indexify's pipeline, contribute to its development, and join us in shaping the future of retrieval-augmented generation. Together, we can unlock the immense potential of RAG systems and revolutionize the way we interact with and derive insights from mixed context texts.