
From paper to source code: a detailed explanation of the RAG algorithm

Weaxs

Introduction

So far, the main practical scenario for large models is still RAG. Compared with last year's approach of simply slicing documents, storing them, and querying them, this year has seen many innovative algorithms and methods that markedly improve RAG's effectiveness, such as Microsoft's open-source GraphRAG, Contextual Retrieval proposed by the Anthropic team, and LightRAG proposed by the University of Hong Kong.

This article summarizes these RAG methods and walks through their design principles and concrete implementations from the perspective of both the papers and the source code.

GraphRAG

Overview

GraphRAG is well known, so let's start with the pain point that motivated it. It mainly addresses the case where a user asks a relatively broad or global question that no single retrieved document (or handful of documents) can answer. For example, in the stock domain a user may ask about an entire industry sector, or in a legal scenario about a whole class of similar cases. Such questions may involve far more than a few documents. With traditional RAG, the only lever is to retrieve more matching document segments to give the large model more knowledge, but that runs into another problem: the context-length limit of the large model (although this limit has been relaxed considerably at the base-model level, with Claude 3.5 supporting 200k tokens and most open-source models now offering 128k-context variants).

The solution GraphRAG proposes is to construct a knowledge base in which documents are organized into a structured knowledge graph: entities and relationships are extracted from the documents, related nodes are clustered into communities, and each community is summarized, so that global questions can be answered from these summaries rather than from individual chunks.

Index construction

GraphRAG divides this part into 5 main workflows:

  1. Document workflow: parses and embeds the raw documents; note that this does not include slicing
  2. Chunk workflow: slices (chunks) the raw documents, converts each chunk into a Text Unit object, and then embeds the Text Units
  3. Graph workflow: parses the content of each chunk, extracting entity names, types, descriptions, and the relationships between entities, and converts them into element objects; at this point the graph takes its initial shape. The descriptions are then re-summarized based on the parsed feature elements. Finally, all nodes in the graph are clustered into groups, and multiple groups form a community; the end result is a multi-level graph of communities
  4. Community workflow: with the graph communities constructed, this step has the large model summarize the communities and the groups within them to generate descriptive reports
  5. Covariate workflow: this final step is optional; it extracts covariates (e.g., claims) from the Text Unit objects produced in step 2. This article will not focus on it

[Figure: GraphRAG index-construction pipeline]

## https://github.com/microsoft/graphrag/blob/main/graphrag/index/create_pipeline_config.py

def create_pipeline_config(settings: GraphRagConfig, verbose=False) -> PipelineConfig:
    ...

    skip_workflows = settings.skip_workflows
    embedded_fields = _get_embedded_fields(settings)
    covariates_enabled = (
        settings.claim_extraction.enabled
        and create_final_covariates not in skip_workflows
    )

    result = PipelineConfig(
        root_dir=settings.root_dir,
        input=_get_pipeline_input_config(settings),
        reporting=_get_reporting_config(settings),
        storage=_get_storage_config(settings, settings.storage),
        update_index_storage=_get_storage_config(
            settings, settings.update_index_storage
        ),
        cache=_get_cache_config(settings),
        ## workflow execution order
        workflows=[
           ## 1. create_final_documents
            *_document_workflows(settings, embedded_fields),
            
            ## 1. create_base_text_units
            ## 2. create_final_text_units
            *_text_unit_workflows(settings, covariates_enabled, embedded_fields),
            
            ## 1. create_base_entity_graph
            ## 2. create_final_entities
            ## 3. create_final_relationships
            ## 4. create_final_nodes
            *_graph_workflows(settings, embedded_fields),
            
            ## 1. create_final_communities
            ## 2. create_final_community_reports
            *_community_workflows(settings, covariates_enabled, embedded_fields),
            
            ## 1. create_final_covariates
            *(_covariate_workflows(settings) if covariates_enabled else []),
        ],
    )

    ## Remove any workflows that were specified to be skipped
    log.info("skipping workflows %s", ",".join(skip_workflows))
    ## Filter out the skipped workflows; only the remaining ones will run
    result.workflows = [w for w in result.workflows if w.name not in skip_workflows]
    return result

Source Document → Text Chunks

This step mainly involves splitting the document into chunks. GraphRAG officially provides two methods:

  1. Token-based splitting. The implementation largely follows langchain's text_splitter: chunks are produced according to the chunk_size and chunk_overlap parameters, and tokens are counted with the cl100k_base encoding
  2. Sentence-based splitting, using the Natural Language Toolkit's nltk.tokenize.sent_tokenize(text, language='english')

The default is token-based splitting. Pay attention here to the chunk size, i.e., tokens_per_chunk: if each chunk is too small, generating the index requires more model calls and therefore takes longer and costs more; but if each chunk is too large, it may hurt recall and accuracy.

The core code for this part can be found in graphrag/index/operations/chunk_text/strategies.py

## https://github.com/microsoft/graphrag/blob/main/graphrag/index/operations/chunk_text/strategies.py

## Slice by token
def run_tokens(
    input: list[str], args: dict[str, Any], tick: ProgressTicker
) -> Iterable[TextChunk]:
    """Chunks text into chunks based on encoding tokens."""
    tokens_per_chunk = args.get("chunk_size", defs.CHUNK_SIZE)
    chunk_overlap = args.get("chunk_overlap", defs.CHUNK_OVERLAP)
    encoding_name = args.get("encoding_name", defs.ENCODING_MODEL)
    enc = tiktoken.get_encoding(encoding_name)

    def encode(text: str) -> list[int]:
        if not isinstance(text, str):
            text = f"{text}"
        return enc.encode(text)

    def decode(tokens: list[int]) -> str:
        return enc.decode(tokens)

    return _split_text_on_tokens(
        input,
        Tokenizer(
            chunk_overlap=chunk_overlap,
            tokens_per_chunk=tokens_per_chunk,
            encode=encode,
            decode=decode,
        ),
        tick,
    )

## Adapted from - https://github.com/langchain-ai/langchain/blob/77b359edf5df0d37ef0d539f678cf64f5557cb54/libs/langchain/langchain/text_splitter.py#L471
## So we could have better control over the chunking process
def _split_text_on_tokens(
    texts: list[str], enc: Tokenizer, tick: ProgressTicker
) -> list[TextChunk]:
    """Split incoming text and return chunks."""
    result = []
    mapped_ids = []

    for source_doc_idx, text in enumerate(texts):
        encoded = enc.encode(text)
        tick(1)
        mapped_ids.append((source_doc_idx, encoded))

    input_ids: list[tuple[int, int]] = [
        (source_doc_idx, id) for source_doc_idx, ids in mapped_ids for id in ids
    ]

    start_idx = 0
    cur_idx = min(start_idx + enc.tokens_per_chunk, len(input_ids))
    chunk_ids = input_ids[start_idx:cur_idx]
    while start_idx < len(input_ids):
        chunk_text = enc.decode([id for _, id in chunk_ids])
        doc_indices = list({doc_idx for doc_idx, _ in chunk_ids})
        result.append(
            TextChunk(
                text_chunk=chunk_text,
                source_doc_indices=doc_indices,
                n_tokens=len(chunk_ids),
            )
        )
        start_idx += enc.tokens_per_chunk - enc.chunk_overlap
        cur_idx = min(start_idx + enc.tokens_per_chunk, len(input_ids))
        chunk_ids = input_ids[start_idx:cur_idx]

    return result
   
## Slice by sentence
def run_sentences(
    input: list[str], _args: dict[str, Any], tick: ProgressTicker
) -> Iterable[TextChunk]:
    """Chunks text into multiple parts by sentence."""
    for doc_idx, text in enumerate(input):
        sentences = nltk.sent_tokenize(text)
        for sentence in sentences:
            yield TextChunk(
                text_chunk=sentence,
                source_doc_indices=[doc_idx],
            )
        tick(1)

Text Chunks → Element Instances

This part extracts the characteristic elements in each chunk and constructs element instances from them. These elements include names, types, descriptions, etc., so that relationships between entities can be established later. Again, GraphRAG provides 2 methods:

  1. graph intelligence: extraction using a large model, which is where we will focus our attention
  2. nltk: this method uses the nltk.chunk.ne_chunk(tagged_tokens, binary=False) function from the Natural Language Toolkit package

Here we look specifically at method 1, mainly at GraphRAG's extraction prompt:

GRAPH_EXTRACTION_PROMPT = """
-Goal-
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.
 
-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)
 
2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
 Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_strength>)
 
3. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.
 
4. When finished, output {completion_delimiter}
 
######################
-Examples-
######################
Example 1:
Entity_types: ORGANIZATION,PERSON
Text:
The Verdantis's Central Institution is scheduled to meet on Monday and Thursday, with the institution planning to release its latest policy decision on Thursday at 1:30 p.m. PDT, followed by a press conference where Central Institution Chair Martin Smith will take questions. Investors expect the Market Strategy Committee to hold its benchmark interest rate steady in a range of 3.5%-3.75%.
######################
Output:
("entity"{tuple_delimiter}CENTRAL INSTITUTION{tuple_delimiter}ORGANIZATION{tuple_delimiter}The Central Institution is the Federal Reserve of Verdantis, which is setting interest rates on Monday and Thursday)
{record_delimiter}
("entity"{tuple_delimiter}MARTIN SMITH{tuple_delimiter}PERSON{tuple_delimiter}Martin Smith is the chair of the Central Institution)
{record_delimiter}
("entity"{tuple_delimiter}MARKET STRATEGY COMMITTEE{tuple_delimiter}ORGANIZATION{tuple_delimiter}The Central Institution committee makes key decisions about interest rates and the growth of Verdantis's money supply)
{record_delimiter}
("relationship"{tuple_delimiter}MARTIN SMITH{tuple_delimiter}CENTRAL INSTITUTION{tuple_delimiter}Martin Smith is the Chair of the Central Institution and will answer questions at a press conference{tuple_delimiter}9)
{completion_delimiter}

######################
Example 2:
Entity_types: ORGANIZATION
Text:
TechGlobal's (TG) stock skyrocketed in its opening day on the Global Exchange Thursday. But IPO experts warn that the semiconductor corporation's debut on the public markets isn't indicative of how other newly listed companies may perform.

TechGlobal, a formerly public company, was taken private by Vision Holdings in 2014. The well-established chip designer says it powers 85% of premium smartphones.
######################
Output:
("entity"{tuple_delimiter}TECHGLOBAL{tuple_delimiter}ORGANIZATION{tuple_delimiter}TechGlobal is a stock now listed on the Global Exchange which powers 85% of premium smartphones)
{record_delimiter}
("entity"{tuple_delimiter}VISION HOLDINGS{tuple_delimiter}ORGANIZATION{tuple_delimiter}Vision Holdings is a firm that previously owned TechGlobal)
{record_delimiter}
("relationship"{tuple_delimiter}TECHGLOBAL{tuple_delimiter}VISION HOLDINGS{tuple_delimiter}Vision Holdings formerly owned TechGlobal from 2014 until present{tuple_delimiter}5)
{completion_delimiter}

######################
Example 3:
Entity_types: ORGANIZATION,GEO,PERSON
Text:
Five Aurelians jailed for 8 years in Firuzabad and widely regarded as hostages are on their way home to Aurelia.

The swap orchestrated by Quintara was finalized when $8bn of Firuzi funds were transferred to financial institutions in Krohaara, the capital of Quintara.

The exchange initiated in Firuzabad's capital, Tiruzia, led to the four men and one woman, who are also Firuzi nationals, boarding a chartered flight to Krohaara.

They were welcomed by senior Aurelian officials and are now on their way to Aurelia's capital, Cashion.

The Aurelians include 39-year-old businessman Samuel Namara, who has been held in Tiruzia's Alhamia Prison, as well as journalist Durke Bataglani, 59, and environmentalist Meggie Tazbah, 53, who also holds Bratinas nationality.
######################
Output:
("entity"{tuple_delimiter}FIRUZABAD{tuple_delimiter}GEO{tuple_delimiter}Firuzabad held Aurelians as hostages)
{record_delimiter}
("entity"{tuple_delimiter}AURELIA{tuple_delimiter}GEO{tuple_delimiter}Country seeking to release hostages)
{record_delimiter}
("entity"{tuple_delimiter}QUINTARA{tuple_delimiter}GEO{tuple_delimiter}Country that negotiated a swap of money in exchange for hostages)
{record_delimiter}
("entity"{tuple_delimiter}TIRUZIA{tuple_delimiter}GEO{tuple_delimiter}Capital of Firuzabad where the Aurelians were being held)
{record_delimiter}
("entity"{tuple_delimiter}KROHAARA{tuple_delimiter}GEO{tuple_delimiter}Capital city in Quintara)
{record_delimiter}
("entity"{tuple_delimiter}CASHION{tuple_delimiter}GEO{tuple_delimiter}Capital city in Aurelia)
{record_delimiter}
("entity"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}PERSON{tuple_delimiter}Aurelian who spent time in Tiruzia's Alhamia Prison)
{record_delimiter}
("entity"{tuple_delimiter}ALHAMIA PRISON{tuple_delimiter}GEO{tuple_delimiter}Prison in Tiruzia)
{record_delimiter}
("entity"{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}PERSON{tuple_delimiter}Aurelian journalist who was held hostage)
{record_delimiter}
("entity"{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}PERSON{tuple_delimiter}Bratinas national and environmentalist who was held hostage)
{record_delimiter}
("relationship"{tuple_delimiter}FIRUZABAD{tuple_delimiter}AURELIA{tuple_delimiter}Firuzabad negotiated a hostage exchange with Aurelia{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}QUINTARA{tuple_delimiter}AURELIA{tuple_delimiter}Quintara brokered the hostage exchange between Firuzabad and Aurelia{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}QUINTARA{tuple_delimiter}FIRUZABAD{tuple_delimiter}Quintara brokered the hostage exchange between Firuzabad and Aurelia{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}ALHAMIA PRISON{tuple_delimiter}Samuel Namara was a prisoner at Alhamia prison{tuple_delimiter}8)
{record_delimiter}
("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}Samuel Namara and Meggie Tazbah were exchanged in the same hostage release{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}Samuel Namara and Durke Bataglani were exchanged in the same hostage release{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}Meggie Tazbah and Durke Bataglani were exchanged in the same hostage release{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}FIRUZABAD{tuple_delimiter}Samuel Namara was a hostage in Firuzabad{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}FIRUZABAD{tuple_delimiter}Meggie Tazbah was a hostage in Firuzabad{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}FIRUZABAD{tuple_delimiter}Durke Bataglani was a hostage in Firuzabad{tuple_delimiter}2)
{completion_delimiter}

######################
-Real Data-
######################
Entity_types: {entity_types}
Text: {input_text}
######################
Output:"""

The result of interacting with the large model is finally converted into a NetworkX graph, i.e., a weighted undirected graph (a parsing sketch follows the list below). At this point the graph has two main parts:

  • node: the nodes are the extracted feature entities themselves, recording the source document, name, type, and description
  • edge: the edges record the relationship between nodes and its weight
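
As a sketch of that conversion, the delimiter-separated records produced by the extraction prompt can be parsed into a weighted undirected networkx graph roughly as follows. The parser and the delimiter values are illustrative assumptions, not GraphRAG's actual implementation:

import networkx as nx

# Assumed delimiter values, for illustration only
TUPLE_DELIMITER = "<|>"
RECORD_DELIMITER = "##"
COMPLETION_DELIMITER = "<|COMPLETE|>"

def records_to_graph(llm_output: str) -> nx.Graph:
    """Parse ("entity"...) and ("relationship"...) records into a weighted undirected graph."""
    graph = nx.Graph()
    for record in llm_output.replace(COMPLETION_DELIMITER, "").split(RECORD_DELIMITER):
        parts = [p.strip().strip('"') for p in record.strip().strip("()").split(TUPLE_DELIMITER)]
        if parts[0] == "entity" and len(parts) == 4:
            _, name, entity_type, description = parts
            graph.add_node(name, type=entity_type, description=description)
        elif parts[0] == "relationship" and len(parts) == 5:
            _, source, target, description, strength = parts
            graph.add_edge(source, target, description=description, weight=float(strength))
    return graph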

Element Instances → Element Summaries

The previous step produced an undirected graph with relationships between its nodes. This step uses the LLM to further summarize the descriptions attached to each feature element. This regeneration is needed because the earlier descriptions were generated separately, chunk by chunk, so they are relatively thin and do not take the overall set of element instances into account.

The summary generated by the large model then replaces the description on the corresponding nodes and edges of the graph.

The core of this part is again the prompt, shown below:

SUMMARIZE_PROMPT = """
You are a helpful assistant responsible for generating a comprehensive summary of the data provided below.
Given one or two entities, and a list of descriptions, all related to the same entity or group of entities.
Please concatenate all of these into a single, comprehensive description. Make sure to include information collected from all the descriptions.
If the provided descriptions are contradictory, please resolve the contradictions and provide a single, coherent summary.
Make sure it is written in third person, and include the entity names so we have the full context.

#######
-Data-
Entities: {entity_name}
Description List: {description_list}
#######
Output:
"""

Element Summaries → Graph Communities

The first two steps gave us the final undirected graph. This step groups its nodes to construct "communities". Each community here is an array of nodes: it has a cluster_id (the community group id) and a set of nodes.

This part is implemented with the graspologic package, which provides algorithms for statistics and analysis on graphs. GraphRAG uses it to build the complete community hierarchy. Put simply, the hierarchical_leiden method divides the whole graph into multiple regions, each containing at most max_cluster_size nodes (the GraphRAG default is 10). A node here can be a feature-element entity itself, or a sub-community.

## https://github.com/microsoft/graphrag/blob/main/graphrag/index/operations/cluster_graph.py

def run_layout(strategy: dict[str, Any], graph: nx.Graph) -> Communities:
    """Run layout method definition."""
    if len(graph.nodes) == 0:
        log.warning("Graph has no nodes")
        return []

    clusters: dict[int, dict[str, list[str]]] = {}
    strategy_type = strategy.get("type", GraphCommunityStrategyType.leiden)
    match strategy_type:
        case GraphCommunityStrategyType.leiden:
            clusters = run_leiden(graph, strategy)
        case _:
            msg = f"Unknown clustering strategy {strategy_type}"
            raise ValueError(msg)

    results: Communities = []
    for level in clusters:
        for cluster_id, nodes in clusters[level].items():
            results.append((level, cluster_id, nodes))
    return results

def run_leiden(
    graph: nx.Graph, args: dict[str, Any])
 -> dict[int, dict[str, list[str]]]:
    """Run method definition."""
    max_cluster_size = args.get("max_cluster_size", 10)
    use_lcc = args.get("use_lcc", True)
    if args.get("verbose", False):
        log.info(
            "Running leiden with max_cluster_size=%s, lcc=%s", max_cluster_size, use_lcc
        )

    node_id_to_community_map = _compute_leiden_communities(
        graph=graph,
        max_cluster_size=max_cluster_size,
        use_lcc=use_lcc,
        seed=args.get("seed", 0xDEADBEEF),
    )
    levels = args.get("levels")

    ## If they don't pass in levels, use them all
    if levels is None:
        levels = sorted(node_id_to_community_map.keys())

    results_by_level: dict[int, dict[str, list[str]]] = {}
    for level in levels:
        result = {}
        results_by_level[level] = result
        for node_id, raw_community_id in node_id_to_community_map[level].items():
            community_id = str(raw_community_id)
            if community_id not in result:
                result[community_id] = []
            result[community_id].append(node_id)
    return results_by_level

## Taken from graph_intelligence & adapted
def _compute_leiden_communities(
    graph: nx.Graph | nx.DiGraph,
    max_cluster_size: int,
    use_lcc: bool,
    seed=0xDEADBEEF,)
 -> dict[int, dict[str, int]]:
    """Return Leiden root communities."""
    if use_lcc:
        graph = stable_largest_connected_component(graph)
   
  ## https://github.com/graspologic-org/graspologic/blob/main/graspologic/partition/leiden.py
    community_mapping = hierarchical_leiden(
        graph, max_cluster_size=max_cluster_size, random_seed=seed
    )
    results: dict[int, dict[str, int]] = {}
    for partition in community_mapping:
        results[partition.level] = results.get(partition.level, {})
        results[partition.level][partition.node] = partition.cluster

    return results
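
For intuition, here is a small usage sketch of hierarchical_leiden on a toy graph, assuming graspologic is installed; the max_cluster_size value is illustrative. Each returned partition exposes the level, node, and cluster fields that _compute_leiden_communities above reshapes into the level-to-node-to-cluster mapping:

import networkx as nx
from graspologic.partition import hierarchical_leiden

# Toy graph: two triangles joined by a single bridge edge
graph = nx.Graph()
graph.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("x", "y"), ("y", "z"), ("x", "z"),
    ("c", "x"),
])

for partition in hierarchical_leiden(graph, max_cluster_size=3, random_seed=0xDEADBEEF):
    print(partition.level, partition.node, partition.cluster)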

Graph Communities → Community Summaries

We now have the final graph and the distribution of communities within it. The last step is to add summary information for these communities. As mentioned above, the nodes in a community can be feature-element instances or sub-communities, so the summaries come in two kinds: ① summaries of leaf-level communities and ② summaries of higher-level communities.

## https://github.com/microsoft/graphrag/blob/main/graphrag/prompts/index/community_report.py

COMMUNITY_REPORT_PROMPT = """
You are an AI assistant that helps a human analyst to perform general information discovery. Information discovery is the process of identifying and assessing relevant information associated with certain entities (e.g., organizations and individuals) within a network.

## Goal
Write a comprehensive report of a community, given a list of entities that belong to the community as well as their relationships and optional associated claims. The report will be used to inform decision-makers about information associated with the community and their potential impact. The content of this report includes an overview of the community's key entities, their legal compliance, technical capabilities, reputation, and noteworthy claims.

## Report Structure

The report should include the following sections:

- TITLE: community's name that represents its key entities - title should be short but specific. When possible, include representative named entities in the title.
- SUMMARY: An executive summary of the community's overall structure, how its entities are related to each other, and significant information associated with its entities.
- IMPACT SEVERITY RATING: a float score between 0-10 that represents the severity of IMPACT posed by entities within the community.  IMPACT is the scored importance of a community.
- RATING EXPLANATION: Give a single sentence explanation of the IMPACT severity rating.
- DETAILED FINDINGS: A list of 5-10 key insights about the community. Each insight should have a short summary followed by multiple paragraphs of explanatory text grounded according to the grounding rules below. Be comprehensive.

Return output as a well-formed JSON-formatted string with the following format:
    {{
        "title": <report_title>,
        "summary": <executive_summary>,
        "rating": <impact_severity_rating>,
        "rating_explanation": <rating_explanation>,
        "findings": [
            {{
                "summary":<insight_1_summary>,
                "explanation": <insight_1_explanation>
            }},
            {{
                "summary":<insight_2_summary>,
                "explanation": <insight_2_explanation>
            }}
        ]
    }}

## Grounding Rules

Points supported by data should list their data references as follows:

"This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]."

Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.

For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (1), Entities (5, 7); Relationships (23); Claims (7, 2, 34, 64, 46, +more)]."

where 1, 5, 7, 23, 2, 34, 46, and 64 represent the id (not the index) of the relevant data record.

Do not include information where the supporting evidence for it is not provided.

## Example Input
-----------
Text:

Entities

id,entity,description
5,VERDANT OASIS PLAZA,Verdant Oasis Plaza is the location of the Unity March
6,HARMONY ASSEMBLY,Harmony Assembly is an organization that is holding a march at Verdant Oasis Plaza

Relationships

id,source,target,description
37,VERDANT OASIS PLAZA,UNITY MARCH,Verdant Oasis Plaza is the location of the Unity March
38,VERDANT OASIS PLAZA,HARMONY ASSEMBLY,Harmony Assembly is holding a march at Verdant Oasis Plaza
39,VERDANT OASIS PLAZA,UNITY MARCH,The Unity March is taking place at Verdant Oasis Plaza
40,VERDANT OASIS PLAZA,TRIBUNE SPOTLIGHT,Tribune Spotlight is reporting on the Unity march taking place at Verdant Oasis Plaza
41,VERDANT OASIS PLAZA,BAILEY ASADI,Bailey Asadi is speaking at Verdant Oasis Plaza about the march
43,HARMONY ASSEMBLY,UNITY MARCH,Harmony Assembly is organizing the Unity March

Output:
{{
    "title": "Verdant Oasis Plaza and Unity March",
    "summary": "The community revolves around the Verdant Oasis Plaza, which is the location of the Unity March. The plaza has relationships with the Harmony Assembly, Unity March, and Tribune Spotlight, all of which are associated with the march event.",
    "rating": 5.0,
    "rating_explanation": "The impact severity rating is moderate due to the potential for unrest or conflict during the Unity March.",
    "findings": [
        {{
            "summary": "Verdant Oasis Plaza as the central location",
            "explanation": "Verdant Oasis Plaza is the central entity in this community, serving as the location for the Unity March. This plaza is the common link between all other entities, suggesting its significance in the community. The plaza's association with the march could potentially lead to issues such as public disorder or conflict, depending on the nature of the march and the reactions it provokes. [Data: Entities (5), Relationships (37, 38, 39, 40, 41,+more)]"
        }},
        {{
            "summary": "Harmony Assembly's role in the community",
            "explanation": "Harmony Assembly is another key entity in this community, being the organizer of the march at Verdant Oasis Plaza. The nature of Harmony Assembly and its march could be a potential source of threat, depending on their objectives and the reactions they provoke. The relationship between Harmony Assembly and the plaza is crucial in understanding the dynamics of this community. [Data: Entities(6), Relationships (38, 43)]"
        }},
        {{
            "summary": "Unity March as a significant event",
            "explanation": "The Unity March is a significant event taking place at Verdant Oasis Plaza. This event is a key factor in the community's dynamics and could be a potential source of threat, depending on the nature of the march and the reactions it provokes. The relationship between the march and the plaza is crucial in understanding the dynamics of this community. [Data: Relationships (39)]"
        }},
        {{
            "summary": "Role of Tribune Spotlight",
            "explanation": "Tribune Spotlight is reporting on the Unity March taking place in Verdant Oasis Plaza. This suggests that the event has attracted media attention, which could amplify its impact on the community. The role of Tribune Spotlight could be significant in shaping public perception of the event and the entities involved. [Data: Relationships (40)]"
        }}
    ]
}}

## Real Data

Use the following text for your answer. Do not make anything up in your answer.

Text:
{input_text}

The report should include the following sections:

- TITLE: community's name that represents its key entities - title should be short but specific. When possible, include representative named entities in the title.
- SUMMARY: An executive summary of the community's overall structure, how its entities are related to each other, and significant information associated with its entities.
- IMPACT SEVERITY RATING: a float score between 0-10 that represents the severity of IMPACT posed by entities within the community.  IMPACT is the scored importance of a community.
- RATING EXPLANATION: Give a single sentence explanation of the IMPACT severity rating.
- DETAILED FINDINGS: A list of 5-10 key insights about the community. Each insight should have a short summary followed by multiple paragraphs of explanatory text grounded according to the grounding rules below. Be comprehensive.

Return output as a well-formed JSON-formatted string with the following format:
    {{
        "title": <report_title>,
        "summary": <executive_summary>,
        "rating": <impact_severity_rating>,
        "rating_explanation": <rating_explanation>,
        "findings": [
            {{
                "summary":<insight_1_summary>,
                "explanation": <insight_1_explanation>
            }},
            {{
                "summary":<insight_2_summary>,
                "explanation": <insight_2_explanation>
            }}
        ]
    }}

## Grounding Rules

Points supported by data should list their data references as follows:

"This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]."

Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.

For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (1), Entities (5, 7); Relationships (23); Claims (7, 2, 34, 64, 46, +more)]."

where 1, 5, 7, 23, 2, 34, 46, and 64 represent the id (not the index) of the relevant data record.

Do not include information where the supporting evidence for it is not provided.

Output:"""

Global query

We have covered the logic and hierarchy of knowledge-base construction above; queries likewise need to work across these levels. GraphRAG divides a global query into 2 steps:

  1. Take the summaries of all communities, stitch them into contexts, and send each batch to the large model together with the query, having the model score and explain each piece of content. The model is called multiple times here to construct the "community answers" (note: the paper splits this into two sub-steps, ① prepare the community summaries and ② obtain the community answers).
  2. Filter and sort the community answers, then splice them into the context in descending order of score, feeding in as many as fit within the model's token limit, and have the large model produce the final global answer.

The specific code is as follows:

## https://github.com/microsoft/graphrag/blob/main/graphrag/query/structured_search/global_search/search.py

class GlobalSearch(BaseSearch[GlobalContextBuilder]):
    """Search orchestration for global search mode."""
    
    ...
    
    async def asearch(
        self,
        query: str,
        conversation_history: ConversationHistory | None = None,
        **kwargs: Any,
    ) -> GlobalSearchResult:
    
     """
        - Step 1: Run parallel LLM calls on communities' short summaries to generate answer for each batch
        """
        ...
        
     ## Prepare batches of community report data table as context data for global search.
        context_result = await self.context_builder.build_context(
            query=query,
            conversation_history=conversation_history,
            **self.context_builder_params,
        )
      
        ...
        
        map_responses = await asyncio.gather(*[
           ## Generate answer for a single chunk of community reports.
            self._map_response_single_batch(
                context_data=data, query=query, **self.map_llm_params
            )
            for data in context_result.context_chunks
        ])
        
        """
        - Step 2: Combine the answers from step 2 to generate the final answer
        """
        reduce_response = await self._reduce_response(
            map_responses=map_responses,
            query=query,
            **self.reduce_llm_params,
        )
        

Construct community answers

In addition to the code, the prompts for these two steps are also important.

The prompt for getting the "community answers" is shown below; it has the large model produce a score and a description for each key point. Note that this is only the system prompt; the user prompt is the query itself.

"""System prompts for global search."""

MAP_SYSTEM_PROMPT = """
---Role---

You are a helpful assistant responding to questions about data in the tables provided.

---Goal---

Generate a response consisting of a list of key points that responds to the user's question, summarizing all relevant information in the input data tables.

You should use the data provided in the data tables below as the primary context for generating the response.
If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up.

Each key point in the response should have the following element:
- Description: A comprehensive description of the point.
- Importance Score: An integer score between 0-100 that indicates how important the point is in answering the user's question. An 'I don't know' type of response should have a score of 0.

The response should be JSON formatted as follows:
{{
    "points": [
        {{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}},
        {{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}}
    ]
}}

The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".

Points supported by data should list the relevant reports as references as follows:
"This is an example sentence supported by data references [Data: Reports (report ids)]"

**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.

For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"

where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables.

Do not include information where the supporting evidence for it is not provided.

---Data tables---

{context_data}

---Goal---

Generate a response consisting of a list of key points that responds to the user's question, summarizing all relevant information in the input data tables.

You should use the data provided in the data tables below as the primary context for generating the response.
If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up.

Each key point in the response should have the following element:
- Description: A comprehensive description of the point.
- Importance Score: An integer score between 0-100 that indicates how important the point is in answering the user's question. An 'I don't know' type of response should have a score of 0.

The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".

Points supported by data should list the relevant reports as references as follows:
"This is an example sentence supported by data references [Data: Reports (report ids)]"

**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.

For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"

where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables.

Do not include information where the supporting evidence for it is not provided.

The response should be JSON formatted as follows:
{{
    "points": [
        {{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}},
        {{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}}
    ]
}}
"""

Construct the global answer

The prompt for obtaining the final "global answer" is as follows. Again, this is the system prompt for the large model, and the user prompt is the query itself.

Another thing to note: if all "community answers" have been filtered out before the model is called, the content of NO_DATA_ANSWER is returned directly.

"""Global Search system prompts."""

REDUCE_SYSTEM_PROMPT = """
---Role---

You are a helpful assistant responding to questions about a dataset by synthesizing perspectives from multiple analysts.

---Goal---

Generate a response of the target length and format that responds to the user's question, summarize all the reports from multiple analysts who focused on different parts of the dataset.

Note that the analysts' reports provided below are ranked in the **descending order of importance**.

If you don't know the answer or if the provided reports do not contain sufficient information to provide an answer, just say so. Do not make anything up.

The final response should remove all irrelevant information from the analysts' reports and merge the cleaned information into a comprehensive answer that provides explanations of all the key points and implications appropriate for the response length and format.

Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown.

The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".

The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process.

**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.

For example:

"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"

where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.

Do not include information where the supporting evidence for it is not provided.

---Target response length and format---

{response_type}

---Analyst Reports---

{report_data}

---Goal---

Generate a response of the target length and format that responds to the user's question, summarize all the reports from multiple analysts who focused on different parts of the dataset.

Note that the analysts' reports provided below are ranked in the **descending order of importance**.

If you don't know the answer or if the provided reports do not contain sufficient information to provide an answer, just say so. Do not make anything up.

The final response should remove all irrelevant information from the analysts' reports and merge the cleaned information into a comprehensive answer that provides explanations of all the key points and implications appropriate for the response length and format.

The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".

The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process.

**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.

For example:

"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"

where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.

Do not include information where the supporting evidence for it is not provided.

---Target response length and format---

{response_type}

Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown.
"""

NO_DATA_ANSWER = (
    "I am sorry but I am unable to answer this question given the provided data."
)
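
And a rough sketch of the filter, sort, and fill behavior described above, consuming the key points parsed from the map step. This is illustrative only, not the actual _reduce_response; the token counter is an assumption:

def build_reduce_context(key_points: list[dict], max_context_tokens: int, count_tokens) -> str | None:
    """Rank community answers best-first and pack them into the context until the token budget is hit."""
    # Drop "I don't know" points (score 0), then rank by score descending
    filtered = sorted(
        (p for p in key_points if p["score"] > 0),
        key=lambda p: p["score"],
        reverse=True,
    )
    if not filtered:
        return None  # caller falls back to NO_DATA_ANSWER
    context, used = [], 0
    for point in filtered:
        cost = count_tokens(point["answer"])
        if used + cost > max_context_tokens:
            break
        context.append(point["answer"])
        used += cost
    return "\n\n".join(context)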

Local query

As you can see, the global query above does not use the embedding + vector DB approach at all; it queries the large model directly. The biggest difference in the local query is the first step, which uses embedding + a vector DB instead of the large model:

  1. Embed the original query, then use the vector to retrieve the most similar document entities from the knowledge base. Based on the communities, text units, and relationships those entities belong to, construct the community context, local context, and Text Unit context
  2. Call the large model to generate the final answer based on the constructed context

## https://github.com/microsoft/graphrag/blob/main/graphrag/query/structured_search/local_search/search.py

class LocalSearch(BaseSearch[LocalContextBuilder]):
    """Search orchestration for local search mode."""
    
    async def asearch(
        self,
        query: str,
        conversation_history: ConversationHistory | None = None,
        **kwargs,
    ) -> SearchResult:
        """Build local search context that fits a single context window and generate answer for the user query."""
        
        ...
        ## Step 1
        context_result = self.context_builder.build_context(
            query=query,
            conversation_history=conversation_history,
            **kwargs,
            **self.context_builder_params,
        )
        
        
        log.info("GENERATE ANSWER: %s. QUERY: %s", start_time, query)
        ## Step 2
        try:
            ...
            search_messages = [
                {"role": "system", "content": search_prompt},
                {"role": "user", "content": query},
            ]

            response = await self.llm.agenerate(
                messages=search_messages,
                streaming=True,
                callbacks=self.callbacks,
                **self.llm_params,
            )

Construct community context

The logic here is mainly divided into 2 parts:

  1. Query the knowledge base: this is the traditional embedding + vector similarity search for approximate document entities, which will not be expanded on here.
  2. Context construction, which produces three kinds of context:
     • Community context: assembled from the communities of the entities found by the similarity search; the focus is on community knowledge and summaries
     • Local context: a context containing the matched entities and the relationships around them in the graph; the focus is on the document entities themselves and their relationships
     • Text Unit context: the matched text units, sorted and used as context

## https://github.com/microsoft/graphrag/blob/main/graphrag/query/structured_search/local_search/mixed_context.py
class LocalSearchMixedContext(LocalContextBuilder):
    """Build data context for local search prompt combining community reports and entity/relationship/covariate tables."""
    ...

    def build_context(...) -> ContextBuilderResult:
        """
        Build data context for local search prompt.

        Build a context by combining community reports and entity/relationship/covariate tables, and text units using a predefined ratio set by summary_prop.
        """
        
        ...
        ## Step 1 : vector similarity search 
        selected_entities = map_query_to_entities(
            query=query,
            text_embedding_vectorstore=self.entity_text_embeddings,
            text_embedder=self.text_embedder,
            all_entities_dict=self.entities,
            embedding_vectorstore_key=self.embedding_vectorstore_key,
            include_entity_names=include_entity_names,
            exclude_entity_names=exclude_entity_names,
            k=top_k_mapped_entities,
            oversample_scaler=2,
        )
        
        ...
        ## Step 2 : build community context by selected entities
        community_context, community_context_data = self._build_community_context(
            selected_entities=selected_entities,
            max_tokens=community_tokens,
            use_community_summary=use_community_summary,
            column_delimiter=column_delimiter,
            include_community_rank=include_community_rank,
            min_community_rank=min_community_rank,
            return_candidate_context=return_candidate_context,
            context_name=community_context_name,
        )
        ...
        
        ## Step 3 : build local content by selected entities (i.e. entity-relationship-covariate) context
        local_prop = 1 - community_prop - text_unit_prop
        local_tokens = max(int(max_tokens * local_prop), 0)
        local_context, local_context_data = self._build_local_context(
            selected_entities=selected_entities,
            max_tokens=local_tokens,
            include_entity_rank=include_entity_rank,
            rank_description=rank_description,
            include_relationship_weight=include_relationship_weight,
            top_k_relationships=top_k_relationships,
            relationship_ranking_attribute=relationship_ranking_attribute,
            return_candidate_context=return_candidate_context,
            column_delimiter=column_delimiter,
        )
        ...
        
        ## Step 4 : build text unit context by selected entities
        text_unit_context, text_unit_context_data = self._build_text_unit_context(
            selected_entities=selected_entities,
            max_tokens=text_unit_tokens,
            return_candidate_context=return_candidate_context,
        )
        ...
        
        
        
## https://github.com/microsoft/graphrag/blob/main/graphrag/query/context_builder/entity_extraction.py
## Step 1
def map_query_to_entities(
    query: str,
    text_embedding_vectorstore: BaseVectorStore,
    text_embedder: BaseTextEmbedding,
    all_entities_dict: dict[str, Entity],
    embedding_vectorstore_key: str = EntityVectorStoreKey.ID,
    include_entity_names: list[str] | None = None,
    exclude_entity_names: list[str] | None = None,
    k: int = 10,
    oversample_scaler: int = 2,
) -> list[Entity]:
    """Extract entities that match a given query using semantic similarity of text embeddings of query and entity descriptions."""
    ...
    
    matched_entities = []
    if query != "":
        ## get entities with highest semantic similarity to query
        ## oversample to account for excluded entities
        search_results = text_embedding_vectorstore.similarity_search_by_text(
            text=query,
            text_embedder=lambda t: text_embedder.embed(t),
            k=k * oversample_scaler,
        )
        
        ...
   ...

Get local answers

The last step is to pass the context built above to the large model. Here, mainly the system prompt is set; the user prompt is the query itself.

QUESTION_SYSTEM_PROMPT = """
---Role---

You are a helpful assistant generating a bulleted list of {question_count} questions about data in the tables provided.

---Data tables---

{context_data}

---Goal---

Given a series of example questions provided by the user, generate a bulleted list of {question_count} candidates for the next question. Use - marks as bullet points.

These candidate questions should represent the most important or urgent information content or themes in the data tables.

The candidate questions should be answerable using the data tables provided, but should not mention any specific data fields or data tables in the question text.

If the user's questions reference several named entities, then each candidate question should reference all named entities.

---Example questions---
"""

Related

  • microsoft/graphrag: A modular graph-based Retrieval-Augmented Generation (RAG) system
  • From Local to Global: A Graph RAG Approach to Query-Focused Summarization
  • Welcome - GraphRAG

LightRAG

Overview

LightRAG focuses on two main points: the first, similar to GraphRAG, is constructing the complex relationships and context between document entities; the second is persisting those entity relationships and contexts to improve query efficiency.

In other words, LightRAG tries to preserve the complete document context and relationships while improving retrieval efficiency and supporting fast data updates.

[Figure: LightRAG architecture]

LightRAG can be understood as a simplified GraphRAG: it removes GraphRAG's community machinery and builds the graph directly from entities and their relationships, greatly reducing complexity and token consumption. This naturally makes LightRAG faster than GraphRAG in both construction and query.

Index construction

Among the stages of building the knowledge-base documents, LightRAG focuses mainly on extracting and analyzing entities and relationships, while the other stages are relatively simple.

Slicing is mainly token-based; see the chunking_by_token_size method for details, sketched right below.
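
A minimal sketch of that style of token chunking, assuming tiktoken; the default sizes and the returned dict shape are illustrative, and LightRAG's actual chunking_by_token_size carries more parameters:

import tiktoken

def chunk_by_token_size(content: str, overlap_token_size: int = 128, max_token_size: int = 1024) -> list[dict]:
    """Token-window chunking: fixed-size windows with a fixed overlap."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(content)
    chunks = []
    for index, start in enumerate(range(0, len(tokens), max_token_size - overlap_token_size)):
        window = tokens[start : start + max_token_size]
        chunks.append({
            "tokens": len(window),
            "content": enc.decode(window).strip(),
            "chunk_order_index": index,
        })
    return chunks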

The following focuses mainly on parsing entities and constructing the graph.

The parsing step passes each chunk to the large model, which extracts entities, relationships, and keywords: each entity becomes a node in the graph, and each relationship becomes an edge.

The specific prompt is as follows:


## https://github.com/HKUDS/LightRAG/blob/main/lightrag/prompt.py

PROMPTS["entity_extraction"] = """-Goal-
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, use same language as input text. If English, capitalized the name.
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)

3. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.
Format the content-level key words as ("content_keywords"{tuple_delimiter}<high_level_keywords>)

4. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.

5. When finished, output {completion_delimiter}

######################
-Examples-
######################
Example 1:

Entity_types: [person, technology, mission, organization, location]
Text:
while Alex clenched his jaw, the buzz of frustration dull against the backdrop of Taylor's authoritarian certainty. It was this competitive undercurrent that kept him alert, the sense that his and Jordan's shared commitment to discovery was an unspoken rebellion against Cruz's narrowing vision of control and order.

Then Taylor did something unexpected. They paused beside Jordan and, for a moment, observed the device with something akin to reverence. "If this tech can be understood..." Taylor said, their voice quieter, "It could change the game for us. For all of us."

The underlying dismissal earlier seemed to falter, replaced by a glimpse of reluctant respect for the gravity of what lay in their hands. Jordan looked up, and for a fleeting heartbeat, their eyes locked with Taylor's, a wordless clash of wills softening into an uneasy truce.

It was a small transformation, barely perceptible, but one that Alex noted with an inward nod. They had all been brought here by different paths
################
Output:
("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is a character who experiences frustration and is observant of the dynamics among other characters."){record_delimiter}
("entity"{tuple_delimiter}"Taylor"{tuple_delimiter}"person"{tuple_delimiter}"Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective."){record_delimiter}
("entity"{tuple_delimiter}"Jordan"{tuple_delimiter}"person"{tuple_delimiter}"Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device."){record_delimiter}
("entity"{tuple_delimiter}"Cruz"{tuple_delimiter}"person"{tuple_delimiter}"Cruz is associated with a vision of control and order, influencing the dynamics among other characters."){record_delimiter}
("entity"{tuple_delimiter}"The Device"{tuple_delimiter}"technology"{tuple_delimiter}"The Device is central to the story, with potential game-changing implications, and is revered by Taylor."){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Taylor"{tuple_delimiter}"Alex is affected by Taylor's authoritarian certainty and observes changes in Taylor's attitude towards the device."{tuple_delimiter}"power dynamics, perspective shift"{tuple_delimiter}7){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Jordan"{tuple_delimiter}"Alex and Jordan share a commitment to discovery, which contrasts with Cruz's vision."{tuple_delimiter}"shared goals, rebellion"{tuple_delimiter}6){record_delimiter}
("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"Jordan"{tuple_delimiter}"Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce."{tuple_delimiter}"conflict resolution, mutual respect"{tuple_delimiter}8){record_delimiter}
("relationship"{tuple_delimiter}"Jordan"{tuple_delimiter}"Cruz"{tuple_delimiter}"Jordan's commitment to discovery is in rebellion against Cruz's vision of control and order."{tuple_delimiter}"ideological conflict, rebellion"{tuple_delimiter}5){record_delimiter}
("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"The Device"{tuple_delimiter}"Taylor shows reverence towards the device, indicating its importance and potential impact."{tuple_delimiter}"reverence, technological significance"{tuple_delimiter}9){record_delimiter}
("content_keywords"{tuple_delimiter}"power dynamics, ideological conflict, discovery, rebellion"){completion_delimiter}
#############################
Example 2:

Entity_types: [person, technology, mission, organization, location]
Text:
They were no longer mere operatives; they had become guardians of a threshold, keepers of a message from a realm beyond stars and stripes. This elevation in their mission could not be shackled by regulations and established protocols—it demanded a new perspective, a new resolve.

Tension threaded through the dialogue of beeps and static as communications with Washington buzzed in the background. The team stood, a portentous air enveloping them. It was clear that the decisions they made in the ensuing hours could redefine humanity's place in the cosmos or condemn them to ignorance and potential peril.

Their connection to the stars solidified, the group moved to address the crystallizing warning, shifting from passive recipients to active participants. Mercer's latter instincts gained precedence— the team's mandate had evolved, no longer solely to observe and report but to interact and prepare. A metamorphosis had begun, and Operation: Dulce hummed with the newfound frequency of their daring, a tone set not by the earthly
#############
Output:
("entity"{tuple_delimiter}"Washington"{tuple_delimiter}"location"{tuple_delimiter}"Washington is a location where communications are being received, indicating its importance in the decision-making process."){record_delimiter}
("entity"{tuple_delimiter}"Operation: Dulce"{tuple_delimiter}"mission"{tuple_delimiter}"Operation: Dulce is described as a mission that has evolved to interact and prepare, indicating a significant shift in objectives and activities."){record_delimiter}
("entity"{tuple_delimiter}"The team"{tuple_delimiter}"organization"{tuple_delimiter}"The team is portrayed as a group of individuals who have transitioned from passive observers to active participants in a mission, showing a dynamic change in their role."){record_delimiter}
("relationship"{tuple_delimiter}"The team"{tuple_delimiter}"Washington"{tuple_delimiter}"The team receives communications from Washington, which influences their decision-making process."{tuple_delimiter}"decision-making, external influence"{tuple_delimiter}7){record_delimiter}
("relationship"{tuple_delimiter}"The team"{tuple_delimiter}"Operation: Dulce"{tuple_delimiter}"The team is directly involved in Operation: Dulce, executing its evolved objectives and activities."{tuple_delimiter}"mission evolution, active participation"{tuple_delimiter}9){completion_delimiter}
("content_keywords"{tuple_delimiter}"mission evolution, decision-making, active participation, cosmic significance"){completion_delimiter}
#############################
Example 3:

Entity_types: [person, role, technology, organization, event, location, concept]
Text:
their voice slicing through the buzz of activity. "Control may be an illusion when facing an intelligence that literally writes its own rules," they stated stoically, casting a watchful eye over the flurry of data.

"It's like it's learning to communicate," offered Sam Rivera from a nearby interface, their youthful energy boding a mix of awe and anxiety. "This gives 'talking to strangers' a whole new meaning."

Alex surveyed his team—each face a study in concentration, determination, and not a small measure of trepidation. "This might well be our first contact," he acknowledged, "And we need to be ready for whatever answers back."

Together, they stood on the edge of the unknown, forging humanity's response to a message from the heavens. The ensuing silence was palpable—a collective introspection about their role in this grand cosmic play, one that could rewrite human history.

The encrypted dialogue continued to unfold, its intricate patterns showing an almost uncanny anticipation
#############
Output:
("entity"{tuple_delimiter}"Sam Rivera"{tuple_delimiter}"person"{tuple_delimiter}"Sam Rivera is a member of a team working on communicating with an unknown intelligence, showing a mix of awe and anxiety."){record_delimiter}
("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is the leader of a team attempting first contact with an unknown intelligence, acknowledging the significance of their task."){record_delimiter}
("entity"{tuple_delimiter}"Control"{tuple_delimiter}"concept"{tuple_delimiter}"Control refers to the ability to manage or govern, which is challenged by an intelligence that writes its own rules."){record_delimiter}
("entity"{tuple_delimiter}"Intelligence"{tuple_delimiter}"concept"{tuple_delimiter}"Intelligence here refers to an unknown entity capable of writing its own rules and learning to communicate."){record_delimiter}
("entity"{tuple_delimiter}"First Contact"{tuple_delimiter}"event"{tuple_delimiter}"First Contact is the potential initial communication between humanity and an unknown intelligence."){record_delimiter}
("entity"{tuple_delimiter}"Humanity's Response"{tuple_delimiter}"event"{tuple_delimiter}"Humanity's Response is the collective action taken by Alex's team in response to a message from an unknown intelligence."){record_delimiter}
("relationship"{tuple_delimiter}"Sam Rivera"{tuple_delimiter}"Intelligence"{tuple_delimiter}"Sam Rivera is directly involved in the process of learning to communicate with the unknown intelligence."{tuple_delimiter}"communication, learning process"{tuple_delimiter}9){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"First Contact"{tuple_delimiter}"Alex leads the team that might be making the First Contact with the unknown intelligence."{tuple_delimiter}"leadership, exploration"{tuple_delimiter}10){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Humanity's Response"{tuple_delimiter}"Alex and his team are the key figures in Humanity's Response to the unknown intelligence."{tuple_delimiter}"collective action, cosmic significance"{tuple_delimiter}8){record_delimiter}
("relationship"{tuple_delimiter}"Control"{tuple_delimiter}"Intelligence"{tuple_delimiter}"The concept of Control is challenged by the Intelligence that writes its own rules."{tuple_delimiter}"power dynamics, autonomy"{tuple_delimiter}7){record_delimiter}
("content_keywords"{tuple_delimiter}"first contact, control, communication, cosmic significance"){completion_delimiter}
#############################
-Real Data-
######################
Entity_types: {entity_types}
Text: {input_text}
######################
Output:
"""

As you can see, this is quite similar to GraphRAG's prompt. LightRAG then parses the entities and relationships out of the model's output and stores them in the database.

The entity mainly contains name, type, description and source_id,

and the relationship mainly contains src_id, tgt_id, weight, description, keywords and source_id.

Here, source_id refers to the key of the chunk the record was extracted from, while src_id and tgt_id refer to entity_name values.

From these records, the graph of the knowledge base can be constructed. This is essentially all there is to LightRAG's index construction.
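For illustration, parsing the delimited records that this prompt asks the model to emit could look roughly like the sketch below. This is a hypothetical helper, not LightRAG's actual parser; the delimiter values follow the defaults in lightrag/prompt.py.

## Hypothetical sketch of parsing the delimited extraction output into
## entity/relationship records; not LightRAG's actual parser.
TUPLE_DELIMITER = "<|>"               ## default {tuple_delimiter}
RECORD_DELIMITER = "##"               ## default {record_delimiter}
COMPLETION_DELIMITER = "<|COMPLETE|>" ## default {completion_delimiter}

def parse_extraction_output(output: str, chunk_key: str):
    entities, relationships = [], []
    output = output.replace(COMPLETION_DELIMITER, "")
    for record in output.split(RECORD_DELIMITER):
        record = record.strip().strip("()")
        if not record:
            continue
        fields = [f.strip().strip('"') for f in record.split(TUPLE_DELIMITER)]
        if fields[0] == "entity" and len(fields) >= 4:
            entities.append({
                "name": fields[1], "type": fields[2],
                "description": fields[3], "source_id": chunk_key,
            })
        elif fields[0] == "relationship" and len(fields) >= 6:
            relationships.append({
                "src_id": fields[1], "tgt_id": fields[2],
                "description": fields[3], "keywords": fields[4],
                "weight": float(fields[5]), "source_id": chunk_key,
            })
    return entities, relationships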

Queries
#

LightRAG provides 4 query modes: ① local ② global ③ hybrid ④ naive

The fourth mode is the classic RAG query: retrieve the topK chunks for the query, then pass the query and the chunks to the large model to obtain the final answer.
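At the API level, the mode is chosen per query. A minimal usage sketch based on the examples in LightRAG's README (the working directory and model function are placeholders, and exact signatures may differ across versions):

## Minimal usage sketch based on LightRAG's README; the path and model
## function here are placeholders.
from lightrag import LightRAG, QueryParam
from lightrag.llm import gpt_4o_mini_complete

rag = LightRAG(working_dir="./rag_storage", llm_model_func=gpt_4o_mini_complete)
rag.insert("...your document text...")

## The same question can be answered in any of the four modes.
for mode in ["naive", "local", "global", "hybrid"]:
    print(rag.query("What are the top themes in this story?",
                    param=QueryParam(mode=mode)))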

In this section we will focus on the first three modes. They are roughly similar and can be summarized in the following three steps:

  1. Call the large model to extract the keywords in the query, using the prompt below. LightRAG divides keywords into two categories:
    1. low_level_keywords: very specific keywords that can be mapped to entities
    2. high_level_keywords: keywords that represent broad, overarching concepts
## https://github.com/HKUDS/LightRAG/blob/main/lightrag/prompt.py

PROMPTS["keywords_extraction"] = """---Role---

You are a helpful assistant tasked with identifying both high-level and low-level keywords in the user's query.

---Goal---

Given the query, list both high-level and low-level keywords. High-level keywords focus on overarching concepts or themes, while low-level keywords focus on specific entities, details, or concrete terms.

---Instructions---

- Output the keywords in JSON format.
- The JSON should have two keys:
  - "high_level_keywords" for overarching concepts or themes.
  - "low_level_keywords" for specific entities or details.

######################
-Examples-
######################
Example 1:

Query: "How does international trade influence global economic stability?"
################
Output:
{{
  "high_level_keywords": ["International trade", "Global economic stability", "Economic impact"],
  "low_level_keywords": ["Trade agreements", "Tariffs", "Currency exchange", "Imports", "Exports"]
}}
#############################
Example 2:

Query: "What are the environmental consequences of deforestation on biodiversity?"
################
Output:
{{
  "high_level_keywords": ["Environmental consequences", "Deforestation", "Biodiversity loss"],
  "low_level_keywords": ["Species extinction", "Habitat destruction", "Carbon emissions", "Rainforest", "Ecosystem"]
}}
#############################
Example 3:

Query: "What is the role of education in reducing poverty?"
################
Output:
{{
  "high_level_keywords": ["Education", "Poverty reduction", "Socioeconomic development"],
  "low_level_keywords": ["School access", "Literacy rates", "Job training", "Income inequality"]
}}
#############################
-Real Data-
######################
Query: {query}
######################
Output:

"""
  2. Construct the context based on the keywords. The context format is the same in all three modes; see the code below:
    1. In local mode, the parsed low_level_keywords are embedded and used to query the topK entities; the related relations and original chunks are then fetched from those entities to build the overall context
    2. In global mode, the parsed high_level_keywords are embedded and used to query the topK relations; the related entities and original chunks are then fetched from those relations to build the overall context
    3. In hybrid mode, the parsed low_level_keywords and high_level_keywords are used to build the local and global contexts respectively, and the entities, relations and chunks of the two are merged into one overall context
  ## https://github.com/HKUDS/LightRAG/blob/main/lightrag/operate.py
  
  context = f"""
  -----Entities-----
  ```csv
  {entities_context}
  ```
  -----Relationships-----
  ```csv
  {relations_context}
  ```
  -----Sources-----
  ```csv
  {text_units_context}
  ```
  """
  3. Based on the context built in step 2, request the large model for the final answer. The context is concatenated into the system prompt, while the user prompt is the query itself. The system prompt template is as follows:

    ## https://github.com/HKUDS/LightRAG/blob/main/lightrag/prompt.py
    
    PROMPTS["rag_response"] = """---Role---
    
    You are a helpful assistant responding to questions about data in the tables provided.
    
    ---Goal---
    
    Generate a response of the target length and format that responds to the user's question, summarizing all information in the input data tables appropriate for the response length and format, and incorporating any relevant general knowledge.
    If you don't know the answer, just say so. Do not make anything up.
    Do not include information where the supporting evidence for it is not provided.
    
    ---Target response length and format---
    
    {response_type}
    
    ---Data tables---
    
    {context_data}
    
    Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown.
    """
    
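Putting the three steps together, the query flow of the local/global/hybrid modes can be summarized in the sketch below. This is simplified pseudologic of lightrag/operate.py, not the actual implementation; the llm callable, the two context builders and the prompt dict are stand-ins.

## Simplified sketch of the local/global/hybrid query flow in
## lightrag/operate.py -- not the actual implementation. `llm`, the two
## context builders and `prompts` are stand-ins passed in by the caller.
import json

async def kg_query(query, mode, llm, build_local_ctx, build_global_ctx, prompts):
    ## Step 1: extract low-/high-level keywords from the query
    raw = await llm(prompts["keywords_extraction"].format(query=query))
    keywords = json.loads(raw)
    ll_kw = ", ".join(keywords["low_level_keywords"])
    hl_kw = ", ".join(keywords["high_level_keywords"])

    ## Step 2: build the context from the graph
    if mode == "local":      ## entities first, then their relations + chunks
        context = await build_local_ctx(ll_kw)
    elif mode == "global":   ## relations first, then their entities + chunks
        context = await build_global_ctx(hl_kw)
    else:                    ## hybrid: merge both contexts
        context = (await build_local_ctx(ll_kw)) + "\n" + (await build_global_ctx(hl_kw))

    ## Step 3: context goes into the system prompt, query stays the user prompt
    sys_prompt = prompts["rag_response"].format(
        context_data=context, response_type="Multiple Paragraphs"
    )
    return await llm(query, system_prompt=sys_prompt)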

Persistence
#

We mentioned earlier that LightRAG persists the entire graph to improve query efficiency. Let’s take a look at how it is persisted.

There are three main types of data stored in LightRAG:

  1. KV data: stores the raw documents and chunks as key-value pairs, and also caches the results of large-model calls. So far, LightRAG supports an in-memory KV store and Oracle.
  2. Vector data: stores the vectors of chunks, entities and relations. So far, LightRAG supports nano-vectordb and Oracle.
  3. Graph data: stores the graph structure itself. So far, LightRAG supports NetworkX, Neo4j and Oracle.
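For reference, the repo's Oracle demo selects these backends by class name in the constructor. A hedged sketch (parameter names follow the demo script and may differ across versions):

## Hedged sketch of enabling the Oracle backends, modeled on the repo's
## Oracle demo; parameter names may differ across versions.
from lightrag import LightRAG

async def my_llm_func(prompt, **kwargs) -> str:
    ...  ## plug in your model call here

rag = LightRAG(
    working_dir="./rag_storage",
    llm_model_func=my_llm_func,
    kv_storage="OracleKVStorage",            ## docs/chunks + LLM cache
    vector_storage="OracleVectorDBStorage",  ## chunk/entity/relation vectors
    graph_storage="OracleGraphStorage",      ## the graph itself
)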

Here we will mainly look at the Oracle table structures:

## https://github.com/HKUDS/LightRAG/blob/main/lightrag/kg/oracle_impl.py

TABLES = {
   ## full doc
    "LIGHTRAG_DOC_FULL": {
        "ddl": """CREATE TABLE LIGHTRAG_DOC_FULL (
                    id varchar(256) PRIMARY KEY,
                    workspace varchar(1024),
                    doc_name varchar(1024),
                    content CLOB,
                    meta JSON,
                    createtime TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    updatetime TIMESTAMP DEFAULT NULL
                    )"""
    },
    
    ## chunk text
    "LIGHTRAG_DOC_CHUNKS": {
        "ddl": """CREATE TABLE LIGHTRAG_DOC_CHUNKS (
                    id varchar(256) PRIMARY KEY,
                    workspace varchar(1024),
                    full_doc_id varchar(256),
                    chunk_order_index NUMBER,
                    tokens NUMBER,
                    content CLOB,
                    content_vector VECTOR,
                    createtime TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    updatetime TIMESTAMP DEFAULT NULL
                    )"""
    },
    
    ## entity
    "LIGHTRAG_GRAPH_NODES": {
        "ddl": """CREATE TABLE LIGHTRAG_GRAPH_NODES (
                    id NUMBER GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
                    workspace varchar(1024),
                    name varchar(2048),
                    entity_type varchar(1024),
                    description CLOB,
                    source_chunk_id varchar(256),
                    content CLOB,
                    content_vector VECTOR,
                    createtime TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    updatetime TIMESTAMP DEFAULT NULL
                    )"""
    },
    
    ## relation
    "LIGHTRAG_GRAPH_EDGES": {
        "ddl": """CREATE TABLE LIGHTRAG_GRAPH_EDGES (
                    id NUMBER GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
                    workspace varchar(1024),
                    source_name varchar(2048),
                    target_name varchar(2048),
                    weight NUMBER,
                    keywords CLOB,
                    description CLOB,
                    source_chunk_id varchar(256),
                    content CLOB,
                    content_vector VECTOR,
                    createtime TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    updatetime TIMESTAMP DEFAULT NULL
                    )"""
    },
    
    ## cache
    "LIGHTRAG_LLM_CACHE": {
        "ddl": """CREATE TABLE LIGHTRAG_LLM_CACHE (
                    id varchar(256) PRIMARY KEY,
                    send clob,
                    return clob,
                    model varchar(1024),
                    createtime TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    updatetime TIMESTAMP DEFAULT NULL
                    )"""
    },
    
    ## graph
    "LIGHTRAG_GRAPH": {
        "ddl": """CREATE OR REPLACE PROPERTY GRAPH lightrag_graph
                VERTEX TABLES (
                    lightrag_graph_nodes KEY (id)
                        LABEL entity
                        PROPERTIES (id,workspace,name) -- ,entity_type,description,source_chunk_id)
                )
                EDGE TABLES (
                    lightrag_graph_edges KEY (id)
                        SOURCE KEY (source_name) REFERENCES lightrag_graph_nodes(name)
                        DESTINATION KEY (target_name) REFERENCES lightrag_graph_nodes(name)
                        LABEL  has_relation
                        PROPERTIES (id,workspace,source_name,target_name) -- ,weight, keywords,description,source_chunk_id)
                ) OPTIONS(ALLOW MIXED PROPERTY TYPES)"""
    },
}

Related
#

HKUDS/LightRAG: "LightRAG: Simple and Fast Retrieval-Augmented Generation"

RAPTOR RAG
#

Overview
#

RAPTOR is a RAG method proposed by Stanford in January 2024. The main idea is to cluster the chunks, have the large model write a summary for each cluster, and then repeat the process, building a tree structure from the bottom up.

RAPTOR.png

Index construction
#

As mentioned above, the idea behind RAPTOR is to use chunks as leaf nodes, then cluster → summarize → cluster → summarize → … until a root node is reached. The flow is easiest to follow in code, so let's go straight to it:

## https://github.com/parthsarthi03/raptor/blob/master/raptor/cluster_tree_builder.py

 class ClusterTreeBuilder(TreeBuilder):
 
     def construct_tree(
        self,
        current_level_nodes: Dict[int, Node],
        all_tree_nodes: Dict[int, Node],
        layer_to_nodes: Dict[int, List[Node]],
        use_multithreading: bool = False,
    ) -> Dict[int, Node]:
        logging.info("Using Cluster TreeBuilder")

        next_node_index = len(all_tree_nodes)

     ## summarize the nodes in a cluster
        def process_cluster(
            cluster, new_level_nodes, next_node_index, summarization_length, lock
        ):
            node_texts = get_text(cluster)
       
       ## summarize the cluster's node texts with the LLM
            summarized_text = self.summarize(
                context=node_texts,
                max_tokens=summarization_length,
            )

            logging.info(
                f"Node Texts Length: {len(self.tokenizer.encode(node_texts))}, Summarized Text Length: {len(self.tokenizer.encode(summarized_text))}"
            )

            __, new_parent_node = self.create_node(
                next_node_index, summarized_text, {node.index for node in cluster}
            )

            with lock:
                new_level_nodes[next_node_index] = new_parent_node

     ## the default number of layers is 5
        for layer in range(self.num_layers):

            new_level_nodes = {}

            logging.info(f"Constructing Layer {layer}")

            node_list_current_layer = get_node_list(current_level_nodes)

            if len(node_list_current_layer) <= self.reduction_dimension + 1:
                self.num_layers = layer
                logging.info(
                    f"Stopping Layer construction: Cannot Create More Layers. Total Layers in tree: {layer}"
                )
                break
       
       ## cluster the nodes of the current layer
            clusters = self.clustering_algorithm.perform_clustering(
                node_list_current_layer,
                self.cluster_embedding_model,
                reduction_dimension=self.reduction_dimension,
                **self.clustering_params,
            )

            lock = Lock()

            summarization_length = self.summarization_length
            logging.info(f"Summarization Length: {summarization_length}")

            if use_multithreading:
                with ThreadPoolExecutor() as executor:
                    for cluster in clusters:
                        executor.submit(
                            process_cluster,
                            cluster,
                            new_level_nodes,
                            next_node_index,
                            summarization_length,
                            lock,
                        )
                        next_node_index += 1
                    executor.shutdown(wait=True)

            else:
                for cluster in clusters:
                    process_cluster(
                        cluster,
                        new_level_nodes,
                        next_node_index,
                        summarization_length,
                        lock,
                    )
                    next_node_index += 1

            layer_to_nodes[layer + 1] = list(new_level_nodes.values())
            current_level_nodes = new_level_nodes
            all_tree_nodes.update(new_level_nodes)

            tree = Tree(
                all_tree_nodes,
                layer_to_nodes[layer + 1],
                layer_to_nodes[0],
                layer + 1,
                layer_to_nodes,
            )

        return current_level_nodes

The summarization prompt used here is fairly simple: it just asks the large model to summarize the text of the nodes in a cluster, so it is not reproduced in full here.
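For reference, the call boils down to roughly the following shape (paraphrased from raptor/summarization_models.py; the exact wording and model may differ):

## Rough shape of RAPTOR's cluster summarization call, paraphrased from
## raptor/summarization_models.py; wording and model may differ.
from openai import OpenAI

def summarize(context: str, max_tokens: int = 500) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": (
                "Write a summary of the following, including as many "
                f"key details as possible: {context}")},
        ],
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content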

Query
#

This part is also relatively simple. RAPTOR provides two search methods: ① search across all nodes of the tree (the "collapsed tree"), and ② search the nodes layer by layer starting from a given layer. The first is the default.

The search itself first obtains the embedding of the query and the embeddings of all candidate nodes, computes the distance between the query embedding and each node embedding, and sorts by distance. Finally, the top_k nodes are selected as the context.

## https://github.com/parthsarthi03/raptor/blob/master/raptor/tree_retriever.py

class TreeRetriever(BaseRetriever):

    def retrieve(
        self,
        query: str,
        start_layer: int = None,
        num_layers: int = None,
        top_k: int = 10, 
        max_tokens: int = 3500,
        collapse_tree: bool = True,
        return_layer_information: bool = False,
    ) -> str:
      
      ...
    
        if collapse_tree:
           ## default: retrieve the most relevant nodes from all of the tree's nodes
            logging.info(f"Using collapsed_tree")
            selected_nodes, context = self.retrieve_information_collapse_tree(
                query, top_k, max_tokens
            )
        else:
           ## retrieve the most relevant nodes from specific layers, starting at start_layer
            layer_nodes = self.tree.layer_to_nodes[start_layer]
            selected_nodes, context = self.retrieve_information(
                layer_nodes, query, num_layers
            )
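
The distance computation underneath is straightforward. A minimal sketch of the collapsed-tree selection, using cosine distance (illustrative only, not RAPTOR's actual code):

## Minimal sketch of collapsed-tree retrieval: rank every tree node by
## cosine distance to the query and keep the top_k. Illustrative only.
import numpy as np

def retrieve_top_k(query_emb: np.ndarray, node_embs: np.ndarray, top_k: int = 10):
    q = query_emb / np.linalg.norm(query_emb)
    n = node_embs / np.linalg.norm(node_embs, axis=1, keepdims=True)
    distances = 1.0 - n @ q              ## cosine distance = 1 - cosine similarity
    return np.argsort(distances)[:top_k] ## indices of the nearest nodes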

Related
#

parthsarthi03/raptor: The official implementation of "RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval"

Extension
#

Contextual Retrieval
#

This is an improvement on traditional RAG proposed by Anthropic.

The article notes that the long-context capability of large models has improved greatly, so a simpler, more brute-force RAG method becomes viable: for each chunk, have the model generate a short context from the full document and prepend it to the chunk, so that every chunk carries document-level context. This greatly reduces the retrieval failures caused by chunks lacking contextual information. However, unlike the graph-based methods, it captures no relationships between documents.

The prompt used to generate each chunk's context is as follows:

<document> 
{{WHOLE_DOCUMENT}} 
</document> 
Here is the chunk we want to situate within the whole document 
<chunk> 
{{CHUNK_CONTENT}} 
</chunk> 
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else. 

Contextual Retrieval.png

Introducing Contextual Retrieval
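
Applying this is mechanical: for each chunk, call the model with the whole document plus the chunk, then prepend the returned context before embedding. A minimal sketch (the llm callable is a placeholder, and the prompt is abbreviated from the one above):

## Minimal sketch of Contextual Retrieval: prepend an LLM-generated
## context to each chunk before embedding. `llm` is a placeholder.
CONTEXT_PROMPT = """<document>
{whole_document}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{chunk_content}
</chunk>
Answer only with the succinct context and nothing else."""

def contextualize_chunks(llm, whole_document: str, chunks: list[str]) -> list[str]:
    contextualized = []
    for chunk in chunks:
        ctx = llm(CONTEXT_PROMPT.format(
            whole_document=whole_document, chunk_content=chunk))
        contextualized.append(f"{ctx}\n\n{chunk}")  ## context + original chunk
    return contextualized

In practice, Anthropic pairs this with prompt caching, so the full document does not have to be reprocessed at full cost for every chunk.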

RAG evaluation
#

In June, Amazon open-sourced a library that can evaluate the overall performance and efficiency of RAG pipelines at a fine-grained level. If you are interested, take a look:

amazon-science/RAGChecker: "RAGChecker: A Fine-grained Framework For Diagnosing RAG"

Summary
#

To conclude: GraphRAG currently looks like the most comprehensive solution, but its token consumption during both construction and querying is much higher than the other methods, and its queries are slow.

As a result, many variants that optimize GraphRAG have been derived, and LightRAG is one of the better ones. Looking at LightRAG's design, you can clearly see that both its token consumption and its retrieval efficiency are much improved.

RAPTOR is used less frequently, and its overall design is not very complicated. As for the Contextual Retrieval proposed by Anthropic, it makes many optimizations at the chunk level; if there are only a few documents, it is a perfectly reasonable choice, and it is simple and direct.

Which method to choose ultimately depends on the size of the corpus. If it is relatively small, Contextual Retrieval is sufficient: just pass as much context as possible to the large model. As the corpus grows, LightRAG becomes worth considering. For an enterprise-level document library maintained and developed by dedicated staff, GraphRAG is of course the best option, but the maintenance and resource costs are equally obvious.
