Introduction #
So far, RAG remains the large-model application with the most real-world traction. Compared with last year's approach of simply slicing documents into a store and querying them, this year has produced many innovative and genuinely effective RAG methods, such as Microsoft's open-source GraphRAG, Contextual Retrieval from the Anthropic team, and LightRAG from HKU.
This article surveys these RAG methods, aiming to understand their design principles and concrete implementations from both the papers and the source code.
GraphRAG #
Overview #
GraphRAG is well known by now, so let's start with the pain point that motivated it: when a user asks a relatively broad or global question, retrieving one or a few documents is not enough to answer it. Think of a user asking about an entire industry sector in a stock-market setting, or about a whole class of similar cases in a legal setting. Such questions can span far more than a few documents; with traditional RAG, the only lever is to retrieve more matching document chunks to give the model more knowledge. That in turn runs into the model's context-length limit (though base models have greatly eased this constraint: Claude 3.5 supports 200k tokens, and most open-source model families now include 128k-context variants).
GraphRAG's answer to this problem is to build the documents into a structured graph at index time, and then, as described below, summarize communities of that graph to answer global questions.
索引构建 #
GraphRAG splits indexing into five main workflows:
- Document workflow: parses and embeds the raw documents; note that it does not do the chunking
- Chunk workflow: slices the raw documents into chunks, converts them into Text Unit objects, then embeds each Text Unit
- Graph workflow: parses chunk contents, extracting names, types, descriptions, and inter-chunk relationships into element objects, at which point the graph takes initial shape; then re-summarizes the chunk descriptions based on the extracted elements; finally clusters the graph's nodes into groups, with multiple groups forming a Community, ultimately yielding a multi-level community structure over the graph
- Community workflow: with the graph communities built, this step has the LLM summarize the communities and their groups, generating description text
- Covariate workflow: an optional final step that extracts covariates from the Text Unit objects of step two; this article does not cover it in depth
## https://github.com/microsoft/graphrag/blob/main/graphrag/index/create_pipeline_config.py
def create_pipeline_config(settings: GraphRagConfig, verbose=False) -> PipelineConfig:
...
skip_workflows = settings.skip_workflows
embedded_fields = _get_embedded_fields(settings)
covariates_enabled = (
settings.claim_extraction.enabled
and create_final_covariates not in skip_workflows
)
result = PipelineConfig(
root_dir=settings.root_dir,
input=_get_pipeline_input_config(settings),
reporting=_get_reporting_config(settings),
storage=_get_storage_config(settings, settings.storage),
update_index_storage=_get_storage_config(
settings, settings.update_index_storage
),
cache=_get_cache_config(settings),
        ## workflow execution order
workflows=[
## 1. create_final_documents
*_document_workflows(settings, embedded_fields),
## 1. create_base_text_units
## 2. create_final_text_units
*_text_unit_workflows(settings, covariates_enabled, embedded_fields),
## 1. create_base_entity_graph
## 2. create_final_entities
## 3. create_final_relationships
## 4. create_final_nodes
*_graph_workflows(settings, embedded_fields),
## 1. create_final_communities
## 2. create_final_community_reports
*_community_workflows(settings, covariates_enabled, embedded_fields),
## 1. create_final_covariates
*(_covariate_workflows(settings) if covariates_enabled else []),
],
)
## Remove any workflows that were specified to be skipped
log.info("skipping workflows %s", ",".join(skip_workflows))
    ## keep only the workflows that will actually run
result.workflows = [w for w in result.workflows if w.name not in skip_workflows]
return result
Source Document → Text Chunks #
This step splits the documents into chunks. GraphRAG officially provides two strategies:
- Token-based splitting: the implementation borrows from langchain's text_splitter, slicing to spec via the chunk_size and chunk_overlap parameters; token counting uses the cl100k_base encoding
- Sentence-based splitting: this directly uses nltk.tokenize.sent_tokenize(text, language='english') from the Natural Language Toolkit

For token-based splitting, pay attention to the chunk size, i.e. tokens_per_chunk. If chunks are too small, indexing takes more model calls, more time, and more money; if they are too large, recall and precision may suffer.
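The cost side of this trade-off is simple arithmetic: a sliding window advances by chunk_size - chunk_overlap tokens per step, so halving the chunk size roughly doubles the number of chunks, and thus the number of extraction calls. A stdlib-only sketch with illustrative numbers (1200/100 mirror GraphRAG's defaults):

```python
def chunk_spans(n_tokens: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return the (start, end) token spans produced by a sliding window."""
    spans = []
    start = 0
    while start < n_tokens:
        spans.append((start, min(start + chunk_size, n_tokens)))
        start += chunk_size - overlap  # window advances by size minus overlap
    return spans

# A 10k-token document: smaller chunks mean more LLM extraction calls.
print(len(chunk_spans(10_000, 1200, 100)))  # 10 chunks
print(len(chunk_spans(10_000, 300, 100)))   # 50 chunks
```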
The core code for this step is in graphrag/index/operations/chunk_text/strategies.py:
## https://github.com/microsoft/graphrag/blob/main/graphrag/index/operations/chunk_text/strategies.py
## token-based chunking
def run_tokens(
input: list[str], args: dict[str, Any], tick: ProgressTicker
) -> Iterable[TextChunk]:
"""Chunks text into chunks based on encoding tokens."""
tokens_per_chunk = args.get("chunk_size", defs.CHUNK_SIZE)
chunk_overlap = args.get("chunk_overlap", defs.CHUNK_OVERLAP)
encoding_name = args.get("encoding_name", defs.ENCODING_MODEL)
enc = tiktoken.get_encoding(encoding_name)
def encode(text: str) -> list[int]:
if not isinstance(text, str):
text = f"{text}"
return enc.encode(text)
def decode(tokens: list[int]) -> str:
return enc.decode(tokens)
return _split_text_on_tokens(
input,
Tokenizer(
chunk_overlap=chunk_overlap,
tokens_per_chunk=tokens_per_chunk,
encode=encode,
decode=decode,
),
tick,
)
## Adapted from - https://github.com/langchain-ai/langchain/blob/77b359edf5df0d37ef0d539f678cf64f5557cb54/libs/langchain/langchain/text_splitter.py#L471
## So we could have better control over the chunking process
def _split_text_on_tokens(
texts: list[str], enc: Tokenizer, tick: ProgressTicker
) -> list[TextChunk]:
"""Split incoming text and return chunks."""
result = []
mapped_ids = []
for source_doc_idx, text in enumerate(texts):
encoded = enc.encode(text)
tick(1)
mapped_ids.append((source_doc_idx, encoded))
input_ids: list[tuple[int, int]] = [
(source_doc_idx, id) for source_doc_idx, ids in mapped_ids for id in ids
]
start_idx = 0
cur_idx = min(start_idx + enc.tokens_per_chunk, len(input_ids))
chunk_ids = input_ids[start_idx:cur_idx]
while start_idx < len(input_ids):
chunk_text = enc.decode([id for _, id in chunk_ids])
doc_indices = list({doc_idx for doc_idx, _ in chunk_ids})
result.append(
TextChunk(
text_chunk=chunk_text,
source_doc_indices=doc_indices,
n_tokens=len(chunk_ids),
)
)
start_idx += enc.tokens_per_chunk - enc.chunk_overlap
cur_idx = min(start_idx + enc.tokens_per_chunk, len(input_ids))
chunk_ids = input_ids[start_idx:cur_idx]
return result
## sentence-based chunking
def run_sentences(
input: list[str], _args: dict[str, Any], tick: ProgressTicker
) -> Iterable[TextChunk]:
"""Chunks text into multiple parts by sentence."""
for doc_idx, text in enumerate(input):
sentences = nltk.sent_tokenize(text)
for sentence in sentences:
yield TextChunk(
text_chunk=sentence,
source_doc_indices=[doc_idx],
)
tick(1)
Text Chunks → Element Instances #
This step extracts the characteristic elements from each chunk and builds them into element instances. These elements may include names, types, and descriptions, so that relationships between chunks can be established later. Here too GraphRAG provides two strategies:
- graph intelligence: extraction with a large language model, which is our focus
- nltk: again via the Natural Language Toolkit, using nltk.chunk.ne_chunk(tagged_tokens, binary=False)

Let's look at the first strategy in detail, starting with GraphRAG's prompt:
GRAPH_EXTRACTION_PROMPT = """
-Goal-
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.
-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)
2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_strength>)
3. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.
4. When finished, output {completion_delimiter}
######################
-Examples-
######################
Example 1:
Entity_types: ORGANIZATION,PERSON
Text:
The Verdantis's Central Institution is scheduled to meet on Monday and Thursday, with the institution planning to release its latest policy decision on Thursday at 1:30 p.m. PDT, followed by a press conference where Central Institution Chair Martin Smith will take questions. Investors expect the Market Strategy Committee to hold its benchmark interest rate steady in a range of 3.5%-3.75%.
######################
Output:
("entity"{tuple_delimiter}CENTRAL INSTITUTION{tuple_delimiter}ORGANIZATION{tuple_delimiter}The Central Institution is the Federal Reserve of Verdantis, which is setting interest rates on Monday and Thursday)
{record_delimiter}
("entity"{tuple_delimiter}MARTIN SMITH{tuple_delimiter}PERSON{tuple_delimiter}Martin Smith is the chair of the Central Institution)
{record_delimiter}
("entity"{tuple_delimiter}MARKET STRATEGY COMMITTEE{tuple_delimiter}ORGANIZATION{tuple_delimiter}The Central Institution committee makes key decisions about interest rates and the growth of Verdantis's money supply)
{record_delimiter}
("relationship"{tuple_delimiter}MARTIN SMITH{tuple_delimiter}CENTRAL INSTITUTION{tuple_delimiter}Martin Smith is the Chair of the Central Institution and will answer questions at a press conference{tuple_delimiter}9)
{completion_delimiter}
######################
Example 2:
Entity_types: ORGANIZATION
Text:
TechGlobal's (TG) stock skyrocketed in its opening day on the Global Exchange Thursday. But IPO experts warn that the semiconductor corporation's debut on the public markets isn't indicative of how other newly listed companies may perform.
TechGlobal, a formerly public company, was taken private by Vision Holdings in 2014. The well-established chip designer says it powers 85% of premium smartphones.
######################
Output:
("entity"{tuple_delimiter}TECHGLOBAL{tuple_delimiter}ORGANIZATION{tuple_delimiter}TechGlobal is a stock now listed on the Global Exchange which powers 85% of premium smartphones)
{record_delimiter}
("entity"{tuple_delimiter}VISION HOLDINGS{tuple_delimiter}ORGANIZATION{tuple_delimiter}Vision Holdings is a firm that previously owned TechGlobal)
{record_delimiter}
("relationship"{tuple_delimiter}TECHGLOBAL{tuple_delimiter}VISION HOLDINGS{tuple_delimiter}Vision Holdings formerly owned TechGlobal from 2014 until present{tuple_delimiter}5)
{completion_delimiter}
######################
Example 3:
Entity_types: ORGANIZATION,GEO,PERSON
Text:
Five Aurelians jailed for 8 years in Firuzabad and widely regarded as hostages are on their way home to Aurelia.
The swap orchestrated by Quintara was finalized when $8bn of Firuzi funds were transferred to financial institutions in Krohaara, the capital of Quintara.
The exchange initiated in Firuzabad's capital, Tiruzia, led to the four men and one woman, who are also Firuzi nationals, boarding a chartered flight to Krohaara.
They were welcomed by senior Aurelian officials and are now on their way to Aurelia's capital, Cashion.
The Aurelians include 39-year-old businessman Samuel Namara, who has been held in Tiruzia's Alhamia Prison, as well as journalist Durke Bataglani, 59, and environmentalist Meggie Tazbah, 53, who also holds Bratinas nationality.
######################
Output:
("entity"{tuple_delimiter}FIRUZABAD{tuple_delimiter}GEO{tuple_delimiter}Firuzabad held Aurelians as hostages)
{record_delimiter}
("entity"{tuple_delimiter}AURELIA{tuple_delimiter}GEO{tuple_delimiter}Country seeking to release hostages)
{record_delimiter}
("entity"{tuple_delimiter}QUINTARA{tuple_delimiter}GEO{tuple_delimiter}Country that negotiated a swap of money in exchange for hostages)
{record_delimiter}
{record_delimiter}
("entity"{tuple_delimiter}TIRUZIA{tuple_delimiter}GEO{tuple_delimiter}Capital of Firuzabad where the Aurelians were being held)
{record_delimiter}
("entity"{tuple_delimiter}KROHAARA{tuple_delimiter}GEO{tuple_delimiter}Capital city in Quintara)
{record_delimiter}
("entity"{tuple_delimiter}CASHION{tuple_delimiter}GEO{tuple_delimiter}Capital city in Aurelia)
{record_delimiter}
("entity"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}PERSON{tuple_delimiter}Aurelian who spent time in Tiruzia's Alhamia Prison)
{record_delimiter}
("entity"{tuple_delimiter}ALHAMIA PRISON{tuple_delimiter}GEO{tuple_delimiter}Prison in Tiruzia)
{record_delimiter}
("entity"{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}PERSON{tuple_delimiter}Aurelian journalist who was held hostage)
{record_delimiter}
("entity"{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}PERSON{tuple_delimiter}Bratinas national and environmentalist who was held hostage)
{record_delimiter}
("relationship"{tuple_delimiter}FIRUZABAD{tuple_delimiter}AURELIA{tuple_delimiter}Firuzabad negotiated a hostage exchange with Aurelia{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}QUINTARA{tuple_delimiter}AURELIA{tuple_delimiter}Quintara brokered the hostage exchange between Firuzabad and Aurelia{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}QUINTARA{tuple_delimiter}FIRUZABAD{tuple_delimiter}Quintara brokered the hostage exchange between Firuzabad and Aurelia{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}ALHAMIA PRISON{tuple_delimiter}Samuel Namara was a prisoner at Alhamia prison{tuple_delimiter}8)
{record_delimiter}
("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}Samuel Namara and Meggie Tazbah were exchanged in the same hostage release{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}Samuel Namara and Durke Bataglani were exchanged in the same hostage release{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}Meggie Tazbah and Durke Bataglani were exchanged in the same hostage release{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}FIRUZABAD{tuple_delimiter}Samuel Namara was a hostage in Firuzabad{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}FIRUZABAD{tuple_delimiter}Meggie Tazbah was a hostage in Firuzabad{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}FIRUZABAD{tuple_delimiter}Durke Bataglani was a hostage in Firuzabad{tuple_delimiter}2)
{completion_delimiter}
######################
-Real Data-
######################
Entity_types: {entity_types}
Text: {input_text}
######################
Output:"""
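The reply this prompt produces is a flat, delimiter-separated list of records. As a minimal sketch, here is how such a reply could be parsed into entity and relationship tuples; the delimiter strings are assumptions modeled on GraphRAG's defaults (tuple delimiter `<|>`, record delimiter `##`, completion delimiter `<|COMPLETE|>`), and the real parser in the repository is more defensive:

```python
def parse_extraction(reply: str,
                     tuple_delim: str = "<|>",
                     record_delim: str = "##",
                     completion_delim: str = "<|COMPLETE|>"):
    """Split an LLM extraction reply into entity and relationship tuples."""
    entities, relationships = [], []
    body = reply.replace(completion_delim, "")
    for record in body.split(record_delim):
        record = record.strip().strip("()")
        if not record:
            continue
        fields = [f.strip().strip('"') for f in record.split(tuple_delim)]
        if fields[0] == "entity" and len(fields) == 4:
            entities.append(tuple(fields[1:]))       # (name, type, description)
        elif fields[0] == "relationship" and len(fields) == 5:
            relationships.append(tuple(fields[1:]))  # (src, dst, description, strength)
    return entities, relationships

reply = ('("entity"<|>TECHGLOBAL<|>ORGANIZATION<|>Chip designer)##'
         '("relationship"<|>TECHGLOBAL<|>VISION HOLDINGS<|>Former owner<|>5)<|COMPLETE|>')
ents, rels = parse_extraction(reply)
```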
The model's responses are ultimately converted into a NetworkX Graph, i.e. a weighted undirected graph. At this point the graph has two parts:
- node: the graph's nodes are the characteristic entities themselves, recording the source document, name, type, and description
- edge: the graph's edges record the relationships between nodes and their weights
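The conversion can be pictured with a dict-based stand-in for the NetworkX graph, in which repeated mentions of the same entity accumulate descriptions and repeated mentions of the same pair merge into one undirected edge whose weight adds up. The merge policy here is an illustrative assumption, not GraphRAG's exact code:

```python
def build_graph(entities, relationships):
    """Merge extracted records into nodes and weighted undirected edges."""
    nodes: dict[str, dict] = {}
    edges: dict[frozenset, dict] = {}
    for name, etype, desc in entities:
        node = nodes.setdefault(name, {"type": etype, "descriptions": []})
        node["descriptions"].append(desc)
    for src, dst, desc, strength in relationships:
        key = frozenset((src, dst))        # undirected: (a, b) == (b, a)
        edge = edges.setdefault(key, {"weight": 0.0, "descriptions": []})
        edge["weight"] += float(strength)  # repeated mentions accumulate weight
        edge["descriptions"].append(desc)
    return nodes, edges

nodes, edges = build_graph(
    [("A", "PERSON", "desc1"), ("A", "PERSON", "desc2"), ("B", "ORG", "d")],
    [("A", "B", "works at", 5), ("B", "A", "employs", 3)],
)
```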
Element Instances → Element Summaries #
The previous step produced the weighted undirected graph, with relationship links between nodes. This step builds on the element instances by having the LLM summarize them further. The regeneration is needed because the earlier descriptions were generated in isolation: they are thin, ignore the context of the whole group of element instances, and may even contradict one another.
The model's newly generated descriptions replace the description text on the graph's nodes and edges.
The core of this step is again the prompt, shown below:
SUMMARIZE_PROMPT = """
You are a helpful assistant responsible for generating a comprehensive summary of the data provided below.
Given one or two entities, and a list of descriptions, all related to the same entity or group of entities.
Please concatenate all of these into a single, comprehensive description. Make sure to include information collected from all the descriptions.
If the provided descriptions are contradictory, please resolve the contradictions and provide a single, coherent summary.
Make sure it is written in third person, and include the entity names so we have the full context.
#######
-Data-
Entities: {entity_name}
Description List: {description_list}
#######
Output:
"""
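To see how this prompt gets populated, here is a minimal sketch that gathers one entity's accumulated descriptions into the Description List slot. The helper name and the joining format are illustrative assumptions, not GraphRAG's actual summarization code:

```python
SUMMARIZE_TEMPLATE = (
    "Entities: {entity_name}\n"
    "Description List: {description_list}"
)

def build_summarize_input(entity_name: str, descriptions: list[str]) -> str:
    """Fill the summarization prompt slots for one entity."""
    return SUMMARIZE_TEMPLATE.format(
        entity_name=entity_name,
        description_list=", ".join(f'"{d}"' for d in descriptions),
    )

filled = build_summarize_input(
    "TECHGLOBAL", ["Chip designer", "Listed on the Global Exchange"]
)
```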
Element Summaries → Graph Communities #
The previous two steps produced the final weighted undirected graph; this step groups its nodes to build "communities". Each Community here is an array of nodes: every community has a cluster_id (the community group's id) and a set of nodes.
This step is implemented with the graspologic package, a library of statistical and analytical algorithms for graphs. GraphRAG uses it to build a complete community hierarchy: put simply, the hierarchical_leiden method partitions the whole graph into regions of at most max_cluster_size nodes each (10 by default in GraphRAG), where each node may be an element-entity node itself or a sub-community.
## https://github.com/microsoft/graphrag/blob/main/graphrag/index/operations/cluster_graph.py
def run_layout(strategy: dict[str, Any], graph: nx.Graph) -> Communities:
"""Run layout method definition."""
if len(graph.nodes) == 0:
log.warning("Graph has no nodes")
return []
clusters: dict[int, dict[str, list[str]]] = {}
strategy_type = strategy.get("type", GraphCommunityStrategyType.leiden)
match strategy_type:
case GraphCommunityStrategyType.leiden:
clusters = run_leiden(graph, strategy)
case _:
msg = f"Unknown clustering strategy {strategy_type}"
raise ValueError(msg)
results: Communities = []
for level in clusters:
for cluster_id, nodes in clusters[level].items():
results.append((level, cluster_id, nodes))
return results
def run_leiden(
graph: nx.Graph, args: dict[str, Any]
) -> dict[int, dict[str, list[str]]]:
"""Run method definition."""
max_cluster_size = args.get("max_cluster_size", 10)
use_lcc = args.get("use_lcc", True)
if args.get("verbose", False):
log.info(
"Running leiden with max_cluster_size=%s, lcc=%s", max_cluster_size, use_lcc
)
node_id_to_community_map = _compute_leiden_communities(
graph=graph,
max_cluster_size=max_cluster_size,
use_lcc=use_lcc,
seed=args.get("seed", 0xDEADBEEF),
)
levels = args.get("levels")
## If they don't pass in levels, use them all
if levels is None:
levels = sorted(node_id_to_community_map.keys())
results_by_level: dict[int, dict[str, list[str]]] = {}
for level in levels:
result = {}
results_by_level[level] = result
for node_id, raw_community_id in node_id_to_community_map[level].items():
community_id = str(raw_community_id)
if community_id not in result:
result[community_id] = []
result[community_id].append(node_id)
return results_by_level
## Taken from graph_intelligence & adapted
def _compute_leiden_communities(
graph: nx.Graph | nx.DiGraph,
max_cluster_size: int,
use_lcc: bool,
seed=0xDEADBEEF,
) -> dict[int, dict[str, int]]:
"""Return Leiden root communities."""
if use_lcc:
graph = stable_largest_connected_component(graph)
## https://github.com/graspologic-org/graspologic/blob/main/graspologic/partition/leiden.py
community_mapping = hierarchical_leiden(
graph, max_cluster_size=max_cluster_size, random_seed=seed
)
results: dict[int, dict[str, int]] = {}
for partition in community_mapping:
results[partition.level] = results.get(partition.level, {})
results[partition.level][partition.node] = partition.cluster
return results
Graph Communities → Community Summaries #
With the final graph and its community structure in hand, the last step adds summary information for those communities. As noted above, a community's nodes can be chunk-level element instances or sub-communities, so the summaries come in two kinds: ① leaf-level community summaries and ② higher-level community summaries.
## https://github.com/microsoft/graphrag/blob/main/graphrag/prompts/index/community_report.py
COMMUNITY_REPORT_PROMPT = """
You are an AI assistant that helps a human analyst to perform general information discovery. Information discovery is the process of identifying and assessing relevant information associated with certain entities (e.g., organizations and individuals) within a network.
## Goal
Write a comprehensive report of a community, given a list of entities that belong to the community as well as their relationships and optional associated claims. The report will be used to inform decision-makers about information associated with the community and their potential impact. The content of this report includes an overview of the community's key entities, their legal compliance, technical capabilities, reputation, and noteworthy claims.
## Report Structure
The report should include the following sections:
- TITLE: community's name that represents its key entities - title should be short but specific. When possible, include representative named entities in the title.
- SUMMARY: An executive summary of the community's overall structure, how its entities are related to each other, and significant information associated with its entities.
- IMPACT SEVERITY RATING: a float score between 0-10 that represents the severity of IMPACT posed by entities within the community. IMPACT is the scored importance of a community.
- RATING EXPLANATION: Give a single sentence explanation of the IMPACT severity rating.
- DETAILED FINDINGS: A list of 5-10 key insights about the community. Each insight should have a short summary followed by multiple paragraphs of explanatory text grounded according to the grounding rules below. Be comprehensive.
Return output as a well-formed JSON-formatted string with the following format:
{{
"title": <report_title>,
"summary": <executive_summary>,
"rating": <impact_severity_rating>,
"rating_explanation": <rating_explanation>,
"findings": [
{{
"summary":<insight_1_summary>,
"explanation": <insight_1_explanation>
}},
{{
"summary":<insight_2_summary>,
"explanation": <insight_2_explanation>
}}
]
}}
## Grounding Rules
Points supported by data should list their data references as follows:
"This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]."
Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (1), Entities (5, 7); Relationships (23); Claims (7, 2, 34, 64, 46, +more)]."
where 1, 5, 7, 23, 2, 34, 46, and 64 represent the id (not the index) of the relevant data record.
Do not include information where the supporting evidence for it is not provided.
## Example Input
-----------
Text:
Entities
id,entity,description
5,VERDANT OASIS PLAZA,Verdant Oasis Plaza is the location of the Unity March
6,HARMONY ASSEMBLY,Harmony Assembly is an organization that is holding a march at Verdant Oasis Plaza
Relationships
id,source,target,description
37,VERDANT OASIS PLAZA,UNITY MARCH,Verdant Oasis Plaza is the location of the Unity March
38,VERDANT OASIS PLAZA,HARMONY ASSEMBLY,Harmony Assembly is holding a march at Verdant Oasis Plaza
39,VERDANT OASIS PLAZA,UNITY MARCH,The Unity March is taking place at Verdant Oasis Plaza
40,VERDANT OASIS PLAZA,TRIBUNE SPOTLIGHT,Tribune Spotlight is reporting on the Unity march taking place at Verdant Oasis Plaza
41,VERDANT OASIS PLAZA,BAILEY ASADI,Bailey Asadi is speaking at Verdant Oasis Plaza about the march
43,HARMONY ASSEMBLY,UNITY MARCH,Harmony Assembly is organizing the Unity March
Output:
{{
"title": "Verdant Oasis Plaza and Unity March",
"summary": "The community revolves around the Verdant Oasis Plaza, which is the location of the Unity March. The plaza has relationships with the Harmony Assembly, Unity March, and Tribune Spotlight, all of which are associated with the march event.",
"rating": 5.0,
"rating_explanation": "The impact severity rating is moderate due to the potential for unrest or conflict during the Unity March.",
"findings": [
{{
"summary": "Verdant Oasis Plaza as the central location",
"explanation": "Verdant Oasis Plaza is the central entity in this community, serving as the location for the Unity March. This plaza is the common link between all other entities, suggesting its significance in the community. The plaza's association with the march could potentially lead to issues such as public disorder or conflict, depending on the nature of the march and the reactions it provokes. [Data: Entities (5), Relationships (37, 38, 39, 40, 41,+more)]"
}},
{{
"summary": "Harmony Assembly's role in the community",
"explanation": "Harmony Assembly is another key entity in this community, being the organizer of the march at Verdant Oasis Plaza. The nature of Harmony Assembly and its march could be a potential source of threat, depending on their objectives and the reactions they provoke. The relationship between Harmony Assembly and the plaza is crucial in understanding the dynamics of this community. [Data: Entities(6), Relationships (38, 43)]"
}},
{{
"summary": "Unity March as a significant event",
"explanation": "The Unity March is a significant event taking place at Verdant Oasis Plaza. This event is a key factor in the community's dynamics and could be a potential source of threat, depending on the nature of the march and the reactions it provokes. The relationship between the march and the plaza is crucial in understanding the dynamics of this community. [Data: Relationships (39)]"
}},
{{
"summary": "Role of Tribune Spotlight",
"explanation": "Tribune Spotlight is reporting on the Unity March taking place in Verdant Oasis Plaza. This suggests that the event has attracted media attention, which could amplify its impact on the community. The role of Tribune Spotlight could be significant in shaping public perception of the event and the entities involved. [Data: Relationships (40)]"
}}
]
}}
## Real Data
Use the following text for your answer. Do not make anything up in your answer.
Text:
{input_text}
The report should include the following sections:
- TITLE: community's name that represents its key entities - title should be short but specific. When possible, include representative named entities in the title.
- SUMMARY: An executive summary of the community's overall structure, how its entities are related to each other, and significant information associated with its entities.
- IMPACT SEVERITY RATING: a float score between 0-10 that represents the severity of IMPACT posed by entities within the community. IMPACT is the scored importance of a community.
- RATING EXPLANATION: Give a single sentence explanation of the IMPACT severity rating.
- DETAILED FINDINGS: A list of 5-10 key insights about the community. Each insight should have a short summary followed by multiple paragraphs of explanatory text grounded according to the grounding rules below. Be comprehensive.
Return output as a well-formed JSON-formatted string with the following format:
{{
"title": <report_title>,
"summary": <executive_summary>,
"rating": <impact_severity_rating>,
"rating_explanation": <rating_explanation>,
"findings": [
{{
"summary":<insight_1_summary>,
"explanation": <insight_1_explanation>
}},
{{
"summary":<insight_2_summary>,
"explanation": <insight_2_explanation>
}}
]
}}
## Grounding Rules
Points supported by data should list their data references as follows:
"This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]."
Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (1), Entities (5, 7); Relationships (23); Claims (7, 2, 34, 64, 46, +more)]."
where 1, 5, 7, 23, 2, 34, 46, and 64 represent the id (not the index) of the relevant data record.
Do not include information where the supporting evidence for it is not provided.
Output:"""
Global Search #
We covered above how the knowledge base is built and layered, so querying must likewise span those layers. GraphRAG splits a query into two steps:
- Gather all community summaries, concatenate them into context, and send them to the LLM together with the query; the model scores and annotates each item over multiple calls, building the "community answers" (note: the paper splits this into two sub-steps, ① prepare community summaries and ② obtain community answers)
- Filter and rank the community answers, then concatenate them into the context in descending rank order until the model's token limit is reached, and have the model produce the final global answer

The relevant code is shown below:
## https://github.com/microsoft/graphrag/blob/main/graphrag/query/structured_search/global_search/search.py
class GlobalSearch(BaseSearch[GlobalContextBuilder]):
"""Search orchestration for global search mode."""
...
async def asearch(
self,
query: str,
conversation_history: ConversationHistory | None = None,
**kwargs: Any,
) -> GlobalSearchResult:
"""
- Step 1: Run parallel LLM calls on communities' short summaries to generate answer for each batch
"""
...
## Prepare batches of community report data table as context data for global search.
context_result = await self.context_builder.build_context(
query=query,
conversation_history=conversation_history,
**self.context_builder_params,
)
...
map_responses = await asyncio.gather(*[
## Generate answer for a single chunk of community reports.
self._map_response_single_batch(
context_data=data, query=query, **self.map_llm_params
)
for data in context_result.context_chunks
])
"""
        - Step 2: Combine the answers from step 1 to generate the final answer
"""
reduce_response = await self._reduce_response(
map_responses=map_responses,
query=query,
**self.reduce_llm_params,
)
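The filter-and-rank behaviour of the reduce step can be sketched as follows: drop zero-score points, sort the rest in descending score order, and pack descriptions into the context until a token budget is hit. The word-count "tokenizer" and all names here are illustrative stand-ins, not GraphRAG's actual implementation:

```python
def build_reduce_context(map_points: list[dict], max_tokens: int) -> str:
    """Rank map-step key points and concatenate them under a token budget."""
    kept = [p for p in map_points if p.get("score", 0) > 0]  # drop "I don't know" points
    kept.sort(key=lambda p: p["score"], reverse=True)        # highest score first
    parts, used = [], 0
    for point in kept:
        cost = len(point["description"].split())  # crude stand-in for a tokenizer
        if used + cost > max_tokens:
            break
        parts.append(point["description"])
        used += cost
    return "\n".join(parts)

points = [
    {"description": "point one about markets", "score": 80},
    {"description": "irrelevant", "score": 0},
    {"description": "point two about policy", "score": 95},
]
ctx = build_reduce_context(points, max_tokens=8)
```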
Building community answers #
Besides the code, the prompts used in these two steps are just as important.
The prompt for obtaining "community answers" is shown below; it ultimately asks the model for a score plus an explanation. Note that this is only the model's system prompt; the user prompt is the query itself.
"""System prompts for global search."""
MAP_SYSTEM_PROMPT = """
---Role---
You are a helpful assistant responding to questions about data in the tables provided.
---Goal---
Generate a response consisting of a list of key points that responds to the user's question, summarizing all relevant information in the input data tables.
You should use the data provided in the data tables below as the primary context for generating the response.
If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up.
Each key point in the response should have the following element:
- Description: A comprehensive description of the point.
- Importance Score: An integer score between 0-100 that indicates how important the point is in answering the user's question. An 'I don't know' type of response should have a score of 0.
The response should be JSON formatted as follows:
{{
"points": [
{{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}},
{{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}}
]
}}
The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".
Points supported by data should list the relevant reports as references as follows:
"This is an example sentence supported by data references [Data: Reports (report ids)]"
**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"
where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables.
Do not include information where the supporting evidence for it is not provided.
---Data tables---
{context_data}
---Goal---
Generate a response consisting of a list of key points that responds to the user's question, summarizing all relevant information in the input data tables.
You should use the data provided in the data tables below as the primary context for generating the response.
If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up.
Each key point in the response should have the following element:
- Description: A comprehensive description of the point.
- Importance Score: An integer score between 0-100 that indicates how important the point is in answering the user's question. An 'I don't know' type of response should have a score of 0.
The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".
Points supported by data should list the relevant reports as references as follows:
"This is an example sentence supported by data references [Data: Reports (report ids)]"
**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"
where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables.
Do not include information where the supporting evidence for it is not provided.
The response should be JSON formatted as follows:
{{
"points": [
{{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}},
{{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}}
]
}}
"""
Building the global answer #
The prompt for obtaining the final "global answer" is shown below. Again, this is only the model's system prompt, and the user prompt is likewise the query itself.
One more thing to note: if all the "community answers" have already been filtered out before the model call, the content of NO_DATA_ANSWER is returned directly.
"""Global Search system prompts."""
REDUCE_SYSTEM_PROMPT = """
---Role---
You are a helpful assistant responding to questions about a dataset by synthesizing perspectives from multiple analysts.
---Goal---
Generate a response of the target length and format that responds to the user's question, summarize all the reports from multiple analysts who focused on different parts of the dataset.
Note that the analysts' reports provided below are ranked in the **descending order of importance**.
If you don't know the answer or if the provided reports do not contain sufficient information to provide an answer, just say so. Do not make anything up.
The final response should remove all irrelevant information from the analysts' reports and merge the cleaned information into a comprehensive answer that provides explanations of all the key points and implications appropriate for the response length and format.
Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown.
The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".
The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process.
**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"
where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.
Do not include information where the supporting evidence for it is not provided.
---Target response length and format---
{response_type}
---Analyst Reports---
{report_data}
---Goal---
Generate a response of the target length and format that responds to the user's question, summarize all the reports from multiple analysts who focused on different parts of the dataset.
Note that the analysts' reports provided below are ranked in the **descending order of importance**.
If you don't know the answer or if the provided reports do not contain sufficient information to provide an answer, just say so. Do not make anything up.
The final response should remove all irrelevant information from the analysts' reports and merge the cleaned information into a comprehensive answer that provides explanations of all the key points and implications appropriate for the response length and format.
The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".
The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process.
**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"
where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.
Do not include information where the supporting evidence for it is not provided.
---Target response length and format---
{response_type}
Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown.
"""
NO_DATA_ANSWER = (
"I am sorry but I am unable to answer this question given the provided data."
)
Local Search #
Looking at the global search above, you can see it does not use embedding + vector DB retrieval at all; it goes straight to the LLM. The biggest difference in local search is the first step, which uses embedding + vector DB instead of the LLM:
- Embed the original query, then run a vector search over the document entities in the knowledge base. This yields the entities most similar to the query; from the communities they belong to, the chunks they contain, their relationships, and so on, the community context, local context, and Text Unit context are built
- Call the LLM with this context to generate the final answer
## https://github.com/microsoft/graphrag/blob/main/graphrag/query/structured_search/local_search/search.py
class LocalSearch(BaseSearch[LocalContextBuilder]):
"""Search orchestration for local search mode."""
async def asearch(
self,
query: str,
conversation_history: ConversationHistory | None = None,
**kwargs,
) -> SearchResult:
"""Build local search context that fits a single context window and generate answer for the user query."""
...
## Step 1
context_result = self.context_builder.build_context(
query=query,
conversation_history=conversation_history,
**kwargs,
**self.context_builder_params,
)
log.info("GENERATE ANSWER: %s. QUERY: %s", start_time, query)
## Step 2
try:
...
search_messages = [
{"role": "system", "content": search_prompt},
{"role": "user", "content": query},
]
response = await self.llm.agenerate(
messages=search_messages,
streaming=True,
callbacks=self.callbacks,
**self.llm_params,
)
Building the Community Context #
This stage splits into two main parts:
- Querying the knowledge base: a traditional embedding + vector similarity search over the document entities (entity); we won't expand on it here
- Context building:
  - Community context (Community): from the entities found by the similarity search, assemble a context focused on the communities' knowledge and summaries
  - Local context (Local): build a context containing each matched entity and its relationships in the graph, focused on the queried document entities themselves and their relations
  - Text Unit context (Text Unit): rank the matched text units and use them as context
## https://github.com/microsoft/graphrag/blob/main/graphrag/query/structured_search/local_search/mixed_context.py
class LocalSearchMixedContext(LocalContextBuilder):
"""Build data context for local search prompt combining community reports and entity/relationship/covariate tables."""
...
def build_context(...) -> ContextBuilderResult:
"""
Build data context for local search prompt.
Build a context by combining community reports and entity/relationship/covariate tables, and text units using a predefined ratio set by summary_prop.
"""
...
## Step 1 : vector similarity search
selected_entities = map_query_to_entities(
query=query,
text_embedding_vectorstore=self.entity_text_embeddings,
text_embedder=self.text_embedder,
all_entities_dict=self.entities,
embedding_vectorstore_key=self.embedding_vectorstore_key,
include_entity_names=include_entity_names,
exclude_entity_names=exclude_entity_names,
k=top_k_mapped_entities,
oversample_scaler=2,
)
...
## Step 2 : build community context by selected entities
community_context, community_context_data = self._build_community_context(
selected_entities=selected_entities,
max_tokens=community_tokens,
use_community_summary=use_community_summary,
column_delimiter=column_delimiter,
include_community_rank=include_community_rank,
min_community_rank=min_community_rank,
return_candidate_context=return_candidate_context,
context_name=community_context_name,
)
...
    ## Step 3 : build local (i.e. entity-relationship-covariate) context by selected entities
local_prop = 1 - community_prop - text_unit_prop
local_tokens = max(int(max_tokens * local_prop), 0)
local_context, local_context_data = self._build_local_context(
selected_entities=selected_entities,
max_tokens=local_tokens,
include_entity_rank=include_entity_rank,
rank_description=rank_description,
include_relationship_weight=include_relationship_weight,
top_k_relationships=top_k_relationships,
relationship_ranking_attribute=relationship_ranking_attribute,
return_candidate_context=return_candidate_context,
column_delimiter=column_delimiter,
)
...
    ## Step 4 : build text unit context by selected entities
text_unit_context, text_unit_context_data = self._build_text_unit_context(
selected_entities=selected_entities,
max_tokens=text_unit_tokens,
return_candidate_context=return_candidate_context,
)
...
## https://github.com/microsoft/graphrag/blob/main/graphrag/query/context_builder/entity_extraction.py
## Step 1
def map_query_to_entities(
query: str,
text_embedding_vectorstore: BaseVectorStore,
text_embedder: BaseTextEmbedding,
all_entities_dict: dict[str, Entity],
embedding_vectorstore_key: str = EntityVectorStoreKey.ID,
include_entity_names: list[str] | None = None,
exclude_entity_names: list[str] | None = None,
k: int = 10,
oversample_scaler: int = 2,
) -> list[Entity]:
"""Extract entities that match a given query using semantic similarity of text embeddings of query and entity descriptions."""
...
matched_entities = []
if query != "":
## get entities with highest semantic similarity to query
## oversample to account for excluded entities
search_results = text_embedding_vectorstore.similarity_search_by_text(
text=query,
text_embedder=lambda t: text_embedder.embed(t),
k=k * oversample_scaler,
)
...
...
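The `oversample_scaler` in the call above exists so that, after excluded entity names are dropped, roughly `k` usable matches remain. A minimal sketch of that filter-then-trim idea (with a hypothetical `select_entities` helper; the real function operates on `Entity` objects returned by the vector store):

```python
# Sketch of the oversample-then-filter idea behind map_query_to_entities
# (hypothetical shapes: candidates are (name, score) pairs already sorted
# best-first by similarity; the vector store was asked for
# k * oversample_scaler of them).
def select_entities(candidates: list[tuple[str, float]],
                    exclude_entity_names: set[str],
                    k: int) -> list[str]:
    # drop excluded names, then trim the oversampled list back down to k
    kept = [name for name, _ in candidates if name not in exclude_entity_names]
    return kept[:k]
```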
Getting the Local Answer #
Finally, the context built above is handed to the LLM. The snippet below is mainly the system (persona) prompt; the user prompt is just the query itself.
QUESTION_SYSTEM_PROMPT = """
---Role---
You are a helpful assistant generating a bulleted list of {question_count} questions about data in the tables provided.
---Data tables---
{context_data}
---Goal---
Given a series of example questions provided by the user, generate a bulleted list of {question_count} candidates for the next question. Use - marks as bullet points.
These candidate questions should represent the most important or urgent information content or themes in the data tables.
The candidate questions should be answerable using the data tables provided, but should not mention any specific data fields or data tables in the question text.
If the user's questions reference several named entities, then each candidate question should reference all named entities.
---Example questions---
"""
Related #
A modular graph-based Retrieval-Augmented Generation (RAG) system
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
LightRAG #
Overview #
LightRAG focuses on two main points. The first, similar to GraphRAG, is building the complex relationships and context between document entities; the second is persisting those entity relationships and contexts to improve query efficiency.
In other words, LightRAG aims to preserve the full contextual relationships between documents while maximizing retrieval efficiency and enabling fast data updates.
LightRAG can be understood as a simplified GraphRAG: it drops GraphRAG's Community layer and builds the graph directly from entities and their relationships, which greatly reduces complexity and token consumption. Unsurprisingly, this makes LightRAG faster than GraphRAG at both index construction and query time.
Index Construction #
For building the knowledge base, LightRAG's effort goes mainly into extracting and analyzing entities and relationships; everything else is kept relatively simple.
Chunking is primarily token-based; see the chunking_by_token_size method for details.
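A rough sketch of token-window chunking in this spirit (simplified: "tokens" here are whitespace-separated words, while the real `chunking_by_token_size` counts tiktoken tokens; the overlapping-window structure is the same idea):

```python
# Split text into overlapping windows of at most max_tokens "tokens",
# emitting chunk records shaped like LightRAG's (tokens, content,
# chunk_order_index). Whitespace words stand in for real tokenizer tokens.
def chunk_by_token_size(text: str, max_tokens: int = 1024,
                        overlap: int = 128) -> list[dict]:
    tokens = text.split()
    chunks, index = [], 0
    for start in range(0, len(tokens), max_tokens - overlap):
        window = tokens[start:start + max_tokens]
        chunks.append({
            "tokens": len(window),
            "content": " ".join(window),
            "chunk_order_index": index,
        })
        index += 1
        if start + max_tokens >= len(tokens):
            break
    return chunks
```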
Below we focus on parsing out entities and building the graph.
In the parsing step, each chunk is passed to the LLM, which is asked to summarize its entity, relationship, and keywords items. Each entity becomes a node in the graph, and each relationship becomes an edge.
The full prompt is as follows:
## https://github.com/HKUDS/LightRAG/blob/main/lightrag/prompt.py
PROMPTS["entity_extraction"] = """-Goal-
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.
-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, use same language as input text. If English, capitalized the name.
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>
2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)
3. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.
Format the content-level key words as ("content_keywords"{tuple_delimiter}<high_level_keywords>)
4. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.
5. When finished, output {completion_delimiter}
######################
-Examples-
######################
Example 1:
Entity_types: [person, technology, mission, organization, location]
Text:
while Alex clenched his jaw, the buzz of frustration dull against the backdrop of Taylor's authoritarian certainty. It was this competitive undercurrent that kept him alert, the sense that his and Jordan's shared commitment to discovery was an unspoken rebellion against Cruz's narrowing vision of control and order.
Then Taylor did something unexpected. They paused beside Jordan and, for a moment, observed the device with something akin to reverence. “If this tech can be understood..." Taylor said, their voice quieter, "It could change the game for us. For all of us.”
The underlying dismissal earlier seemed to falter, replaced by a glimpse of reluctant respect for the gravity of what lay in their hands. Jordan looked up, and for a fleeting heartbeat, their eyes locked with Taylor's, a wordless clash of wills softening into an uneasy truce.
It was a small transformation, barely perceptible, but one that Alex noted with an inward nod. They had all been brought here by different paths
################
Output:
("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is a character who experiences frustration and is observant of the dynamics among other characters."){record_delimiter}
("entity"{tuple_delimiter}"Taylor"{tuple_delimiter}"person"{tuple_delimiter}"Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective."){record_delimiter}
("entity"{tuple_delimiter}"Jordan"{tuple_delimiter}"person"{tuple_delimiter}"Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device."){record_delimiter}
("entity"{tuple_delimiter}"Cruz"{tuple_delimiter}"person"{tuple_delimiter}"Cruz is associated with a vision of control and order, influencing the dynamics among other characters."){record_delimiter}
("entity"{tuple_delimiter}"The Device"{tuple_delimiter}"technology"{tuple_delimiter}"The Device is central to the story, with potential game-changing implications, and is revered by Taylor."){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Taylor"{tuple_delimiter}"Alex is affected by Taylor's authoritarian certainty and observes changes in Taylor's attitude towards the device."{tuple_delimiter}"power dynamics, perspective shift"{tuple_delimiter}7){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Jordan"{tuple_delimiter}"Alex and Jordan share a commitment to discovery, which contrasts with Cruz's vision."{tuple_delimiter}"shared goals, rebellion"{tuple_delimiter}6){record_delimiter}
("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"Jordan"{tuple_delimiter}"Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce."{tuple_delimiter}"conflict resolution, mutual respect"{tuple_delimiter}8){record_delimiter}
("relationship"{tuple_delimiter}"Jordan"{tuple_delimiter}"Cruz"{tuple_delimiter}"Jordan's commitment to discovery is in rebellion against Cruz's vision of control and order."{tuple_delimiter}"ideological conflict, rebellion"{tuple_delimiter}5){record_delimiter}
("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"The Device"{tuple_delimiter}"Taylor shows reverence towards the device, indicating its importance and potential impact."{tuple_delimiter}"reverence, technological significance"{tuple_delimiter}9){record_delimiter}
("content_keywords"{tuple_delimiter}"power dynamics, ideological conflict, discovery, rebellion"){completion_delimiter}
#############################
Example 2:
Entity_types: [person, technology, mission, organization, location]
Text:
They were no longer mere operatives; they had become guardians of a threshold, keepers of a message from a realm beyond stars and stripes. This elevation in their mission could not be shackled by regulations and established protocols—it demanded a new perspective, a new resolve.
Tension threaded through the dialogue of beeps and static as communications with Washington buzzed in the background. The team stood, a portentous air enveloping them. It was clear that the decisions they made in the ensuing hours could redefine humanity's place in the cosmos or condemn them to ignorance and potential peril.
Their connection to the stars solidified, the group moved to address the crystallizing warning, shifting from passive recipients to active participants. Mercer's latter instincts gained precedence— the team's mandate had evolved, no longer solely to observe and report but to interact and prepare. A metamorphosis had begun, and Operation: Dulce hummed with the newfound frequency of their daring, a tone set not by the earthly
#############
Output:
("entity"{tuple_delimiter}"Washington"{tuple_delimiter}"location"{tuple_delimiter}"Washington is a location where communications are being received, indicating its importance in the decision-making process."){record_delimiter}
("entity"{tuple_delimiter}"Operation: Dulce"{tuple_delimiter}"mission"{tuple_delimiter}"Operation: Dulce is described as a mission that has evolved to interact and prepare, indicating a significant shift in objectives and activities."){record_delimiter}
("entity"{tuple_delimiter}"The team"{tuple_delimiter}"organization"{tuple_delimiter}"The team is portrayed as a group of individuals who have transitioned from passive observers to active participants in a mission, showing a dynamic change in their role."){record_delimiter}
("relationship"{tuple_delimiter}"The team"{tuple_delimiter}"Washington"{tuple_delimiter}"The team receives communications from Washington, which influences their decision-making process."{tuple_delimiter}"decision-making, external influence"{tuple_delimiter}7){record_delimiter}
("relationship"{tuple_delimiter}"The team"{tuple_delimiter}"Operation: Dulce"{tuple_delimiter}"The team is directly involved in Operation: Dulce, executing its evolved objectives and activities."{tuple_delimiter}"mission evolution, active participation"{tuple_delimiter}9){completion_delimiter}
("content_keywords"{tuple_delimiter}"mission evolution, decision-making, active participation, cosmic significance"){completion_delimiter}
#############################
Example 3:
Entity_types: [person, role, technology, organization, event, location, concept]
Text:
their voice slicing through the buzz of activity. "Control may be an illusion when facing an intelligence that literally writes its own rules," they stated stoically, casting a watchful eye over the flurry of data.
"It's like it's learning to communicate," offered Sam Rivera from a nearby interface, their youthful energy boding a mix of awe and anxiety. "This gives talking to strangers' a whole new meaning."
Alex surveyed his team—each face a study in concentration, determination, and not a small measure of trepidation. "This might well be our first contact," he acknowledged, "And we need to be ready for whatever answers back."
Together, they stood on the edge of the unknown, forging humanity's response to a message from the heavens. The ensuing silence was palpable—a collective introspection about their role in this grand cosmic play, one that could rewrite human history.
The encrypted dialogue continued to unfold, its intricate patterns showing an almost uncanny anticipation
#############
Output:
("entity"{tuple_delimiter}"Sam Rivera"{tuple_delimiter}"person"{tuple_delimiter}"Sam Rivera is a member of a team working on communicating with an unknown intelligence, showing a mix of awe and anxiety."){record_delimiter}
("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is the leader of a team attempting first contact with an unknown intelligence, acknowledging the significance of their task."){record_delimiter}
("entity"{tuple_delimiter}"Control"{tuple_delimiter}"concept"{tuple_delimiter}"Control refers to the ability to manage or govern, which is challenged by an intelligence that writes its own rules."){record_delimiter}
("entity"{tuple_delimiter}"Intelligence"{tuple_delimiter}"concept"{tuple_delimiter}"Intelligence here refers to an unknown entity capable of writing its own rules and learning to communicate."){record_delimiter}
("entity"{tuple_delimiter}"First Contact"{tuple_delimiter}"event"{tuple_delimiter}"First Contact is the potential initial communication between humanity and an unknown intelligence."){record_delimiter}
("entity"{tuple_delimiter}"Humanity's Response"{tuple_delimiter}"event"{tuple_delimiter}"Humanity's Response is the collective action taken by Alex's team in response to a message from an unknown intelligence."){record_delimiter}
("relationship"{tuple_delimiter}"Sam Rivera"{tuple_delimiter}"Intelligence"{tuple_delimiter}"Sam Rivera is directly involved in the process of learning to communicate with the unknown intelligence."{tuple_delimiter}"communication, learning process"{tuple_delimiter}9){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"First Contact"{tuple_delimiter}"Alex leads the team that might be making the First Contact with the unknown intelligence."{tuple_delimiter}"leadership, exploration"{tuple_delimiter}10){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Humanity's Response"{tuple_delimiter}"Alex and his team are the key figures in Humanity's Response to the unknown intelligence."{tuple_delimiter}"collective action, cosmic significance"{tuple_delimiter}8){record_delimiter}
("relationship"{tuple_delimiter}"Control"{tuple_delimiter}"Intelligence"{tuple_delimiter}"The concept of Control is challenged by the Intelligence that writes its own rules."{tuple_delimiter}"power dynamics, autonomy"{tuple_delimiter}7){record_delimiter}
("content_keywords"{tuple_delimiter}"first contact, control, communication, cosmic significance"){completion_delimiter}
#############################
-Real Data-
######################
Entity_types: {entity_types}
Text: {input_text}
######################
Output:
"""
As you can see, this prompt is quite similar to GraphRAG's. LightRAG then parses the entity and relationship records out of the output and writes them to storage.
An entity mainly contains a name, a type, a description, and a source_id; a relationship mainly contains src_id, tgt_id, a weight, a description, keywords, and a source_id.
Here source_id always refers to the chunk key, i.e. the key of the chunk the record was extracted from, while src_id and tgt_id are entity_name values.
From these records the knowledge base's graph can largely be constructed; that is essentially all of LightRAG's index-building stage.
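The delimiter-based record format produced by the extraction prompt can be parsed with a few string splits. A minimal sketch (the delimiter values here are placeholders; the real ones are injected via `{tuple_delimiter}`/`{record_delimiter}`, and the real parser also handles content_keywords and malformed records):

```python
# Parse ("entity"<|>...) / ("relationship"<|>...) records separated by a
# record delimiter into the entity/relationship dicts described above.
TUPLE_DELIM = "<|>"    # placeholder for {tuple_delimiter}
RECORD_DELIM = "##"    # placeholder for {record_delimiter}

def parse_extraction(output: str) -> tuple[list[dict], list[dict]]:
    entities, relationships = [], []
    for record in output.split(RECORD_DELIM):
        record = record.strip().strip("()")
        if not record:
            continue
        fields = [f.strip().strip('"') for f in record.split(TUPLE_DELIM)]
        if fields[0] == "entity" and len(fields) == 4:
            # ("entity"<|>name<|>type<|>description)
            entities.append({"name": fields[1], "type": fields[2],
                             "description": fields[3]})
        elif fields[0] == "relationship" and len(fields) == 6:
            # ("relationship"<|>src<|>tgt<|>description<|>keywords<|>strength)
            relationships.append({"src_id": fields[1], "tgt_id": fields[2],
                                  "description": fields[3],
                                  "keywords": fields[4],
                                  "weight": float(fields[5])})
    return entities, relationships
```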
Query #
LightRAG provides four query modes: ① local ② global ③ hybrid ④ naive.
Mode ④ is classic RAG retrieval: use the query to fetch the top-K chunks, then pass the query and chunks to the LLM for the final result.
This section focuses on the first three modes. They are broadly similar and can be summarized in the following three steps:
-
Call the LLM to extract keywords from the query, with the prompt below. LightRAG splits keywords into two classes:
- low_level_keywords: very specific keywords that can map to concrete entities
- high_level_keywords: broad keywords covering overall concepts and themes
## https://github.com/HKUDS/LightRAG/blob/main/lightrag/prompt.py
PROMPTS["keywords_extraction"] = """---Role---
You are a helpful assistant tasked with identifying both high-level and low-level keywords in the user's query.
---Goal---
Given the query, list both high-level and low-level keywords. High-level keywords focus on overarching concepts or themes, while low-level keywords focus on specific entities, details, or concrete terms.
---Instructions---
- Output the keywords in JSON format.
- The JSON should have two keys:
- "high_level_keywords" for overarching concepts or themes.
- "low_level_keywords" for specific entities or details.
######################
-Examples-
######################
Example 1:
Query: "How does international trade influence global economic stability?"
################
Output:
{{
"high_level_keywords": ["International trade", "Global economic stability", "Economic impact"],
"low_level_keywords": ["Trade agreements", "Tariffs", "Currency exchange", "Imports", "Exports"]
}}
#############################
Example 2:
Query: "What are the environmental consequences of deforestation on biodiversity?"
################
Output:
{{
"high_level_keywords": ["Environmental consequences", "Deforestation", "Biodiversity loss"],
"low_level_keywords": ["Species extinction", "Habitat destruction", "Carbon emissions", "Rainforest", "Ecosystem"]
}}
#############################
Example 3:
Query: "What is the role of education in reducing poverty?"
################
Output:
{{
"high_level_keywords": ["Education", "Poverty reduction", "Socioeconomic development"],
"low_level_keywords": ["School access", "Literacy rates", "Job training", "Income inequality"]
}}
#############################
-Real Data-
######################
Query: {query}
######################
Output:
"""
-
Build the context from the keywords. All three modes produce the same context format; see the code below.
- local mode: vector-search the top-K entity records with the extracted low_level_keywords, then use the matched entities to fetch their related relations and original chunks, and assemble the overall context from these
- global mode: vector-search the top-K relation records with the extracted high_level_keywords, then use the matched relations to fetch their related entities and original chunks, and assemble the overall context from these
- hybrid mode: use the extracted low_level_keywords and high_level_keywords respectively to obtain the local-mode and global-mode contexts, then combine the entities, relations, and chunks from both to build the final context
## https://github.com/HKUDS/LightRAG/blob/main/lightrag/operate.py
context = f"""
-----Entities-----
```csv
{entities_context}
```
-----Relationships-----
```csv
{relations_context}
```
-----Sources-----
```csv
{text_units_context}
```
"""
-
Call the LLM with the context data built in step 2 to get the final result. The context data is spliced into the system (persona) prompt, and the user prompt is just the query itself. The persona prompt template is as follows:
## https://github.com/HKUDS/LightRAG/blob/main/lightrag/prompt.py
PROMPTS["rag_response"] = """---Role---
You are a helpful assistant responding to questions about data in the tables provided.
---Goal---
Generate a response of the target length and format that responds to the user's question, summarizing all information in the input data tables appropriate for the response length and format, and incorporating any relevant general knowledge.
If you don't know the answer, just say so. Do not make anything up.
Do not include information where the supporting evidence for it is not provided.
---Target response length and format---
{response_type}
---Data tables---
{context_data}
Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown.
"""
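The hybrid combination described above boils down to merging the two result sets and de-duplicating them while keeping their ranking. A minimal sketch (items keyed by a hypothetical `id` field; the real code merges full entity/relation/chunk tables):

```python
# Merge local-mode (entity-first) and global-mode (relation-first) results,
# keeping first-seen order and dropping duplicates by id.
def merge_contexts(local_items: list[dict], global_items: list[dict]) -> list[dict]:
    seen, merged = set(), []
    for item in local_items + global_items:
        if item["id"] not in seen:
            seen.add(item["id"])
            merged.append(item)
    return merged
```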
Persistence #
As mentioned earlier, LightRAG persists the whole graph to improve query efficiency; let's finish by looking at how it does so.
LightRAG has three main kinds of data to store:
- KV data: stores the raw documents and chunks as key-value pairs, and also caches LLM results. So far LightRAG supports an in-memory KV cache and Oracle.
- Vector data: stores the vectors for chunks, entities, and relations. So far LightRAG supports nano-vectordb and Oracle.
- Graph data: stores the entire graph structure. So far LightRAG supports NetworkX, Neo4J, and Oracle.
Here we mainly look at the Oracle table schema:
## https://github.com/HKUDS/LightRAG/blob/main/lightrag/kg/oracle_impl.py
TABLES = {
## full doc
"LIGHTRAG_DOC_FULL": {
"ddl": """CREATE TABLE LIGHTRAG_DOC_FULL (
id varchar(256)PRIMARY KEY,
workspace varchar(1024),
doc_name varchar(1024),
content CLOB,
meta JSON,
createtime TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updatetime TIMESTAMP DEFAULT NULL
)"""
},
## chunk text
"LIGHTRAG_DOC_CHUNKS": {
"ddl": """CREATE TABLE LIGHTRAG_DOC_CHUNKS (
id varchar(256) PRIMARY KEY,
workspace varchar(1024),
full_doc_id varchar(256),
chunk_order_index NUMBER,
tokens NUMBER,
content CLOB,
content_vector VECTOR,
createtime TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updatetime TIMESTAMP DEFAULT NULL
)"""
},
## entity
"LIGHTRAG_GRAPH_NODES": {
"ddl": """CREATE TABLE LIGHTRAG_GRAPH_NODES (
id NUMBER GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
workspace varchar(1024),
name varchar(2048),
entity_type varchar(1024),
description CLOB,
source_chunk_id varchar(256),
content CLOB,
content_vector VECTOR,
createtime TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updatetime TIMESTAMP DEFAULT NULL
)"""
},
## relation
"LIGHTRAG_GRAPH_EDGES": {
"ddl": """CREATE TABLE LIGHTRAG_GRAPH_EDGES (
id NUMBER GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
workspace varchar(1024),
source_name varchar(2048),
target_name varchar(2048),
weight NUMBER,
keywords CLOB,
description CLOB,
source_chunk_id varchar(256),
content CLOB,
content_vector VECTOR,
createtime TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updatetime TIMESTAMP DEFAULT NULL
)"""
},
## cache
"LIGHTRAG_LLM_CACHE": {
"ddl": """CREATE TABLE LIGHTRAG_LLM_CACHE (
id varchar(256) PRIMARY KEY,
send clob,
return clob,
model varchar(1024),
createtime TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updatetime TIMESTAMP DEFAULT NULL
)"""
},
## graph
"LIGHTRAG_GRAPH": {
"ddl": """CREATE OR REPLACE PROPERTY GRAPH lightrag_graph
VERTEX TABLES (
lightrag_graph_nodes KEY (id)
LABEL entity
PROPERTIES (id,workspace,name) -- ,entity_type,description,source_chunk_id)
)
EDGE TABLES (
lightrag_graph_edges KEY (id)
SOURCE KEY (source_name) REFERENCES lightrag_graph_nodes(name)
DESTINATION KEY (target_name) REFERENCES lightrag_graph_nodes(name)
LABEL has_relation
PROPERTIES (id,workspace,source_name,target_name) -- ,weight, keywords,description,source_chunk_id)
) OPTIONS(ALLOW MIXED PROPERTY TYPES)"""
},
}
Related #
“LightRAG: Simple and Fast Retrieval-Augmented Generation”
LightRAG: Simple and Fast Retrieval-Augmented Generation
RAPTOR RAG #
Overview #
RAPTOR is a RAG method proposed by Stanford in January. The core idea is to cluster the chunks, have the LLM summarize each cluster, and then repeat the process, building a tree bottom-up.
Index Construction #
As described above, RAPTOR treats the chunks as leaf nodes, then clusters → summarizes → clusters → summarizes → ... up to the root node. The overall idea is not hard to follow, so let's go straight to the code.
## https://github.com/parthsarthi03/raptor/blob/master/raptor/cluster_tree_builder.py
class ClusterTreeBuilder(TreeBuilder):
def construct_tree(
self,
current_level_nodes: Dict[int, Node],
all_tree_nodes: Dict[int, Node],
layer_to_nodes: Dict[int, List[Node]],
use_multithreading: bool = False,
) -> Dict[int, Node]:
logging.info("Using Cluster TreeBuilder")
next_node_index = len(all_tree_nodes)
## summarize nodes in cluster
def process_cluster(
cluster, new_level_nodes, next_node_index, summarization_length, lock
):
node_texts = get_text(cluster)
## summarize the cluster's node texts with the LLM
summarized_text = self.summarize(
context=node_texts,
max_tokens=summarization_length,
)
logging.info(
f"Node Texts Length: {len(self.tokenizer.encode(node_texts))}, Summarized Text Length: {len(self.tokenizer.encode(summarized_text))}"
)
__, new_parent_node = self.create_node(
next_node_index, summarized_text, {node.index for node in cluster}
)
with lock:
new_level_nodes[next_node_index] = new_parent_node
## num of layers is 5
for layer in range(self.num_layers):
new_level_nodes = {}
logging.info(f"Constructing Layer {layer}")
node_list_current_layer = get_node_list(current_level_nodes)
if len(node_list_current_layer) <= self.reduction_dimension + 1:
self.num_layers = layer
logging.info(
f"Stopping Layer construction: Cannot Create More Layers. Total Layers in tree: {layer}"
)
break
## get clusters
clusters = self.clustering_algorithm.perform_clustering(
node_list_current_layer,
self.cluster_embedding_model,
reduction_dimension=self.reduction_dimension,
**self.clustering_params,
)
lock = Lock()
summarization_length = self.summarization_length
logging.info(f"Summarization Length: {summarization_length}")
if use_multithreading:
with ThreadPoolExecutor() as executor:
for cluster in clusters:
executor.submit(
process_cluster,
cluster,
new_level_nodes,
next_node_index,
summarization_length,
lock,
)
next_node_index += 1
executor.shutdown(wait=True)
else:
for cluster in clusters:
process_cluster(
cluster,
new_level_nodes,
next_node_index,
summarization_length,
lock,
)
next_node_index += 1
layer_to_nodes[layer + 1] = list(new_level_nodes.values())
current_level_nodes = new_level_nodes
all_tree_nodes.update(new_level_nodes)
tree = Tree(
all_tree_nodes,
layer_to_nodes[layer + 1],
layer_to_nodes[0],
layer + 1,
layer_to_nodes,
)
return current_level_nodes
The summarization prompt here is simple: it just hands the nodes to the LLM and asks for a summary, so we won't list it separately.
Query #
This part is also fairly simple. RAPTOR provides two retrieval modes: ① retrieve over all nodes in the tree ② retrieve over all nodes of one specific layer. Mode ① is the default.
Concretely, retrieval first obtains the embedding of the query and of all candidate nodes, then computes the distance between the query embedding and each node embedding, sorts the results, and takes the top_k nodes as the context.
## https://github.com/parthsarthi03/raptor/blob/master/raptor/tree_retriever.py
class TreeRetriever(BaseRetriever):
def retrieve(
self,
query: str,
start_layer: int = None,
num_layers: int = None,
top_k: int = 10,
max_tokens: int = 3500,
collapse_tree: bool = True,
return_layer_information: bool = False,
) -> str:
...
if collapse_tree:
## default
## Retrieves the most relevant nodes from all tree nodes based on the query
logging.info(f"Using collapsed_tree")
selected_nodes, context = self.retrieve_information_collapse_tree(
query, top_k, max_tokens
)
else:
## Retrieves the most relevant nodes from a specific layer's nodes based on the query
layer_nodes = self.tree.layer_to_nodes[start_layer]
selected_nodes, context = self.retrieve_information(
layer_nodes, query, num_layers
)
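The collapsed-tree retrieval above is essentially a similarity top-k over every node in the tree. A minimal sketch (assuming pre-computed node embeddings as plain float lists; the real `TreeRetriever` additionally enforces the `max_tokens` budget while collecting nodes):

```python
import math

# Rank every node in the (collapsed) tree by cosine similarity to the
# query embedding and return the indices of the top_k best matches.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve_collapsed(query_emb: list[float],
                       node_embs: list[list[float]],
                       top_k: int) -> list[int]:
    ranked = sorted(range(len(node_embs)),
                    key=lambda i: cosine(query_emb, node_embs[i]),
                    reverse=True)
    return ranked[:top_k]
```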
Related #
The official implementation of RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
Extensions #
Contextual Retrieval #
This is an improvement on traditional RAG proposed by Anthropic.
The post notes that the long-context capability of today's LLMs has improved substantially, so a simpler, more brute-force RAG approach is to prepend to each chunk an LLM-generated summary situating it within the full document, giving every chunk document-level context. This greatly reduces the poor retrieval caused by chunks lacking context. However, this approach still cannot capture relationships between the original documents.
The prompt used to summarize a chunk is as follows:
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
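The indexing side of this approach can be sketched in a few lines (a minimal sketch; `generate_context` is a hypothetical stand-in for the LLM call using the prompt above):

```python
# For each chunk, prepend an LLM-generated document-level context and
# return the strings that should be embedded instead of the raw chunks.
def contextualize_chunks(document: str, chunks: list[str],
                         generate_context) -> list[str]:
    contextualized = []
    for chunk in chunks:
        # generate_context(document, chunk) -> the "short succinct context"
        ctx = generate_context(document, chunk)
        contextualized.append(f"{ctx}\n{chunk}")
    return contextualized
```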
Introducing Contextual Retrieval
RAG Evaluation #
RAGChecker is an open-source library Amazon released in June that can evaluate the overall performance and efficiency of a RAG pipeline; take a look if you're interested.
RAGChecker: A Fine-grained Framework For Diagnosing RAG
Summary #
To wrap up: GraphRAG currently looks like a very comprehensive and complete solution, but its token consumption during both index construction and querying is far higher than the other methods', and its queries are slow.
This has spawned many RAG variants that optimize on top of GraphRAG, and LightRAG is one of the better ones. From LightRAG's design you can clearly see the improvement in both token consumption and retrieval efficiency.
RAPTOR sees relatively little use, and its overall design is not very complex. As for Anthropic's Contextual Retrieval, it does quite a bit of optimization at the chunk level; if you only have a few documents, it is a perfectly good approach, simple and direct.
Which method to choose ultimately depends on the scale of your documents. If it is small, Contextual Retrieval may well be enough: just pass the model as much context as you can. As the scale grows, LightRAG becomes worth considering. For enterprise-grade document bases maintained and developed by dedicated engineers, GraphRAG is of course the best choice, but the maintenance and resource costs are just as obvious.
至于我们最终要选那种方式,需要评估文档的规模。如果比较小,那上下文检索甚至都够了,尽量给大模型传递更多的上下文就行;如果规模进一步上升就可以考虑 LightRAG;如果是企业级的一些文档库,且会有专人去维护开发,那选择 GraphRAG 当然是最好的,但是维护和资源成本也显而易见。