Introduction #
GraphRAG is a well-known RAG algorithm, and LightRAG can be seen as a younger, lighter sibling of GraphRAG, focusing on simplicity, efficiency, and persistence. For example:

- LightRAG removes the concept of `Community` from GraphRAG, focusing only on `Entity`, `Relation`, and the `Graph` itself, which greatly simplifies the knowledge base construction process.
- LightRAG provides persistence mechanisms, dividing storage into kv storage, vector storage, and graph storage, and supports different storage backends such as `oracle`, `chroma`, `neo4j`, etc.
- LightRAG also streamlines querying: it performs vector retrieval first, then retrieves related `Entity` or `Relation` via graph search to construct the full context, so that only one call to the large model is needed to generate a response.
This article mainly aims to introduce how TiDB Vector integrates with LightRAG.
LightRAG Process #
Let’s first review the overall process of LightRAG, mainly from the perspectives of index construction and querying.
For the details, see the paper “LightRAG: Simple and Fast Retrieval-Augmented Generation”; the complete source code is available in the project's repository.
If you don’t quite understand GraphRAG, let’s first briefly explain `Chunk`, `Entity`, and `Relation`:

- A `Chunk` is simply a slice of the complete document. LightRAG mainly supports token-based slicing, though that is far from sufficient for all documents; many markdown documents could even benefit from special structure-aware slicing.
- An `Entity` can be understood as a keyword in a `Chunk`, or some other key information, high-frequency vocabulary, etc. `Entity` is also divided into different types. The entities here are summarized with the help of large models; the specific prompts can be found in the lightrag/prompt.py code.
- Finally, there is `Relation`. As mentioned above, building a graph on top of `Entity` requires relationships between the various nodes, which is exactly what `Relation` captures. Likewise, `Relation` is summarized and extracted by large models.
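To make these three concepts concrete, here is a minimal sketch of them as plain Python data structures. These are illustrative classes, not LightRAG's actual ones; the field names simply mirror the table columns used later in this article.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    full_doc_id: str        # which document this slice came from
    chunk_order_index: int  # position of the slice within the document
    tokens: int             # token count of the slice
    content: str

@dataclass
class Entity:
    name: str
    entity_type: str        # e.g. "person", "organization", ...
    description: str        # summarized by the large model
    source_chunk_id: str    # the Chunk it was extracted from

@dataclass
class Relation:
    source_name: str        # one Entity endpoint
    target_name: str        # the other endpoint (the edge is undirected)
    weight: float
    keywords: str
    description: str        # summarized by the large model
    source_chunk_id: str
```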
Now that we understand these concepts, let’s look at the specific process.
The index construction process can be roughly divided into the following steps:

- First, slice the document into `Chunks`. This process is relatively simple, mainly using token-based slicing.
- Then, extract `Entity` from these `Chunks`. This step requires the help of large models, which analyze the content of each `Chunk` and extract key information.
- After extracting `Entity`, analyze the relationships between these `Entity`, i.e., `Relation`. This also requires large model assistance.
- Finally, store all of this information:
  - Store the original `Chunks`
  - Store the vector representations of these `Chunks`
  - Store the extracted `Entity` information
  - Store the `Relation` information between `Entity`
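The steps above can be sketched roughly as follows. The chunker splits on whitespace purely for illustration (LightRAG slices by real tokenizer tokens with overlap), and `llm_extract` is a hypothetical stand-in for the LLM-driven entity/relation extraction:

```python
# A minimal sketch of the indexing pipeline, assuming a hypothetical
# llm_extract(chunk) -> (entities, relations) backed by a large model.

def chunk_by_tokens(text, max_tokens=64, overlap=8):
    tokens = text.split()  # illustration only; real slicing uses a tokenizer
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break
        start += max_tokens - overlap  # slide the window, keeping an overlap
    return chunks

def build_index(doc, llm_extract):
    chunks = chunk_by_tokens(doc)
    entities, relations = [], []
    for content in chunks:
        ents, rels = llm_extract(content)  # one extraction call per chunk
        entities.extend(ents)
        relations.extend(rels)
    # ...then persist the chunks (plus their vectors), entities, and relations
    return chunks, entities, relations
```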
The query process is relatively simple:

- First, convert the query into a vector representation.
- Then use this vector to find similar `Chunks`.
- Extract `Entity` from these `Chunks`.
- Use these `Entity` to find related `Entity` or `Relation` in the graph.
- Finally, use all of this information to generate a response.
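End to end, the query flow can be sketched like this, where `embed`, `vector_search_chunks`, `related_entities`, `graph_expand`, and `llm` are hypothetical stand-ins for the embedding model, vector store, graph store, and large model:

```python
def lightrag_style_query(question, embed, vector_search_chunks,
                         related_entities, graph_expand, llm):
    query_vec = embed(question)               # 1. query -> vector
    chunks = vector_search_chunks(query_vec)  # 2. similar Chunks
    entities = related_entities(chunks)       # 3. Entities in those Chunks
    neighbors = graph_expand(entities)        # 4. related Entity/Relation in the graph
    context = {"chunks": chunks, "entities": entities, "graph": neighbors}
    return llm(question, context)             # 5. a single LLM call at the end
```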
Table Structure #
I divided the tables into four categories, mapping onto the kv storage, vector storage, and graph storage mentioned earlier. The tables are introduced in order below.

The LIGHTRAG_DOC_FULL table is mainly used to store the original document. In LightRAG, it serves as kv storage.
CREATE TABLE LIGHTRAG_DOC_FULL (
`id` BIGINT PRIMARY KEY AUTO_RANDOM,
`doc_id` VARCHAR(256) NOT NULL,
    `workspace` VARCHAR(1024),
`content` LONGTEXT,
`meta` JSON,
`createtime` TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
`updatetime` TIMESTAMP DEFAULT NULL,
UNIQUE KEY (`doc_id`)
);
The LIGHTRAG_DOC_CHUNKS table is mainly used to store the slices of the original document, i.e., `Chunk`.

In LightRAG, kv storage holds the slices themselves, and vector storage holds their vector data.
CREATE TABLE LIGHTRAG_DOC_CHUNKS (
`id` BIGINT PRIMARY KEY AUTO_RANDOM,
`chunk_id` VARCHAR(256) NOT NULL,
`full_doc_id` VARCHAR(256) NOT NULL,
    `workspace` VARCHAR(1024),
`chunk_order_index` INT,
`tokens` INT,
`content` LONGTEXT,
`content_vector` VECTOR,
`createtime` DATETIME DEFAULT CURRENT_TIMESTAMP,
`updatetime` DATETIME DEFAULT NULL,
UNIQUE KEY (`chunk_id`)
);
The LIGHTRAG_GRAPH_NODES table is mainly used to store the parsed `Entity`.

In LightRAG, vector storage holds the content and vector of each `Entity`, while graph storage holds its in-graph information, such as `entity_type` and `description`, as node attributes.

It’s also worth noting that LightRAG deduplicates and merges `Entity` before storage, ensuring there are no identical nodes when building the graph.
CREATE TABLE LIGHTRAG_GRAPH_NODES (
`id` BIGINT PRIMARY KEY AUTO_RANDOM,
`entity_id` VARCHAR(256),
    `workspace` VARCHAR(1024),
`name` VARCHAR(2048),
`entity_type` VARCHAR(1024),
`description` LONGTEXT,
`source_chunk_id` VARCHAR(256),
`content` LONGTEXT,
`content_vector` VECTOR,
`createtime` DATETIME DEFAULT CURRENT_TIMESTAMP,
`updatetime` DATETIME DEFAULT NULL,
KEY (`entity_id`)
);
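The deduplication and merging step mentioned above can be illustrated with a small sketch that keeps one node per entity name and concatenates distinct descriptions. This is a simplification; LightRAG's actual merge policy (and its LLM-assisted summarization of long descriptions) lives in its own code.

```python
def merge_entities(entities):
    # entities: dicts with "name", "entity_type", "description".
    # Keep exactly one node per name; concatenate distinct descriptions
    # so information from duplicate extractions is not lost.
    merged = {}
    for ent in entities:
        existing = merged.get(ent["name"])
        if existing is None:
            merged[ent["name"]] = dict(ent)
        elif ent["description"] not in existing["description"]:
            existing["description"] += " | " + ent["description"]
    return list(merged.values())
```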
The LIGHTRAG_GRAPH_EDGES table is mainly used to store `Relation`.

Similarly, in LightRAG, vector storage holds the content and vector of each `Relation`, while graph storage holds its in-graph information, such as `weight`, `keywords`, and `description`, as edge attributes.

It’s also worth noting that GraphRAG and LightRAG ultimately build a weighted undirected graph, so in this table each combination of `source_name` and `target_name` appears only once. This deduplication and merging is likewise done by LightRAG in code.
As for why a weighted undirected graph is chosen: my personal guess is that RAG retrieval mainly cares about the connections between nodes (semantic relevance) and their weights (strength of association), rather than edge direction. This way only the relationship itself and its weight need to be considered, without worrying about which node is the source and which is the target; being undirected also reduces the structural complexity.
CREATE TABLE LIGHTRAG_GRAPH_EDGES (
`id` BIGINT PRIMARY KEY AUTO_RANDOM,
`relation_id` VARCHAR(256),
    `workspace` VARCHAR(1024),
`source_name` VARCHAR(2048),
`target_name` VARCHAR(2048),
    `weight` DECIMAL(10,2),
`keywords` TEXT,
`description` LONGTEXT,
    `source_chunk_id` VARCHAR(256),
`content` LONGTEXT,
`content_vector` VECTOR,
`createtime` DATETIME DEFAULT CURRENT_TIMESTAMP,
`updatetime` DATETIME DEFAULT NULL,
KEY (`relation_id`)
);
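Because the graph is undirected, (A, B) and (B, A) denote the same edge. One common way to guarantee that each `source_name`/`target_name` combination appears only once is to normalize the endpoint pair, e.g. by sorting it. This is an illustrative sketch of the idea, not LightRAG's exact code:

```python
def edge_key(source_name, target_name):
    # an undirected edge is identified by its unordered endpoint pair
    return tuple(sorted((source_name, target_name)))

def merge_edges(edges):
    # edges: (source_name, target_name, weight) triples; duplicate
    # (A, B) / (B, A) pairs are merged here by summing their weights
    merged = {}
    for src, tgt, weight in edges:
        key = edge_key(src, tgt)
        merged[key] = merged.get(key, 0.0) + weight
    return merged
```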
SQL and Usage #
Earlier, it was mentioned that there are two main parts: index construction and querying. The index construction part is relatively simple, mainly involving CRUD operations, so we’ll focus on the querying part.
The query process first performs vector retrieval and then walks the graph to find related `Entity` and `Relation`.

Let’s look at the vector retrieval part first. The following SQL is just an example and can be optimized and rewritten for specific scenarios:
## Search for Entity/Node vectors.
## VEC_COSINE_DISTANCE returns a distance (smaller = more similar),
## so convert it to a similarity before applying the threshold.
SELECT n.name AS entity_name
FROM (
    SELECT entity_id AS id,
           name,
           (1 - VEC_COSINE_DISTANCE(content_vector, :embedding_string)) AS similarity
    FROM LIGHTRAG_GRAPH_NODES
    WHERE workspace = :workspace
) n
WHERE n.similarity > :better_than_threshold
ORDER BY n.similarity DESC
LIMIT :top_k;
## Search for Relation/Edge vectors
SELECT e.source_name AS src_id,
       e.target_name AS tgt_id
FROM (
    SELECT source_name,
           target_name,
           (1 - VEC_COSINE_DISTANCE(content_vector, :embedding_string)) AS similarity
    FROM LIGHTRAG_GRAPH_EDGES
    WHERE workspace = :workspace
) e
WHERE e.similarity > :better_than_threshold
ORDER BY e.similarity DESC
LIMIT :top_k;
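To make the intended semantics of these queries explicit, here is a pure-Python model of thresholded vector search, where a higher similarity (1 minus cosine distance) is better. `search_nodes` is an illustrative helper, not a LightRAG function:

```python
import math

def vec_cosine_distance(a, b):
    # cosine distance: 0 means identical direction, larger means less similar
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1 - dot / norm

def search_nodes(rows, query_vec, better_than_threshold, top_k):
    # rows: (name, content_vector) pairs. Keep rows whose similarity
    # beats the threshold, best matches first, at most top_k results.
    scored = [(name, 1 - vec_cosine_distance(vec, query_vec))
              for name, vec in rows]
    kept = [s for s in scored if s[1] > better_than_threshold]
    kept.sort(key=lambda s: s[1], reverse=True)
    return [name for name, _ in kept[:top_k]]
```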
Next, we need to query the detailed information behind the vector retrieval results and obtain the degree of each `Entity` or `Relation`:

- The degree of an `Entity` is the number of edges connected to that node.
- The degree of a `Relation` is the sum of the degrees of the two `Entity` nodes that form the `Relation`.
For the node degree, only one SQL statement is needed, which works because the graph is undirected:

SELECT COUNT(id) AS cnt
FROM LIGHTRAG_GRAPH_EDGES
WHERE workspace = :workspace
  AND :name IN (source_name, target_name);
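The same degree computation can be modeled in Python over an in-memory edge list, which also shows how a `Relation`'s degree is derived from its two endpoints (the helper names are illustrative):

```python
def node_degree(edges, name):
    # edges: (source_name, target_name) pairs of an undirected graph;
    # mirrors COUNT(*) ... WHERE :name IN (source_name, target_name)
    return sum(1 for src, tgt in edges if name in (src, tgt))

def relation_degree(edges, src, tgt):
    # the degree of a Relation is the sum of its two endpoints' degrees
    return node_degree(edges, src) + node_degree(edges, tgt)
```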
At this point, we have obtained the content and degrees of the `Entity` or `Relation` returned by vector retrieval. The next step is to query the related `Relation` or `Entity` based on these results:

- For each `Entity` obtained from vector retrieval, query its related edges (i.e., `Relation`).
- For each `Relation` obtained from vector retrieval, take its `src_id` and `tgt_id` and query the corresponding nodes (i.e., `Entity`).
Finally, we need to query the `Chunks` related to these `Relation` or `Entity`, which involves some simple SQL queries that I won’t elaborate on here. Interested readers can download the source code and run it.
Conclusion #
Obviously, from LightRAG’s perspective, TiDB has significant advantages over storage engines that handle only relational or only vector data.
Regarding the graph part, I searched for TiGraph and related operators, but there are relatively few concrete usage examples, so I haven’t dug deeper yet. See: TiGraph: 8,700x Computing Performance Achieved by Combining Graphs + the RDBMS Syntax.
From my current understanding of TiDB and TiDB Vector, for undirected graphs a single TiDB instance is basically sufficient to handle persistence for most GraphRAG-style scenarios and their derivatives. Beyond that, only the commercially mature Oracle or the community-driven PostgreSQL can compete with it.
Finally, I recommend pingcap/autoflow, TiDB’s open-source project based on GraphRAG. I’ve been too busy lately and will take a closer look when I have time.

pingcap/autoflow is a Graph RAG based conversational knowledge base tool built with TiDB Serverless Vector Storage. Demo: https://tidb.ai