
Building a LightRAG Knowledge Base with TiDB Vector

·1445 words·7 mins
RAG LLM AI TiDB Engineering Practice
Weaxs

Introduction

GraphRAG is a well-known RAG algorithm, and LightRAG can be seen as a younger, lighter sibling of it, focused on reducing weight, improving efficiency, and adding persistence. For example:

  • LightRAG removes the concept of Community from GraphRAG, keeping only Entity, Relation, and the graph itself, which greatly simplifies the knowledge base construction process
  • LightRAG provides persistence mechanisms, dividing storage into kv storage, vector storage, and graph storage, and supports backends such as Oracle, Chroma, and Neo4j
  • LightRAG also streamlines querying: it first performs vector retrieval, then retrieves related Entities or Relations via graph search to construct the context, and finally needs only one call to the large model to generate a response

This article mainly aims to introduce how TiDB Vector integrates with LightRAG.

LightRAG Process

Let’s first review the overall process of LightRAG, mainly from the perspectives of index construction and querying. The full details are in the LightRAG paper and source code:

[Figure: LightRAG overall process (LightRAG.png)]

If you are not familiar with GraphRAG, let’s first briefly explain Chunk, Entity, and Relation:

  • A Chunk is simply a slice of the complete document. LightRAG mainly supports token-based slicing, which is far from sufficient for every document; many Markdown documents, for example, call for structure-aware slicing
  • An Entity can be understood as a keyword or piece of key information in a Chunk, such as high-frequency vocabulary, and Entities are divided into different types. The Entity instances here are summarized with the help of large models; the specific prompts can be found in the lightrag/prompt.py code
  • Finally, there is Relation. Building a graph from Entities requires relationships between the nodes, and that is what a Relation is. Relations are likewise summarized and extracted by large models.
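As a mental model, the three concepts can be sketched as plain data structures. This is illustrative only; the field names mirror the table columns later in this article, not LightRAG's actual classes:

```python
from dataclasses import dataclass

# Illustrative data model for the three concepts. Field names follow the
# table columns used later in this article, not LightRAG's internal types.

@dataclass
class Chunk:
    chunk_id: str
    full_doc_id: str        # the document this slice came from
    content: str
    tokens: int             # token count of the slice
    chunk_order_index: int  # position of the slice within the document

@dataclass
class Entity:
    name: str
    entity_type: str        # e.g. "person", "organization"; assigned by the LLM
    description: str        # LLM-generated summary
    source_chunk_id: str    # the Chunk it was extracted from

@dataclass
class Relation:
    source_name: str        # name of one endpoint Entity
    target_name: str        # name of the other endpoint Entity
    weight: float           # strength of association
    keywords: str
    description: str
```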

Now that we understand these concepts, let’s look at the specific process.

The index construction process can be roughly divided into the following steps:

  1. First, we need to slice the document into Chunks. This process is relatively simple, mainly using token-based slicing
  2. Then, we need to extract Entity from these Chunks. This step requires the help of large models, which will analyze the content of each Chunk and extract key information
  3. After extracting Entity, we need to analyze the relationships between these Entity, which is Relation. This also requires large model assistance
  4. Finally, we need to store all this information. This includes:
    • Store the original Chunks
    • Store the vector representations of these Chunks
    • Store the extracted Entity information
    • Store the Relation information between Entity
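The token-based slicing in step 1 can be sketched roughly as follows. Whitespace tokenization here is a stand-in for a real tokenizer (LightRAG counts model tokens), and the window and overlap sizes are illustrative:

```python
def chunk_by_tokens(text, max_tokens=1024, overlap=128):
    """Slice text into overlapping token windows.

    Whitespace tokenization is a simplifying assumption; a real
    implementation would count model tokens (e.g. via a tokenizer
    such as tiktoken).
    """
    tokens = text.split()
    chunks = []
    step = max_tokens - overlap
    for index, start in enumerate(range(0, len(tokens), step)):
        window = tokens[start:start + max_tokens]
        chunks.append({
            "chunk_order_index": index,
            "tokens": len(window),
            "content": " ".join(window),
        })
        if start + max_tokens >= len(tokens):
            break  # the last window already covers the end of the text
    return chunks
```

Adjacent chunks share `overlap` tokens so that sentences cut at a boundary still appear whole in at least one chunk.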

The query process is relatively simple:

  1. First, we need to convert the query into a vector representation
  2. Then use this vector to find similar Chunks
  3. Extract Entity from these Chunks
  4. Use these Entity to find related Entity or Relation in the graph
  5. Finally, use all this information to generate a response
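A minimal sketch of these five steps, with `embed`, `vector_search`, `chunk_entities`, `graph_neighbors`, and `llm` as placeholders for the real embedding model, vector store, graph store, and LLM:

```python
# Sketch of the five query steps; all callables are placeholders
# for the real components, wired in by the caller.

def query(question, embed, vector_search, chunk_entities,
          graph_neighbors, llm, top_k=10):
    query_vector = embed(question)                     # 1. embed the query
    chunks = vector_search(query_vector, top_k=top_k)  # 2. similar Chunks
    entities = set()
    for chunk in chunks:                               # 3. Entities in those Chunks
        entities.update(chunk_entities(chunk))
    context = list(chunks)
    for name in sorted(entities):                      # 4. expand over the graph
        context.extend(graph_neighbors(name))
    return llm(question, context)                      # 5. a single LLM call
```

The point of the structure is step 5: all retrieval happens before the model is invoked, so only one LLM call is needed per query.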

Table Structure

Combining the kv storage, vector storage, and graph storage mentioned earlier, I divided the tables into four categories. They are introduced in order below:

LIGHTRAG_DOC_FULL table is mainly used to store the original document. In LightRAG, kv storage is used.

CREATE TABLE LIGHTRAG_DOC_FULL (
    `id` BIGINT PRIMARY KEY AUTO_RANDOM,
    `doc_id` VARCHAR(256) NOT NULL,
    `workspace` VARCHAR(1024),
    `content` LONGTEXT,
    `meta` JSON,
    `createtime` TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    `updatetime` TIMESTAMP DEFAULT NULL,
    UNIQUE KEY (`doc_id`)
);

LIGHTRAG_DOC_CHUNKS table is mainly used to store the sliced original document, i.e., Chunk.

In LightRAG, kv storage is used to store slices, and vector storage is used to store vector data of slices.

CREATE TABLE LIGHTRAG_DOC_CHUNKS (
    `id` BIGINT PRIMARY KEY AUTO_RANDOM,
    `chunk_id` VARCHAR(256) NOT NULL,
    `full_doc_id` VARCHAR(256) NOT NULL,
    `workspace` VARCHAR(1024),
    `chunk_order_index` INT,
    `tokens` INT,
    `content` LONGTEXT,
    `content_vector` VECTOR,
    `createtime` DATETIME DEFAULT CURRENT_TIMESTAMP,
    `updatetime` DATETIME DEFAULT NULL,
    UNIQUE KEY (`chunk_id`)
);

LIGHTRAG_GRAPH_NODES table is mainly used to store the parsed Entity.

In LightRAG, vector storage holds the content and vector of each Entity, while graph storage holds the Entity's in-graph information, such as entity_type and description, as node attributes.

It’s also worth noting that LightRAG has already performed deduplication and merging on Entity before storage, ensuring that there are no identical nodes when building the graph.
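That deduplication and merging can be sketched as follows. This is a simplification: duplicates are grouped by name and their descriptions concatenated, whereas LightRAG may additionally ask the LLM to summarize a merged description that grows too long:

```python
def merge_entities(entities):
    """Merge entities sharing a name so the graph has no duplicate nodes.

    A simplified sketch: descriptions are concatenated and source chunk
    ids collected. The real pipeline may also LLM-summarize long merged
    descriptions.
    """
    merged = {}
    for e in entities:
        key = e["name"]
        if key not in merged:
            # first occurrence: copy it and start the source-chunk list
            merged[key] = dict(e, source_chunk_ids=[e["source_chunk_id"]])
        else:
            node = merged[key]
            if e["description"] not in node["description"]:
                node["description"] += " | " + e["description"]
            node["source_chunk_ids"].append(e["source_chunk_id"])
    return list(merged.values())
```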

CREATE TABLE LIGHTRAG_GRAPH_NODES (
    `id` BIGINT PRIMARY KEY AUTO_RANDOM,
    `entity_id` VARCHAR(256),
    `workspace` VARCHAR(1024),
    `name` VARCHAR(2048),
    `entity_type` VARCHAR(1024),
    `description` LONGTEXT,
    `source_chunk_id` VARCHAR(256),
    `content` LONGTEXT,
    `content_vector` VECTOR,
    `createtime` DATETIME DEFAULT CURRENT_TIMESTAMP,
    `updatetime` DATETIME DEFAULT NULL,
    KEY (`entity_id`)
);

LIGHTRAG_GRAPH_EDGES table is mainly used to store Relations.

Similarly, in LightRAG, vector storage is used to store the content and vector of Relations, and graph storage is used to store the information of Relations in the graph, such as weight, keywords, description, etc.

It’s also worth noting that GraphRAG and LightRAG ultimately build an undirected graph with weights. So for this table, the combination of source_name and target_name will only appear once. This part of deduplication and merging is also done by LightRAG in the code.

As for why a weighted undirected graph is chosen: my personal guess is that RAG retrieval mainly cares about the connections between nodes (semantic relevance) and their weights (strength of association), not the order of the nodes. This way only the relationship itself and its weight matter, without worrying about which node is the source and which is the target; being undirected also reduces the structural complexity.
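Undirected deduplication boils down to a canonical edge key: sort the two endpoint names so that (A, B) and (B, A) map to the same edge. A sketch, in which summing the weights of duplicates is an assumption about the merge rule (keywords and descriptions could be concatenated the same way):

```python
def edge_key(source_name, target_name):
    """Canonical key for an undirected edge: order the endpoints so
    (A, B) and (B, A) map to the same key."""
    return tuple(sorted((source_name, target_name)))

def merge_edges(edges):
    """Deduplicate undirected edges; duplicate weights are summed here
    (an assumed merge rule for illustration)."""
    merged = {}
    for e in edges:
        key = edge_key(e["source_name"], e["target_name"])
        if key in merged:
            merged[key]["weight"] += e["weight"]
        else:
            merged[key] = dict(e)
    return list(merged.values())
```

With this key, the (source_name, target_name) pair really does appear only once per relationship in the table below.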

CREATE TABLE LIGHTRAG_GRAPH_EDGES (
    `id` BIGINT PRIMARY KEY AUTO_RANDOM,
    `relation_id` VARCHAR(256),
    `workspace` VARCHAR(1024),
    `source_name` VARCHAR(2048),
    `target_name` VARCHAR(2048),
    `weight` DECIMAL,
    `keywords` TEXT,
    `description` LONGTEXT,
    `source_chunk_id` VARCHAR(256),
    `content` LONGTEXT,
    `content_vector` VECTOR,
    `createtime` DATETIME DEFAULT CURRENT_TIMESTAMP,
    `updatetime` DATETIME DEFAULT NULL,
    KEY (`relation_id`)
);

SQL and Usage

Earlier, it was mentioned that there are two main parts: index construction and querying. The index construction part is relatively simple, mainly involving CRUD operations, so we’ll focus on the querying part.

The query process first performs vector retrieval and then uses the graph to query related Entity and Relation.

Let’s take a look at the vector retrieval part. The following is just an SQL example, which can be optimized and rewritten according to specific scenarios:

## Search for Entity/Node vectors
SELECT n.name AS entity_name
FROM (
        SELECT entity_id AS id,
            name,
            (1 - VEC_COSINE_DISTANCE(content_vector, :embedding_string)) AS similarity
        FROM LIGHTRAG_GRAPH_NODES
        WHERE workspace = :workspace
    ) n
WHERE n.similarity > :better_than_threshold
ORDER BY n.similarity DESC
LIMIT :top_k;

## Search for Relation/Edge vectors
SELECT e.source_name AS src_id,
    e.target_name AS tgt_id
FROM (
        SELECT source_name,
            target_name,
            (1 - VEC_COSINE_DISTANCE(content_vector, :embedding_string)) AS similarity
        FROM LIGHTRAG_GRAPH_EDGES
        WHERE workspace = :workspace
    ) e
WHERE e.similarity > :better_than_threshold
ORDER BY e.similarity DESC
LIMIT :top_k;
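One point worth being careful about here: cosine distance is 1 minus cosine similarity, so a lower distance means a more similar vector, and a "better than threshold" filter is naturally a comparison on similarity, not on raw distance. A small self-contained check of those semantics:

```python
import math

def cosine_similarity(a, b):
    """Standard cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    # Cosine distance = 1 - similarity, so LOWER distance = MORE similar.
    return 1 - cosine_similarity(a, b)

def better_than(a, b, threshold):
    # "better than threshold" compares similarity, i.e. 1 - distance
    return 1 - cosine_distance(a, b) > threshold
```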

Next, we need to query the detailed information of the vector retrieval results and get the degree of Entity or Relation:

  • The degree of Entity refers to the number of edges connected to this node
  • The degree of Relation refers to the sum of the degrees of the two Entity that make up the Relation

Here, only one SQL statement is needed, which is suitable for undirected graphs:

SELECT COUNT(id) AS cnt
FROM LIGHTRAG_GRAPH_EDGES
WHERE workspace = :workspace
    AND :name IN (source_name, target_name)
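The same degree computation, sketched over an in-memory edge list to make the `:name IN (source_name, target_name)` filter concrete:

```python
def node_degree(edges, name):
    """Degree of an Entity: the number of edges touching the node,
    mirroring the SQL `:name IN (source_name, target_name)` filter."""
    return sum(1 for e in edges
               if name in (e["source_name"], e["target_name"]))

def edge_degree(edges, source_name, target_name):
    """Degree of a Relation: the sum of the degrees of its two endpoints."""
    return node_degree(edges, source_name) + node_degree(edges, target_name)
```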

At this point, we have obtained the content and degrees of Entity or Relation through vector retrieval. The next step is to query related Relation or Entity based on these results:

  • Query the related edges (i.e., Relation) based on the Entity obtained from vector retrieval
  • Find the corresponding src_id and tgt_id based on the Relation obtained from vector retrieval, and query these nodes (i.e., Entity)

Finally, we need to query the Chunks related to these Entities and Relations. This involves some simple SQL queries that I won’t elaborate on here; interested readers can download the source code and run it.

Conclusion

Obviously, from LightRAG’s perspective, TiDB has significant advantages compared to many storage engines that only handle relational or vector data.

Regarding the graph part, I searched for TiGraph and related operators, but there are relatively few concrete usage examples, so I haven’t dug deeper into it yet (see "TiGraph: 8,700x Computing Performance Achieved by Combining Graphs + the RDBMS Syntax").

From my current understanding of TiDB and TiDB Vector, for undirected graphs a single TiDB instance is basically sufficient to persist most GraphRAG-style scenarios and their derivatives. Beyond that, only the commercially mature Oracle or the community-driven PostgreSQL can really compete with it.

Finally, I recommend pingcap/autoflow, TiDB’s open-source project built on GraphRAG. I’ve been too busy lately; I’ll take a closer look when I have time.
