Introduction #
Embedding, vector search, and rerank are topics that cannot be avoided when building a RAG-based large-model application.
Embedding and vector search have been introduced in previous articles.
This article mainly discusses the principles of Rerank, its development after the Transformer, and finally whether it is necessary to use Rerank at all.
Rerank concept and application scenarios #
First, let's talk about Rerank. Intuitively, it is the reordering of search results. The concept itself is not new; rerank is widely used in traditional large-scale search engines and even in in-app search. For example:
- Baidu search will rerank search results based on user preferences after retrieving the content
- Dianping will rerank search results based on user preferences on the client side after the server has finished searching (*End-to-end intelligent reranking in Dianping search.[2022-02]*)
In addition, the concept of rerank has long been used in the field of natural language processing (NLP). For example:
- Discriminative reranking for natural language parsing (*Discriminative Reranking for natural language parsing. [2005-03]*)
Therefore, reranking is mainly used for precise ranking over a small candidate set after a large-scale search. Compared with the more standardized retrieval methods used over large amounts of data, such as inverted indexes, vectors, or word frequencies, reranking focuses on finer-grained details and associations, for example the user-personalized reordering mentioned above.
Returning to the RAG scenario itself, vector search (embedding + vector database) is inherently lossy with respect to the original natural language, especially in the embedding step, and different vector dimensions and models lose information to different degrees. Full-text search (BM25), which relies on word segmentation plus term-frequency matching, has no semantic understanding at all. Rerank, by contrast, can score the original natural language directly after retrieval and therefore sort it more accurately.
However, reranking algorithms are generally less efficient, and many of them also consume a lot of computing resources, so they are only suitable for precise ranking over a small candidate set.
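To make this division of labor concrete, here is a minimal retrieve-then-rerank sketch. It assumes the sentence-transformers library and two of its public checkpoints (all-MiniLM-L6-v2 as the Bi-Encoder, cross-encoder/ms-marco-MiniLM-L-6-v2 as the Cross-Encoder); it only illustrates where Rerank sits in the pipeline, not any particular production setup.
# A minimal retrieve-then-rerank sketch (assumes the sentence-transformers library
# and the two public checkpoints named below; illustration only).
from sentence_transformers import SentenceTransformer, CrossEncoder, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")                # Bi-Encoder for large-scale recall
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # Cross-Encoder for precise ranking

query = "how does rerank improve retrieval quality"
corpus = [
    "Rerank reorders a small candidate set using the original text of query and document.",
    "BM25 matches documents to queries by term frequency and inverse document frequency.",
    "Vector search retrieves documents whose embeddings are close to the query embedding.",
]

# Stage 1: cheap, large-scale recall with vector search
doc_emb = embedder.encode(corpus, convert_to_tensor=True)
query_emb = embedder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, doc_emb, top_k=2)[0]

# Stage 2: expensive, small-scale precise ranking with the reranker
candidates = [corpus[hit["corpus_id"]] for hit in hits]
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)]
print(reranked)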
BERT-based reranking – encoder #
From the introduction of the Transformer in 2017 to BERT and GPT-1 in 2018, researchers found that Transformer-based pre-trained language models made good progress in natural language processing. It was therefore foreseeable that reranking for natural language might follow a similar path.
In January 2019, researchers proposed a BERT-based reranking model (*Passage Re-ranking with BERT. [2019.01]*). In the paper, BERT-LARGE is used as a classification model, i.e., the representation of BERT's [CLS] token is fed into a single-layer neural network to obtain a relevance probability, and the passages are reordered based on these probabilities, similar to a Cross-Encoder.
$$ \mathbf{Input}:\space \mathrm{[CLS]} \space query\_token\_ids \space \mathrm{[SEP]}\space doc\_token\_ids\space \mathrm{[SEP]} $$
This is directly reflected in the code, mainly in the part that converts the dataset to TensorFlow features:
# https://github.com/nyu-dl/dl4marco-bert/blob/master/convert_msmarco_to_tfrecord.py
def write_to_tf_record(writer, tokenizer, query, docs, labels,
                       ids_file=None, query_id=None, doc_ids=None):
  query = tokenization.convert_to_unicode(query)
  # Convert the query to token ids, adding [CLS] in front
  query_token_ids = tokenization.convert_to_bert_input(
      text=query, max_seq_length=FLAGS.max_query_length, tokenizer=tokenizer,
      add_cls=True)

  query_token_ids_tf = tf.train.Feature(
      int64_list=tf.train.Int64List(value=query_token_ids))

  for i, (doc_text, label) in enumerate(zip(docs, labels)):
    # Convert the doc to token ids, without adding [CLS]
    doc_token_id = tokenization.convert_to_bert_input(
        text=tokenization.convert_to_unicode(doc_text),
        max_seq_length=FLAGS.max_seq_length - len(query_token_ids),
        tokenizer=tokenizer,
        add_cls=False)
    doc_ids_tf = tf.train.Feature(
        int64_list=tf.train.Int64List(value=doc_token_id))

    # Dataset labels
    labels_tf = tf.train.Feature(
        int64_list=tf.train.Int64List(value=[label]))

    # Assemble the per-example features: query ids, doc ids and the relevance label
    features = tf.train.Features(feature={
        'query_ids': query_token_ids_tf,
        'doc_ids': doc_ids_tf,
        'label': labels_tf,
    })
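The snippet above only covers data preparation. For the scoring side, a minimal sketch of the same idea with today's Hugging Face transformers (an illustration of the approach, not the paper's original TensorFlow code) would be: feed the concatenated pair through BERT, take the [CLS] vector, and map it to a relevance probability with a single-layer head that, in the paper, is trained on labeled query/passage pairs.
# A sketch of [CLS]-based pointwise scoring with Hugging Face transformers
# (an illustration of the idea, not the original dl4marco-bert implementation).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
bert = AutoModel.from_pretrained("bert-large-uncased")
# Single-layer head on top of [CLS]; in practice it is trained jointly with BERT on labeled pairs
score_head = torch.nn.Linear(bert.config.hidden_size, 1)

def relevance_score(query: str, passage: str) -> float:
    # Builds: [CLS] query_token_ids [SEP] doc_token_ids [SEP]
    inputs = tokenizer(query, passage, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        cls_vec = bert(**inputs).last_hidden_state[:, 0]   # the [CLS] representation
    return torch.sigmoid(score_head(cls_vec)).item()       # relevance probability

passages = ["BERT can be used to re-rank retrieved passages.", "An unrelated passage about cooking."]
ranked = sorted(passages, key=lambda p: relevance_score("passage re-ranking with BERT", p), reverse=True)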
Finally, it is worth mentioning that the authors trained and evaluated BERT on the MS MARCO and TREC-CAR datasets and achieved good results.
In a similar vein, another paper, *Understanding the Behaviors of BERT in Ranking. [2019.04]*, conducted further research. Four BERT-based ranking methods are proposed, with the following notation: \(q\) denotes the query; \(d\) denotes the document; \(qd\) denotes the concatenation of the query and document, separated by a [SEP] token; and \(last\) denotes the last layer of BERT, i.e., \(k = 24\):
- BERT(Rep): Calculate the embedding representations of the query and document separately, and then calculate the cosine distance between the embeddings. This method is similar to Bi-Encoder.
$$ \mathbf{Input\ 1}: \space \mathrm{[CLS]} \space query\_token\_ids \space \mathrm{[SEP]} \\ \mathbf{Input\ 2}: \space \mathrm{[CLS]} \space doc\_token\_ids \space \mathrm{[SEP]} \\ \mathbf{BERT(Rep)}(q,d) = \cos(\overrightarrow{q}_{cls}^{last}, \overrightarrow{d}_{cls}^{last}) $$
- BERT(Last-Int): Like the paper above, this is the recommended way of using BERT. Here \(w\) is a weight vector, i.e., the [CLS] representation of the joint input is combined linearly via \(w\). This method is similar to a Cross-Encoder.
$$ \mathbf{Input}: \space \mathrm{[CLS]} \space query\_token\_ids \space \mathrm{[SEP]} \space doc\_token\_ids \space \mathrm{[SEP]} \\ \mathbf{BERT(Last\text{-}Int)}(q,d) = w^T \overrightarrow{qd}^{last}_{cls} $$
- BERT(Mult-Int): Building on the previous method, matching features from different BERT layers are added, to investigate whether different layers of BERT provide different information.
$$ \mathbf{BERT(Mult\text{-}Int)}(q,d) = \sum_{1\le k \le 24} (w^k_{Mult})^T \overrightarrow{qd}^{k}_{cls} $$
- BERT(Term-Trans): An enhanced combination of methods 1 and 3, adding a neural ranking network on top of BERT. First, the query and document term embeddings of each layer are projected through a transformation matrix and their pairwise cosine similarities are averaged; then the per-layer scores of all BERT layers are combined with a linear combination.
$$ s^k(q,d) = \mathrm{Mean}_{i,j}(\cos(\mathrm{relu}(P^k\overrightarrow{q}_i^k), \mathrm{relu}(P^k\overrightarrow{d}_j^k))) \\ \mathbf{BERT(Term\text{-}Trans)}(q,d) = \sum_{1\le k \le 24} w^k_{trans}\, s^k(q,d) $$
The final conclusion is that BERT-based reranking outperforms the neural retrieval models of the time on MS MARCO re-ranking, which means it is very effective in question-answering scenarios. However, it performs poorly on TREC Web Track ad-hoc document ranking, because that scenario depends on user-click behavior rather than on the text context itself.
There are other classic papers in this BERT line of work that will not be repeated here, such as [2019.10] Multi-Stage Document Ranking with BERT. Those who are interested can have a look.
GPT-based Rerank – Decoder #
After the Transformer appeared, research on ranking and reranking with the Transformer architecture focused mainly on the encoder-based BERT model. After ChatGPT came out, some researchers' attention gradually shifted to the decoder-based GPT family, exploring how to use a decoder for reranking (*SGPT: GPT Sentence Embeddings for Semantic Search. [2022.02]*).
SGPT: GPT Sentence Embeddings for Semantic Search
SGPT proposes two methods:
- SGPT Cross-Encoder: As the name suggests, this is Cross-Encoding with GPT. The query and document are concatenated and encoded together, and the score is then computed from the resulting log probabilities.
## https://github.com/Muennighoff/sgpt/blob/main/README.md#asymmetric-semantic-search-ce
## `tokenizer`, `model` (a GPT-style causal LM), `queries` and `docs` are set up
## earlier in the linked README; only the scoring loop is shown here.
import torch

prompt = 'Documents are searched to find matches with the same content. The document "{}" is a good search result for "'

for query in queries:
    print(f"Query: {query}")
    for doc in docs:
        context = prompt.format(doc)
        context_enc = tokenizer.encode(context, add_special_tokens=False)
        continuation_enc = tokenizer.encode(query, add_special_tokens=False)
        ## Concatenate document (context) and query (continuation)
        model_input = torch.tensor(context_enc + continuation_enc[:-1])
        continuation_len = len(continuation_enc)
        input_len, = model_input.shape
        ## Get the log probabilities over the vocabulary for every position
        logprobs = torch.nn.functional.log_softmax(model(model_input)[0], dim=-1).cpu()
        ## Keep only the positions that predict the continuation (query) tokens
        logprobs = logprobs[input_len - continuation_len:]
        ## Gather the log probabilities of the continuation tokens -> [continuation_len]
        logprobs = torch.gather(logprobs, 1, torch.tensor(continuation_enc).unsqueeze(-1)).squeeze(-1)
        score = torch.sum(logprobs)
        ## The higher (closer to 0), the more similar
        print(f"Document: {doc[:20] + '...'} Score: {score}")
- SGPT Bi-Encoder: Similarly, GPT is used as a Bi-Encoder. The method is relatively simple: compute the embeddings of the query and the documents separately, then compute the cosine similarity.
## https://github.com/Muennighoff/sgpt/blob/main/README.md#asymmetric-semantic-search-be
## `get_weightedmean_embedding` and `tokenize_with_specb` are helpers defined in the
## linked README; `cosine` is scipy.spatial.distance.cosine.
## Calculate the embedding of the query and the documents separately
query_embeddings = get_weightedmean_embedding(tokenize_with_specb(queries, is_query=True), model)
doc_embeddings = get_weightedmean_embedding(tokenize_with_specb(docs, is_query=False), model)
## Calculate the cosine similarity; the score is in [-1, 1]
cosine_sim_0_1 = 1 - cosine(query_embeddings[0], doc_embeddings[0])
cosine_sim_0_2 = 1 - cosine(query_embeddings[0], doc_embeddings[1])
cosine_sim_0_3 = 1 - cosine(query_embeddings[0], doc_embeddings[2])
Rerank model inventory #
Finally, let's take an inventory of some well-known and widely used rerank models, such as BGE from the Zhiyuan Lab (BAAI), GTE from Alibaba, Jina AI, Cohere, etc., and see how they are implemented.
One thing to note is that, as seen above, Rerank's score computation and comparison rely on the model's embeddings. Therefore, when these well-known search organizations or companies publish papers, they focus mainly on embedding: how to make models represent and understand semantics better. Rerank is then essentially built on top of that embedding research, implemented as either a Cross-Encoder or a Bi-Encoder.
Therefore, we will see more Embedding-related models below.
BGE Rerank #
Two papers related to BGE Embedding and Rerank models have been found so far. The specific models can be found directly on Huggingface BAAI, and the source code is located at the following address:
Retrieval and Retrieval-augmented LLMs
Here is a brief overview of their two papers:
- [2023.12] Making Large Language Models A Better Foundation For Dense Retrieval
The first version of BGE Embedding is mainly pretrained and fine-tuned on the LLaMA-2-7B (base) model.
This paper focuses on a pain point of decoder-only models: when producing an embedding, the model attends to the current local context and the tokens that come next, while ignoring the semantics of the sentence's overall context.
The LLaRA architecture is proposed: during pretraining, the model is made to produce two embeddings. One is EBAE (Embedding-Based Auto-Encoding), the embedding of the original sentence, used to predict the tokens of the sentence itself; the other is EBAR (Embedding-Based Auto-Regression), used to predict the tokens of the next sentence, so as to enhance the model's understanding of the overall context.
- [2024.02] BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
BGE-M3 is a comprehensive upgrade over the earlier versions: the base model is switched to the pre-trained XLM-RoBERTa, bringing support for multiple languages, multiple retrieval modes, and so on. The paper also proposes a self-knowledge-distillation training framework, which will not be explained in detail here.
Let’s go back to Rerank itself. After the Embedding-related settings are done, Rerank is relatively simple.
BGE-Rerank is also fine-tuned on the basis of XLM-RoBERTa. The specific method uses the Cross-Encoder. For details, see the source code: FlagOpen/FlagEmbedding - Reranker.
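As a usage sketch (assuming the public BAAI/bge-reranker-base checkpoint on Hugging Face; the FlagEmbedding repo also ships an official FlagReranker wrapper), the Cross-Encoder scoring looks roughly like this:
# Scoring query/passage pairs with a BGE reranker checkpoint via transformers
# (assumes the public BAAI/bge-reranker-base model; illustration only).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-base")
model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-reranker-base")
model.eval()

pairs = [
    ["what is BGE?", "BGE is a family of text embedding and reranker models from BAAI."],
    ["what is BGE?", "The weather in Hangzhou is nice today."],
]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, max_length=512, return_tensors="pt")
    scores = model(**inputs).logits.view(-1)   # one relevance score per pair; higher is more relevant
print(scores)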
Jina Rerank #
Jina's embedding and reranking models are mainly pre-trained and fine-tuned on BERT and BERT variants. The models can be found on Hugging Face under jinaai. The source code does not appear to be public, so here we mainly look at the related papers:
- [2023.07] Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models:
The first version of Jina's Embedding is pre-trained on the **T5 model** (*Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. [2019.10]*); in addition, the Jina team designed a new dataset specifically for training the Embedding model.
- [2023.10] Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents:
Jina's second version of Embedding is pre-trained on the BERT model and uses Attention with Linear Biases (ALiBi) to extend the context length of the model input; the model's embedding ability is then fine-tuned on text data; finally, adversarial-example fine-tuning is added to further improve the model's ability.
- [2024.09] jina-embeddings-v3: Multilingual Embeddings With Task LoRA:
Jina's third version of Embedding is a fairly recent model. It is trained and fine-tuned on the XLM-RoBERTa model (the same base used by BGE-M3), and FlashAttention 2 is introduced to improve computational efficiency.
One of the more innovative points is that the Jina team introduced five task-specific LoRA adapters, trained independently: ① retrieval.passage: embeds documents in query-document retrieval tasks, mainly passage embedding in asymmetric retrieval; ② retrieval.query: embeds queries in query-document retrieval tasks, mainly query embedding in asymmetric retrieval; ③ separation: clustering analysis of documents, for clustering and reranking tasks; ④ classification: text classification, for classification tasks; ⑤ text-matching: semantic text-similarity matching, for tasks such as symmetric retrieval.
Back to Rerank: Jina has not yet released a Rerank model based on jina-embeddings-v3 (presumably one will be released soon).
As for jina-reranker-v2 itself, it uses the same Cross-Encoder approach as BGE-M3.
GTE Rerank #
The GTE Embedding and Rerank models were released by the Alibaba team. Alibaba has its own large model, Qwen, so its retrieval-related embedding and reranking models are also basically trained and fine-tuned on Qwen. The models can be found on Hugging Face under Alibaba-NLP.
For related papers, the main one is Towards General Text Embeddings with Multi-stage Contrastive Learning; I won't go into it here.
Why do we need Rerank? #
As mentioned above, many current Rerank models are built on Embedding models, which are themselves trained and fine-tuned from large models with relatively few parameters.
As also mentioned above, many Embedding models keep improving at embedding contextual semantics while supporting multiple languages, long texts, and so on. So why do we still need Rerank, when an Embedding model plus a vector retrieval algorithm already achieves good results?
In principle, I personally think that the difference lies mainly in two points:
- At the embedding level, current Rerank models mostly use a Cross-Encoder, i.e., the query and document are concatenated and then embedded together; with vector retrieval, the query and document are vectorized separately, which is closer to the Bi-Encoder mentioned above
- In terms of scoring, a Cross-Encoder-based Rerank model uses log probabilities (with fine-tuning, adapters, and so on possibly affecting the final score), whereas vector search is a pure algorithm with a variety of distance metrics, the most common still being cosine distance (a short sketch after this list makes the contrast concrete)
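To make these two points concrete, here is a minimal side-by-side sketch of the two scoring paths, again assuming the sentence-transformers library and its public checkpoints:
# Bi-Encoder scoring vs Cross-Encoder scoring side by side
# (assumes sentence-transformers and the two public checkpoints below; illustration only).
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "how to rerank retrieved passages"
doc = "A cross-encoder jointly encodes the query and the passage and outputs a relevance score."

# Bi-Encoder (vector search): query and document are embedded independently,
# then compared with a pure vector-space metric such as cosine similarity
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
q_emb, d_emb = bi_encoder.encode([query, doc], convert_to_tensor=True)
cosine_score = util.cos_sim(q_emb, d_emb).item()

# Cross-Encoder (rerank): query and document are concatenated and the model itself
# produces the relevance score from the joint input
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
joint_score = cross_encoder.predict([(query, doc)])[0]

print(cosine_score, joint_score)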
In terms of actual data, we can quote Cohere's published figures here; overall, Rerank still brings some improvement.
Of course, the final conclusion is just my personal view:
- If the search has strict real-time requirements, then I think vector search alone is sufficient
- If further accuracy is required, then reranking after the search is definitely better
References #
SIGIR 2008 Workshop Learning to Rank for Information Retrieval
[2009.03] Learning to Rank for Information Retrieval
[2018.10] Seq2Slate: Re-ranking and Slate Optimization with RNNs
[2019.01] Passage Re-ranking with BERT
[2019.04] Understanding the Behaviors of BERT in Ranking
[2019.10] Multi-Stage Document Ranking with BERT
[2019.10] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
[2022.02] SGPT: GPT Sentence Embeddings for Semantic Search
[2023.07] Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models
[2023.08] Towards General Text Embeddings with Multi-stage Contrastive Learning
[2023.10] Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents
[2023.12] Making Large Language Models A Better Foundation For Dense Retrieval
[2024.09] jina-embeddings-v3: Multilingual Embeddings With Task LoRA
Say Goodbye to Irrelevant Search Results: Cohere Rerank Is Here