
Rerank Models


Introduction

Embedding, vector search, and rerank are topics that cannot be avoided when building an LLM application based on RAG.

Embedding and vector search have been introduced in previous articles.

Vector similarity search methods
This paper provides a detailed introduction to various vector similarity search methods, such as KD trees, IVF inverted indexes, HNSW and LSH. It provides a detailed introduction from data structures to algorithm implementations by analyzing the specific implementations in the source codes of Annoy, Faiss, PGVector and FALCONN.

This article mainly discusses the principles of rerank, how it developed after the Transformer, and finally whether rerank is really necessary.

Rerank concept and application scenarios

First, let’s talk about rerank. Intuitively, it is the reordering of search results. The concept itself is not new: rerank is widely used in traditional large-scale search engines and even in in-app search, for example to re-rank results for a specific user (personalization).

In addition, the concept of rerank has long been used in the field of natural language processing (NLP).

If you want to learn more about learning to rank, you can read *Learning to Rank for Information Retrieval*.

Reranking is therefore mainly used for fine-grained ranking over a small candidate set after a large-scale retrieval step. Compared with the more standardized retrieval methods that handle large volumes of data, such as inverted indexes, vector search, or term-frequency matching, reranking focuses on finer details and associations, for example the user-specific personalization mentioned above.

Returning to the RAG scenario itself: vector search (embedding + vector database) is inherently lossy with respect to the original natural language, especially in the embedding step, and different vector dimensions and models lose information to different degrees. Full-text search (BM25) relies on tokenization plus term-frequency matching and has no semantic understanding of its own. Rerank, by contrast, scores the original natural language directly after retrieval and can therefore produce a more accurate ordering.

However, reranking algorithms are generally less efficient, so they are only suitable for fine-grained ranking over a small candidate set. Many current reranking models also consume a lot of computing resources.

BERT-based reranking – encoder

From the introduction of the Transformer in 2017 to the BERT and GPT-1 models in 2018, researchers found that Transformer-based pre-trained language models made good progress in natural language processing. It was natural to expect that reranking for natural language would follow a similar path.

In January 2019, researchers proposed a BERT-based reranking model (*Passage Re-ranking with BERT. [2019.01]*). The paper uses BERT-LARGE as a classification model, i.e., it uses the [CLS] token in BERT: a single-layer neural network on top of [CLS] produces a relevance probability for each input, and the passages are reordered by these probabilities. This is essentially a Cross-Encoder.

$$ \mathbf{Input}:\space \mathrm{[CLS]} \space query\_token\_ids \space \mathrm{[SEP]}\space doc\_token\_ids\space \mathrm{[SEP]} $$

This is reflected directly in the code, mainly in the part that converts the dataset into TensorFlow Features:

# https://github.com/nyu-dl/dl4marco-bert/blob/master/convert_msmarco_to_tfrecord.py

def write_to_tf_record(writer, tokenizer, query, docs, labels,
                       ids_file=None, query_id=None, doc_ids=None):
  query = tokenization.convert_to_unicode(query)
  # Convert the query to token ids, adding [CLS] in front
  query_token_ids = tokenization.convert_to_bert_input(
      text=query, max_seq_length=FLAGS.max_query_length, tokenizer=tokenizer,
      add_cls=True)

  query_token_ids_tf = tf.train.Feature(
      int64_list=tf.train.Int64List(value=query_token_ids))

  for i, (doc_text, label) in enumerate(zip(docs, labels)):
    # Convert the document to token ids, without adding [CLS]
    doc_token_id = tokenization.convert_to_bert_input(
        text=tokenization.convert_to_unicode(doc_text),
        max_seq_length=FLAGS.max_seq_length - len(query_token_ids),
        tokenizer=tokenizer,
        add_cls=False)

    doc_ids_tf = tf.train.Feature(
        int64_list=tf.train.Int64List(value=doc_token_id))

    # Dataset labels (relevant / not relevant)
    labels_tf = tf.train.Feature(
        int64_list=tf.train.Int64List(value=[label]))

    # Each query/doc pair becomes one record: query ids, doc ids, label
    features = tf.train.Features(feature={
        'query_ids': query_token_ids_tf,
        'doc_ids': doc_ids_tf,
        'label': labels_tf,
    })
    # (the original script then wraps `features` in a tf.train.Example
    # and writes it to `writer`)
nyu-dl/dl4marco-bert


Finally, it is worth mentioning that the authors trained and evaluated the BERT reranker on the MS MARCO and TREC-CAR datasets and achieved strong results.
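
To make the [CLS]-based scoring concrete, here is a minimal sketch of how such a cross-encoder scores query-passage pairs at inference time. It is not the paper's code: it uses the Hugging Face transformers sequence-classification head as a stand-in for the single-layer network on top of [CLS], and "bert-base-uncased" is only a placeholder (a checkpoint actually fine-tuned on MS MARCO would be needed for meaningful scores).

# Hedged sketch: scoring query-passage pairs with a BERT cross-encoder.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

query = "what causes tides"
passages = [
    "Tides are caused by the gravitational pull of the moon and the sun.",
    "A tidal wave is a large movement of water caused by an earthquake.",
]

with torch.no_grad():
    # [CLS] query [SEP] passage [SEP] -- the input format used in the paper
    inputs = tokenizer([query] * len(passages), passages,
                       padding=True, truncation=True, return_tensors="pt")
    logits = model(**inputs).logits
    # Probability of the "relevant" class, used to re-order the passages
    probs = torch.softmax(logits, dim=-1)[:, 1]

for passage, prob in sorted(zip(passages, probs.tolist()), key=lambda x: -x[1]):
    print(f"{prob:.3f}  {passage}")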

In a similar vein, another paper, *Understanding the Behaviors of BERT in Ranking. [2019.04]*, took the research further. It proposes four BERT-based ranking methods, with the following notation: \(q\) denotes the query; \(d\) denotes the document; \(qd\) denotes the concatenation of the query and document, separated by a [SEP] token; and \(last\) denotes the last layer of BERT, i.e., \(k = 24\):

  1. BERT(Rep): Compute the embedding representations of the query and document separately, then compute the cosine similarity between them. This method is similar to a Bi-Encoder (see the sketch after this list).

$$ \mathbf{Input\ 1}: \space \mathrm{[CLS]} \space query\_token\_ids \space \mathrm{[SEP]} \\ \mathbf{Input\ 2}: \space \mathrm{[CLS]} \space doc\_token\_ids \space \mathrm{[SEP]} \\ \mathbf{BERT(Rep)}(q,d) = \cos(\overrightarrow{q}_{cls}^{last}, \overrightarrow{d}_{cls}^{last}) $$

  2. BERT(Last-Int): The same approach as in the paper above, and the standard recommended way of using BERT. Here \(w\) is a weight vector, i.e., the last-layer [CLS] representation is scored by a linear combination with \(w\). This method is similar to a Cross-Encoder.

$$ \mathbf{Input}: \space \mathrm{[CLS]} \space query\_token\_ids \space \mathrm{[SEP]} \space doc\_token\_ids \space \mathrm{[SEP]} \\ \mathbf{BERT(Last\text{-}Int)}(q,d) = w^T \overrightarrow{qd}^{last}_{cls} $$

  3. BERT(Mult-Int): Extends the previous method by adding matching features from all BERT layers, in order to investigate whether different layers of BERT provide different information.

$$ \mathbf{BERT(Mult\text{-}Int)}(q,d) = \sum_{1\le k \le 24} (w^k_{Mult})^T \overrightarrow{qd}^{k}_{cls} $$

  4. BERT(Term-Trans): A combination of methods 1 and 3, adding a neural ranking network on top of BERT. For each layer, the query and document token embeddings are projected with a transformation matrix and the cosine similarities between them are mean-pooled; the per-layer scores of all BERT layers are then combined with a linear combination.

$$ s^k(q,d) = \mathrm{Mean}_{i,j}(\cos(\mathrm{relu}(P^k\overrightarrow{q}_i^k), \mathrm{relu}(P^k\overrightarrow{d}_j^k))) \\ \mathbf{BERT(Term\text{-}Trans)}(q,d) = \sum_{1\le k \le 24} w^k_{trans}s^k(q,d) $$
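
As a rough illustration of the difference between method 1 (bi-encoder style) and method 2 (cross-encoder style), here is a small sketch using Hugging Face transformers. It is not the paper's code: the checkpoint name is a placeholder and the linear weight \(w\) would normally be learned during fine-tuning.

# Sketch contrasting BERT(Rep) and BERT(Last-Int); illustrative only.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()

query, doc = "what causes tides", "Tides are caused by the moon's gravity."

with torch.no_grad():
    # BERT(Rep): encode query and document separately, compare last-layer [CLS] vectors
    q_cls = bert(**tokenizer(query, return_tensors="pt")).last_hidden_state[:, 0]
    d_cls = bert(**tokenizer(doc, return_tensors="pt")).last_hidden_state[:, 0]
    rep_score = torch.nn.functional.cosine_similarity(q_cls, d_cls).item()

    # BERT(Last-Int): encode "[CLS] query [SEP] doc [SEP]" jointly, then score the
    # [CLS] vector with a linear combination w^T (randomly initialized here)
    qd_cls = bert(**tokenizer(query, doc, return_tensors="pt")).last_hidden_state[:, 0]
    w = torch.nn.Linear(bert.config.hidden_size, 1)   # stands in for the learned weights
    last_int_score = w(qd_cls).item()

print(rep_score, last_int_score)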

The conclusion is that BERT-based ranking outperforms the neural retrieval models of the time on MS MARCO re-ranking, i.e., it is very effective in question-answering-style scenarios. However, it performs poorly on TREC Web Track ad-hoc document ranking, because relevance in that TREC-style setting is driven by user clicks rather than by the text itself.

There are other classic papers on BERT-based ranking that will not be covered here, such as *Multi-Stage Document Ranking with BERT* [2019.10]; interested readers can take a look.

GPT-based Rerank – Decoder

After the Transformer appeared, research on rank/rerank with the Transformer architecture focused mainly on the encoder-based BERT model. After ChatGPT came out, some researchers’ attention gradually shifted to decoder-based GPT models, exploring how to use the decoder for reranking (*SGPT: GPT Sentence Embeddings for Semantic Search. [2022.02]*).

SGPT.png

Muennighoff/sgpt

SGPT: GPT Sentence Embeddings for Semantic Search


SGPT proposes two methods:

  1. SGPT Cross-Encoder: literally, cross-encoding with GPT. The query and document are concatenated and encoded together, and the score is computed from the resulting log probabilities.
## https://github.com/Muennighoff/sgpt/blob/main/README.md#asymmetric-semantic-search-ce

import torch

## `model` and `tokenizer` are a causal LM and its tokenizer loaded from Hugging Face
## (see the SGPT README for the full setup); `queries` and `docs` are lists of strings.
prompt = 'Documents are searched to find matches with the same content. The document "{}" is a good search result for "'

for query in queries:
    print(f"Query: {query}")
    for doc in docs:
        context = prompt.format(doc)

        context_enc = tokenizer.encode(context, add_special_tokens=False)
        continuation_enc = tokenizer.encode(query, add_special_tokens=False)
        ## Concatenate the document (prompt) and the query
        model_input = torch.tensor(context_enc + continuation_enc[:-1])
        continuation_len = len(continuation_enc)
        input_len, = model_input.shape

        ## Get the log probabilities over the vocabulary at each position
        logprobs = torch.nn.functional.log_softmax(model(model_input)[0], dim=-1).cpu()
        logprobs = logprobs[input_len - continuation_len:]
        ## Gather the log probabilities of the continuation (query) tokens -> [continuation_len]
        logprobs = torch.gather(logprobs, 1, torch.tensor(continuation_enc).unsqueeze(-1)).squeeze(-1)
        score = torch.sum(logprobs)
        ## The higher (closer to 0), the more similar
        print(f"Document: {doc[:20] + '...'} Score: {score}")
  2. SGPT Bi-Encoder: Similarly, GPT is used as a Bi-Encoder. The approach is simpler: compute the embeddings of the query and the documents separately, then compute the cosine similarity.
## https://github.com/Muennighoff/sgpt/blob/main/README.md#asymmetric-semantic-search-be

from scipy.spatial.distance import cosine

## `get_weightedmean_embedding` and `tokenize_with_specb` are helpers from the SGPT README:
## they add special bracket tokens around queries/documents and pool token embeddings
## with a position-weighted mean.
## Calculate the embeddings of the queries and the documents separately
query_embeddings = get_weightedmean_embedding(tokenize_with_specb(queries, is_query=True), model)
doc_embeddings = get_weightedmean_embedding(tokenize_with_specb(docs, is_query=False), model)

## Calculate the cosine similarity; the score is in [-1, 1]
cosine_sim_0_1 = 1 - cosine(query_embeddings[0], doc_embeddings[0])
cosine_sim_0_2 = 1 - cosine(query_embeddings[0], doc_embeddings[1])
cosine_sim_0_3 = 1 - cosine(query_embeddings[0], doc_embeddings[2])

Rerank model inventory

Finally, let’s take stock of some well-known and widely used rerank models, such as BGE from BAAI (the Beijing Academy of Artificial Intelligence, Zhiyuan), GTE from Alibaba, Jina AI, Cohere, and so on, and see how they are implemented.

One thing to note: as we saw above, rerank’s score calculation and comparison rely on the model’s embeddings. Therefore, when these well-known search organizations or companies publish papers, the focus is mainly on embeddings, i.e., on how to make models better represent and understand semantics. Rerank is then basically built on top of that embedding research, implemented as either a Cross-Encoder or a Bi-Encoder.

Therefore, we will see more Embedding-related models below.

BGE Rerank

Two papers related to the BGE embedding and rerank models have been found so far. The models themselves can be found on Hugging Face under BAAI, and the source code is at the following address:

FlagOpen/FlagEmbedding

Retrieval and Retrieval-augmented LLMs


Here is a brief overview of their two papers:

The first paper (*Making Large Language Models A Better Foundation For Dense Retrieval. [2023.12]*) focuses on a pain point of decoder-only models: when producing an embedding, the model attends to the current local and near-future semantics and ignores the semantics of the sentence’s overall context.

It proposes the LLaRA architecture: during pretraining, the model is made to produce two embeddings. One is EBAE (Embedding-Based Auto-Encoding), the embedding of the original sentence, used to predict the tokens of the sentence itself; the other is EBAR (Embedding-Based Auto-Regression), used to predict the tokens of the next sentence. This strengthens the model’s understanding of the overall context. A toy illustration of the two objectives follows.
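
Below is a toy sketch of how I read these two pretext tasks (my own illustrative reconstruction, not the official FlagEmbedding training code): the sentence embedding is projected to vocabulary logits and asked to predict, as a bag of tokens, the current sentence (EBAE) and the next sentence (EBAR). Dimensions, the projection, and the loss form are simplified assumptions.

# Illustrative toy version of the EBAE / EBAR objectives described above.
import torch
import torch.nn.functional as F

hidden, vocab = 768, 30522
proj = torch.nn.Linear(hidden, vocab)          # projects a sentence embedding to vocab logits

def bag_of_tokens_loss(sent_emb, token_ids):
    """Cross-entropy between the embedding's vocab distribution and a bag of target tokens."""
    log_probs = F.log_softmax(proj(sent_emb), dim=-1)
    return -log_probs[token_ids].mean()

sent_emb = torch.randn(hidden)                 # stand-in for the LLM's sentence embedding
cur_tokens = torch.tensor([2023, 6251, 2003])  # token ids of the current sentence
next_tokens = torch.tensor([1996, 2279, 6251]) # token ids of the next sentence

loss_ebae = bag_of_tokens_loss(sent_emb, cur_tokens)    # EBAE: reconstruct the sentence itself
loss_ebar = bag_of_tokens_loss(sent_emb, next_tokens)   # EBAR: predict the next sentence
loss = loss_ebae + loss_ebar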

BGE-V2.png

BGE-V3.png

Now let’s get back to rerank itself. Once the embedding work is in place, rerank is relatively simple.

BGE-Rerank is fine-tuned from XLM-RoBERTa and uses a Cross-Encoder. For details, see the source code: FlagOpen/FlagEmbedding - Reranker.
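
For reference, here is a minimal usage sketch of the BGE reranker as a Cross-Encoder via transformers. The checkpoint name and max length are assumptions based on the public model cards; check the BAAI model cards / FlagEmbedding repo for the authoritative usage.

# Sketch: scoring query-passage pairs with a bge-reranker checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "BAAI/bge-reranker-base"   # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

queries  = ["what is panda?", "what is panda?"]
passages = ["The giant panda is a bear species endemic to China.",
            "Paris is the capital of France."]

with torch.no_grad():
    inputs = tokenizer(queries, passages, padding=True, truncation=True,
                       return_tensors="pt", max_length=512)
    # One relevance logit per pair; higher means more relevant
    scores = model(**inputs).logits.view(-1).float()

print(scores.tolist())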

Jina Rerank

Jina’s embedding and reranking models are mainly pre-trained and fine-tuned from BERT and BERT variants. For the models themselves, see Hugging Face jinaai. The training code does not appear to be public, so here we mainly look at the related papers:

  • [2023.07] Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models:
    The first version of Jina’s embedding is pre-trained on the **T5 model** (*Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. [2019.10]*); in addition, the Jina team designed a new dataset specifically for training the embedding model.
  • [2023.10] Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents:
    Jina’s second version of embedding is pre-trained on a BERT backbone, using Attention with Linear Biases (ALiBi) to extend the amount of context the model can take as input; then text data is used to fine-tune the model’s embedding ability; finally, adversarial-example fine-tuning is added to further improve the model.
  • [2024.09] jina-embeddings-v3: Multilingual Embeddings With Task LoRA:
    Jina’s third version of embedding is a fairly recent model. It is trained and fine-tuned from the XLM-RoBERTa model (following BGE-M3 here), and FlashAttention 2 is introduced to improve computational efficiency.
    One of the more innovative points is that the Jina team introduced five task-specific LoRA adapters, trained independently: ① retrieval.passage: embeds documents/passages in query-document retrieval, i.e., the passage side of asymmetric retrieval; ② retrieval.query: embeds queries in query-document retrieval, i.e., the query side of asymmetric retrieval; ③ separation: cluster analysis of documents, for clustering and re-ranking tasks; ④ classification: text classification, for classification tasks; ⑤ text-matching: semantic similarity matching, for tasks such as symmetric retrieval. A usage sketch follows the figure below.

JINA-V3.png
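
A hedged usage sketch of the task-specific adapters follows. It assumes the `encode(..., task=...)` interface exposed by the model's custom code on Hugging Face; check the jinaai/jina-embeddings-v3 model card for the authoritative API.

# Sketch: selecting a task-specific LoRA adapter at encode time (interface assumed).
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

# Asymmetric retrieval: queries and passages use different adapters
query_vecs = model.encode(["what causes tides"], task="retrieval.query")
doc_vecs   = model.encode(["Tides are caused by the moon's gravity."], task="retrieval.passage")

# Symmetric similarity uses the text-matching adapter
sim_vecs   = model.encode(["The cat sat on the mat", "A cat is sitting on a mat"],
                          task="text-matching")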

Back to rerank: Jina has not yet released a rerank model based on jina-embeddings-v3 (presumably one will be released soon).

As for jina-reranker-v2 itself, it uses the same Cross-Encoder approach as BGE-M3.

GTE Rerank

The GTE embedding and rerank models were released by the Alibaba team. Alibaba has its own large model, Qwen, so its retrieval-related embedding and reranking models are basically trained and fine-tuned from Qwen as well. The models can be found on Hugging Face under Alibaba-NL.

As for papers, the main one is *Towards General Text Embeddings with Multi-stage Contrastive Learning*; I’m too lazy to go through it here.

Why do we need Rerank?

As mentioned above, many current rerank models are built on embedding models, which are themselves trained and fine-tuned from language models with relatively few parameters.

As also mentioned above, many embedding models keep improving how well they embed contextual semantics while supporting multiple languages, long texts, and so on. So why do we still need rerank when an embedding model plus a vector retrieval algorithm already achieves good results?

In principle, I personally think the difference lies mainly in two points:

  1. At the embedding level, rerank models currently mostly use a Cross-Encoder, meaning the query and document are concatenated and then embedded together; vector retrieval embeds the query and document separately, which is closer to the Bi-Encoder described above.
  2. In terms of scoring, a Cross-Encoder rerank model typically scores with log probabilities (and fine-tuning, adapters, etc. may still affect the final score), while vector search is a purely algorithmic computation with various distance metrics, the most common still being cosine distance. A small end-to-end sketch of the two stages follows this list.
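
To make the two points concrete, here is a small end-to-end sketch of the two stages using sentence-transformers: a Bi-Encoder plus cosine similarity for coarse retrieval, then a Cross-Encoder for fine-grained reranking of the top candidates. The checkpoint names are common public models used only as examples, not the ones discussed above.

# Sketch of a retrieve-then-rerank pipeline: bi-encoder recall, cross-encoder rerank.
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "what causes tides"
docs = [
    "Tides are caused by the gravitational pull of the moon and the sun.",
    "A tidal wave is a large movement of water caused by an earthquake.",
    "The stock market rose sharply today.",
]

# Stage 1: bi-encoder retrieval -- embed query and documents separately, rank by cosine
q_vec = bi_encoder.encode(query, normalize_embeddings=True)
d_vecs = bi_encoder.encode(docs, normalize_embeddings=True)
cos = d_vecs @ q_vec
top_k = np.argsort(-cos)[:2]                       # keep a small candidate set

# Stage 2: cross-encoder rerank -- score each (query, doc) pair jointly
pairs = [(query, docs[i]) for i in top_k]
scores = cross_encoder.predict(pairs)
reranked = [docs[i] for i in top_k[np.argsort(-scores)]]

print(reranked)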

In terms of actual data, quoting the numbers published by Cohere: overall, rerank still brings some improvement.

Cohere Rerank.png

Of course, the final conclusion is my personal take:

  • If the search has strict real-time requirements, then vector search alone is sufficient in my view.
  • If further accuracy is required, then reranking after retrieval is definitely better.

References

SIGIR 2008 Workshop Learning to Rank for Information Retrieval

[2009.03] Learning to Rank for Information Retrieval

[2009.07] Where to stop reading a ranked list?: threshold optimization using truncated score distributions

[2018.10] Seq2Slate: Re-ranking and Slate Optimization with RNNs

[2019.01] Passage Re-ranking with BERT

[2019.04] Understanding the Behaviors of BERT in Ranking

[2019.10] Multi-Stage Document Ranking with BERT

[2019.10] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

[2022.02] SGPT: GPT Sentence Embeddings for Semantic Search

[2023.07] Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models

[2023.08] Towards General Text Embeddings with Multi-stage Contrastive Learning

[2023.10] Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents

[2023.12] Making Large Language Models A Better Foundation For Dense Retrieval

[2024.02] BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

[2024.09] jina-embeddings-v3: Multilingual Embeddings With Task LoRA

Say Goodbye to Irrelevant Search Results: Cohere Rerank Is Here
