
Chain of Thought + Reinforcement Learning: Innovations in DeepSeek-R1 and Kimi k1.5 Papers

·1302 words·7 mins
AI LLM CoT Reinforcement Learning DeepSeek Kimi Model Distillation Chain of Thought
Weaxs

Introduction

In January 2024, the podcast Late Talk invited Yuan Jinhui, who was starting a new venture at the time, to discuss entrepreneurial directions. One viewpoint from that episode was particularly intriguing: inference demand for large models will exceed training demand.

At that time, foundation model teams were still pursuing the path of pre-training Scaling Laws.

It wasn't until September 2024 that OpenAI introduced o1, which teaches models to reason with Chain of Thought through reinforcement learning (RL) and spends additional test-time compute during inference, significantly improving logical reasoning and mathematical capability. However, OpenAI disclosed few technical details, offering only a general technical direction.

Recently, DeepSeek-R1 and Kimi k1.5 have emerged, bringing breakthroughs in this field. This article summarizes the latest research findings from these two models based on their papers.

DeepSeek-R1

The DeepSeek paper mentions two models—DeepSeek-R1-Zero and DeepSeek-R1.

DeepSeek-R1-Zero is built directly on the pre-trained base model (DeepSeek-V3-Base) and trained with RL + CoT, demonstrating that reasoning capability can be improved through large-scale reinforcement learning (RL) alone, even without supervised fine-tuning (SFT).

As for DeepSeek-R1, it uses DeepSeek-R1-Zero and DeepSeek-V3 to generate CoT reasoning data and non-reasoning data, then fine-tunes the model on this data via SFT. On top of the Zero-style training, it goes through a second stage of reinforcement learning to improve helpfulness and harmlessness. Most importantly, DeepSeek-R1 is then distilled, transferring its reasoning capability to smaller dense models.

[Figure: DeepSeek-R1-Zero]
[Figure: DeepSeek-R1]

Below, we’ll discuss reinforcement learning, CoT, and distillation in detail.

GRPO Reinforcement Learning

In terms of reinforcement learning, DeepSeek-R1 uses GRPO (Group Relative Policy Optimization), introduced in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.

GRPO is an improvement over PPO: it removes the value model and replaces the Generalized Advantage Estimation (GAE) function with a group-relative computation, estimating each response's advantage from the rewards of a group of responses sampled for the same prompt.

This change was made because the value model in PPO is typically the same scale as the policy model, which greatly increases memory and compute costs. Removing the value model also reduces the overall complexity of the algorithm.

[Figure: GRPO]

According to the paper, the algorithm process is as follows:

[Figure: GRPO algorithm]
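
To make the group computation concrete, here is a minimal sketch (not DeepSeek's actual implementation) of deriving group-relative advantages from the rewards of several responses sampled for the same prompt:

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Advantages for one group of G responses to the same prompt.

    Instead of a learned value model + GAE, each response's advantage is its reward
    normalized by the group's mean and standard deviation.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled answers to one prompt, scored by a rule-based reward model.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
advantages = group_relative_advantages(rewards)  # positive for better-than-average answers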

Reward Model Training

DeepSeek-R1 adopts rule-based reward models, mainly divided into two types:

  1. Accuracy rewards: Used to evaluate whether the model’s final response is correct, requiring the final answer to be provided in a specified output format.
  2. Format rewards: Requiring the model to place the thinking process between <think> and </think> tags.
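
As a rough illustration, rule-based rewards of this kind can be implemented with plain string and pattern checks. The sketch below is not DeepSeek's code; the regex and the exact-match answer check are assumptions:

import re

THINK_ANSWER_PATTERN = re.compile(
    r"<think>.*?</think>\s*<answer>(?P<answer>.*?)</answer>", re.DOTALL
)

def format_reward(response: str) -> float:
    """1.0 if the response follows the <think>...</think> <answer>...</answer> format."""
    return 1.0 if THINK_ANSWER_PATTERN.search(response) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the final answer extracted from the specified format matches the reference."""
    match = THINK_ANSWER_PATTERN.search(response)
    if match is None:
        return 0.0
    return 1.0 if match.group("answer").strip() == ground_truth.strip() else 0.0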

Training Template

To ensure the R1 models follow the specified format during reinforcement learning training, a simple template is designed. It guides the model to generate its reasoning process (Chain of Thought) first, and only then the final answer.

The template is roughly as follows:

A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in the mind and then provides the user
with the answer. The reasoning process and answer are enclosed within <think> </think> and
<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>. User: prompt. Assistant: 
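
For illustration, filling the template with a concrete question could look like this (the variable name and example question are just for the sketch):

SYSTEM_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. "
    "The assistant first thinks about the reasoning process in the mind and then provides the user "
    "with the answer. The reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> "
    "<answer> answer here </answer>. User: {prompt}. Assistant: "
)

rl_prompt = SYSTEM_TEMPLATE.format(prompt="What is 17 * 24?")
# The model is then expected to complete: <think> ... </think> <answer> 408 </answer>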

Model Distillation

DeepSeek also attempted to distill R1's reasoning capability into smaller dense models, significantly improving their reasoning ability through fine-tuning. The researchers curated 800,000 samples (600,000 reasoning samples + 200,000 non-reasoning samples) using DeepSeek-R1 and directly fine-tuned Qwen and Llama models on this data.

As background, model distillation generally falls into response-based and feature-based approaches. DeepSeek-R1's distillation appears to be response-based, using a softmax over the teacher's outputs to pull the student model's output distribution toward the teacher's, as shown in the following code:

## https://github.com/hkust-nlp/simpleRL-reason/blob/main/train/openrlhf/models/loss.py#L238

import torch.nn.functional as F
from torch import nn
import torch

## Adapted from https://github.com/microsoft/LMOps/blob/main/minillm/finetune.py#L166
class KDLoss(nn.Module):
    """
    Language Model Knowledge Distillation Loss
    """

    def __init__(self):
        super().__init__()
        self.IGNORE_INDEX = -100  # label value marking padded / masked-out token positions

    def forward(self, logits: torch.Tensor, teacher_logits: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
        # Teacher's probability distribution over the vocabulary at each token position
        teacher_probs = F.softmax(teacher_logits, dim=-1, dtype=torch.float32)
        # Student log-probabilities; positions with infinite student logits are masked out below
        inf_mask = torch.isinf(logits)
        logprobs = F.log_softmax(logits, dim=-1, dtype=torch.float32)
        # Cross entropy between teacher and student: sum over the vocab of p_teacher * log p_student
        prod_probs = torch.masked_fill(teacher_probs * logprobs, inf_mask, 0)
        x = torch.sum(prod_probs, dim=-1).view(-1)
        # Average only over valid (non-ignored) token positions
        mask = (label != self.IGNORE_INDEX).int()
        distil_loss = -torch.sum(x * mask.view(-1), dim=0) / torch.sum(mask.view(-1), dim=0)

        return distil_loss
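
As a quick sanity check, the loss can be exercised with dummy tensors; the shapes below (token positions × vocabulary size) are only illustrative:

kd_loss = KDLoss()
seq_len, vocab_size = 6, 32
student_logits = torch.randn(seq_len, vocab_size)
teacher_logits = torch.randn(seq_len, vocab_size)
# Only the ignore index matters here; the last two positions are excluded from the loss.
labels = torch.tensor([5, 12, 3, 9, kd_loss.IGNORE_INDEX, kd_loss.IGNORE_INDEX])
loss = kd_loss(student_logits, teacher_logits, labels)  # scalar, lower when distributions align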

Kimi k1.5

The development of Kimi k1.5 went through several stages: pre-training, supervised fine-tuning (SFT), long-CoT supervised fine-tuning, and reinforcement learning (RL).

Kimi has long had an advantage in long-context support, and this time they introduced long-context RL in the reinforcement learning phase, enabling the model to produce more detailed solutions and improve answer accuracy.

This paper mainly introduces the long-CoT supervised fine-tuning and reinforcement learning (RL) used in Kimi k1.5.

Prompt Set Management

The quality and diversity of prompt sets play a crucial role in the effectiveness of RL. The Kimi k1.5 paper summarizes three characteristics of high-quality prompt sets:

  • Diverse Coverage: The prompt set needs to cover a wide range of subjects to enhance model adaptability and ensure universality across different domains.
  • Balanced Difficulty: The prompt set needs to have an even distribution of simple, medium, and difficult questions to promote gradual learning and prevent overfitting to complex problems.
  • Accurate Evaluability: Prompts need to be evaluated objectively and reliably to ensure model thinking is based on correct reasoning processes.

To evaluate prompts, Kimi uses the model’s own capabilities to assess the difficulty of each prompt: For each prompt, they set the fine-tuned SFT model to a higher temperature and generate 10 answers; then calculate the accuracy rate among these 10 answers, with lower accuracy indicating higher difficulty.

There are also cases where the model arrives at the correct answer through incorrect reasoning. To identify and filter out such prompts, the paper proposes a simple but effective method: prompt the model to guess possible answers without any CoT reasoning steps. If the model hits the correct answer within N attempts, the prompt is considered too easy and is filtered out. The Kimi team ultimately set N to 8.
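
A minimal sketch of these two checks, assuming hypothetical generate(prompt, temperature, use_cot) and is_correct(answer, ground_truth) helpers that are not part of the paper:

def prompt_difficulty(prompt, ground_truth, generate, is_correct, k=10, temperature=1.0):
    """Sample k answers at a relatively high temperature; a lower pass rate means a harder prompt."""
    answers = [generate(prompt, temperature=temperature, use_cot=True) for _ in range(k)]
    pass_rate = sum(is_correct(a, ground_truth) for a in answers) / k
    return 1.0 - pass_rate  # 0.0 = trivially easy, 1.0 = never solved

def is_guessable(prompt, ground_truth, generate, is_correct, n=8):
    """Filter rule: if the model can guess the answer without any CoT within n tries,
    the prompt is considered too easy and is dropped."""
    for _ in range(n):
        guess = generate(prompt, temperature=1.0, use_cot=False)
        if is_correct(guess, ground_truth):
            return True
    return False

# Keep prompts that are not trivially guessable, e.g.:
# kept = [p for p in prompts if not is_guessable(p.text, p.answer, generate, is_correct)]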

Long-CoT SFT

Based on the above rules and filtering, plus rejection sampling, the Kimi team produced some high-quality CoT datasets. These generated datasets aim to encapsulate several key reasoning processes:

  • Planning: systematically planning the overall steps before producing a result
  • Evaluation: critically evaluating intermediate steps
  • Reflection: rethinking and refining a solution
  • Exploration: considering alternative solutions

The fine-tuned long-CoT model generates more detailed and logically coherent responses, which improves accuracy on reasoning tasks.

Reinforcement Learning

The sections above covered building long-CoT reasoning datasets and fine-tuning on them. Below are the strategies Kimi k1.5 introduces in long-context reinforcement learning:

  • Length Penalty: After long-CoT fine-tuning, the model showed signs of over-thinking. To address this, a length reward is introduced to suppress excessive token length: among correct answers, longer ones are rewarded less, and overly long incorrect answers are penalized. This length reward is added to the original reward with a weighting parameter (a sketch follows this list).
  • Sampling Strategies
    • Curriculum Sampling: Start with simple tasks and gradually move to more challenging ones. The collected data carries difficulty labels, so training can progress from easy to hard based on these labels.
    • Prioritized Sampling: Tracking the success rate of each problem and sampling problems based on success rates. Higher sampling probabilities are provided for problems with lower success rates, guiding the model to learn and improve in weak areas during the RL process.
  • long2short reinforcement learning: After standard RL, selecting a model that achieves the best balance between performance and token efficiency as the base model for separate long2short reinforcement learning. This part also uses length penalty to further suppress correct but longer answers.
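
Here is a sketch of a length reward with this shape; the 0.5 scale and the normalization by the group's minimum and maximum lengths are an approximation of the idea, not an exact reproduction of the paper's formula:

def length_rewards(lengths, correctness):
    """Length penalty in the spirit of Kimi k1.5: within a group of sampled answers,
    shorter correct answers get a bonus, longer ones a penalty, and incorrect answers
    are never rewarded for being long."""
    min_len, max_len = min(lengths), max(lengths)
    span = max(max_len - min_len, 1)
    rewards = []
    for length, correct in zip(lengths, correctness):
        lam = 0.5 - (length - min_len) / span
        rewards.append(lam if correct else min(0.0, lam))
    return rewards

# Example: four answers to the same prompt; the short correct one gets the largest bonus,
# the long incorrect one the largest penalty.
print(length_rewards([120, 480, 300, 900], [True, True, False, False]))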

References

  • deepseek-ai/DeepSeek-R1 (GitHub)
  • huggingface/open-r1 (GitHub): Fully open reproduction of DeepSeek-R1
  • hkust-nlp/simpleRL-reason (GitHub): A replication of DeepSeek-R1-Zero and DeepSeek-R1 training on small models with limited data
  • MoonshotAI/Kimi-k1.5 (GitHub)
  • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  • Kimi k1.5: Scaling Reinforcement Learning with LLMs
