Introduction #
In January 2024, Late Talk invited Yuan Jinhui, who was starting another venture, to discuss entrepreneurial directions. During the podcast, the viewpoint that “inference demands for large models will exceed training demands” was particularly intriguing.
At that time, foundation model teams were still pursuing the path of pre-training Scaling Laws.
It wasn’t until September that OpenAI introduced o1, which teaches the model to reason with Chain of Thought through Reinforcement Learning (RL) and spends additional test-time compute during inference, significantly improving its logical and mathematical capabilities. However, OpenAI disclosed few technical details, offering only a general technical direction.
Recently, DeepSeek-R1 and Kimi k1.5 have emerged, bringing breakthroughs in this field. This article summarizes the latest research findings from these two models based on their papers.
DeepSeek-R1 #
The DeepSeek paper mentions two models—DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero is built directly on the pre-trained base model (DeepSeek-V3-Base) and trained with RL + CoT, demonstrating that even without supervised fine-tuning (SFT), a model’s reasoning capabilities can be improved through large-scale reinforcement learning (RL) alone.
DeepSeek-R1, in turn, uses both DeepSeek-R1-Zero and DeepSeek-V3 to generate CoT reasoning data and non-reasoning data, and fine-tunes on this data with SFT. It then undergoes a further round of reinforcement learning, following the Zero recipe, to improve helpfulness and harmlessness. Most importantly, DeepSeek-R1’s reasoning capabilities are distilled into smaller dense models.
Below, we’ll discuss reinforcement learning, CoT, and distillation in detail.
GRPO Reinforcement Learning #
In terms of reinforcement learning, DeepSeek-R1 uses GRPO (Group Relative Policy Optimization, introduced in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models).
GRPO is an improvement on PPO: it removes the Value Model and replaces Generalized Advantage Estimation (GAE) with a group-relative advantage computed over multiple outputs sampled for the same prompt.
This change was made because the Value Model in PPO is typically the same size as the Policy Model, which greatly increases memory and compute costs; removing it also reduces the overall complexity of the algorithm.
According to the paper, the algorithm proceeds roughly as follows: for each prompt, sample a group of outputs from the current policy, score them with the reward model, compute group-relative advantages, and update the policy with a clipped objective plus a KL penalty against a reference model.
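The core of this is the group-relative advantage. Below is a minimal PyTorch sketch of that step only (not DeepSeek’s actual implementation; the function name and example reward values are illustrative):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Compute GRPO-style advantages for a group of sampled outputs.

    rewards: tensor of shape (G,), one scalar reward per output sampled for
    the same prompt. Each output's advantage is its reward normalized by the
    group's mean and standard deviation, so no value model is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled answers to one prompt, scored by a rule-based reward.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
advantages = group_relative_advantages(rewards)
# Each token of output i is then optimized with a PPO-style clipped objective
# weighted by advantages[i], plus a KL penalty toward the reference model.
```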
Reward Model Training #
DeepSeek-R1 adopts rule-based reward models, mainly divided into two types:
- Accuracy rewards: Evaluate whether the model’s final response is correct; the final answer must be given in a specified output format so it can be checked against the ground truth by rules.
- Format rewards: Require the model to place its thinking process between <think> and </think> tags.
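The paper does not release its reward code, so the following is only an illustrative sketch of what such rule-based rewards could look like; the answer-extraction regex, the exact-match check, and the way the two rewards are combined are assumptions:

```python
import re

# Assumed output format: <think> ... </think> followed by <answer> ... </answer>
TEMPLATE_RE = re.compile(r"<think>(.+?)</think>\s*<answer>(.+?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the <think>/<answer> template, else 0.0."""
    return 1.0 if TEMPLATE_RE.search(completion) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the extracted final answer matches the reference answer exactly."""
    m = TEMPLATE_RE.search(completion)
    if m is None:
        return 0.0
    answer = m.group(2).strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    # How the two rewards are combined (summed, weighted, ...) is an assumption here.
    return accuracy_reward(completion, ground_truth) + format_reward(completion)
```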
Training Template #
To ensure the R1 models follow the specified format during reinforcement learning training, a simple template is designed. It makes the model generate the reasoning process (Chain of Thought) first and only then the final answer.
The template is roughly as follows:
A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in the mind and then provides the user
with the answer. The reasoning process and answer are enclosed within <think> </think> and
<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>. User: prompt. Assistant:
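As a small illustration, filling this template with a user question might look like the following (the R1_TEMPLATE constant and build_prompt helper are hypothetical, but the template string itself is the one quoted above):

```python
R1_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. "
    "The assistant first thinks about the reasoning process in the mind and then provides the user "
    "with the answer. The reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> "
    "<answer> answer here </answer>. User: {prompt}. Assistant:"
)

def build_prompt(question: str) -> str:
    """Substitute the user question into the fixed RL training template."""
    return R1_TEMPLATE.format(prompt=question)

print(build_prompt("What is 7 * 8?"))
```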
Model Distillation #
DeepSeek attempted to distill R1’s reasoning capabilities and significantly improved smaller dense models’ reasoning abilities through fine-tuning. The researchers selected 800,000 sample data points (600,000 reasoning data + 200,000 non-reasoning data) using DeepSeek-R1 and directly fine-tuned Qwen and Llama based on this data.
Based on previous knowledge points, we know that model distillation generally falls into response-based and feature-based categories. DeepSeek-R1’s distillation appears to be response-based: the teacher’s logits are converted into probabilities with softmax, and the student is trained so that its output distribution moves closer to the teacher’s, as shown in the following code:
## https://github.com/hkust-nlp/simpleRL-reason/blob/main/train/openrlhf/models/loss.py#L238
import torch
import torch.nn.functional as F
from torch import nn


## Adapted from https://github.com/microsoft/LMOps/blob/main/minillm/finetune.py#L166
class KDLoss(nn.Module):
    """
    Language Model Knowledge Distillation Loss
    """

    def __init__(self):
        super().__init__()
        self.IGNORE_INDEX = -100  # label value marking prompt / padding tokens excluded from the loss

    def forward(self, logits: torch.Tensor, teacher_logits: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
        # Teacher distribution over the vocabulary at every token position
        teacher_probs = F.softmax(teacher_logits, dim=-1, dtype=torch.float32)
        # Mask positions where the student logits are infinite (numerical safety)
        inf_mask = torch.isinf(logits)
        # Student log-probabilities over the vocabulary
        logprobs = F.log_softmax(logits, dim=-1, dtype=torch.float32)
        # Cross-entropy term: sum over the vocabulary of p_teacher * log p_student
        prod_probs = torch.masked_fill(teacher_probs * logprobs, inf_mask, 0)
        x = torch.sum(prod_probs, dim=-1).view(-1)
        # Average only over positions with a real label (not IGNORE_INDEX)
        mask = (label != self.IGNORE_INDEX).int()
        distil_loss = -torch.sum(x * mask.view(-1), dim=0) / torch.sum(mask.view(-1), dim=0)
        return distil_loss
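A minimal usage sketch with random tensors, purely to show the expected input shapes (the batch size, sequence length, and vocabulary size here are made up):

```python
batch, seq_len, vocab = 2, 8, 32000
student_logits = torch.randn(batch, seq_len, vocab)
teacher_logits = torch.randn(batch, seq_len, vocab)
labels = torch.randint(0, vocab, (batch, seq_len))
labels[:, :3] = -100  # e.g., prompt tokens are excluded from the loss

loss = KDLoss()(student_logits, teacher_logits, labels)
print(loss.item())
```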
Kimi k1.5 #
The development of Kimi k1.5 went through several stages: pre-training, supervised fine-tuning (SFT), long-CoT supervised fine-tuning, and reinforcement learning (RL).
Kimi has long had an advantage in supporting long contexts, and this time it also introduced long-context RL in the reinforcement learning phase, enabling the model to produce more detailed solutions and improve response accuracy.
Below, we mainly cover the long-CoT supervised fine-tuning and reinforcement learning (RL) used in Kimi k1.5.
Prompt Set Management #
The quality and diversity of prompt sets play a crucial role in the effectiveness of RL. The Kimi k1.5 paper summarizes three characteristics of high-quality prompt sets:
- Diverse Coverage: The prompt set needs to cover a wide range of subjects to enhance model adaptability and ensure universality across different domains.
- Balanced Difficulty: The prompt set needs to have an even distribution of simple, medium, and difficult questions to promote gradual learning and prevent overfitting to complex problems.
- Accurate Evaluability: Prompts must allow objective and reliable evaluation of answers, so that correct responses reflect sound reasoning rather than lucky guesses.
To assess difficulty, Kimi uses the model’s own capability: for each prompt, the fine-tuned SFT model generates 10 answers at a relatively high temperature, and the pass rate over those 10 answers is computed; the lower the pass rate, the harder the prompt.
There are also cases where the model reaches the correct answer through flawed reasoning or lucky guessing. To identify and filter such prompts, the paper proposes a simple but effective method: ask the model to guess an answer directly, without any CoT reasoning steps; if it produces the correct answer within N attempts, the prompt is considered too easy to guess and is removed. The Kimi team set N to 8.
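Both steps can be sketched roughly as follows, assuming hypothetical generate(prompt, temperature=...) and check_answer(completion, ground_truth) helpers; the no-CoT instruction string is also an assumption:

```python
from typing import Callable

def estimate_difficulty(
    prompt: str,
    ground_truth: str,
    generate: Callable[..., str],
    check_answer: Callable[[str, str], bool],
    k: int = 10,
    temperature: float = 1.0,
) -> float:
    """Difficulty proxy: 1 - pass rate over k high-temperature samples."""
    passed = sum(
        check_answer(generate(prompt, temperature=temperature), ground_truth)
        for _ in range(k)
    )
    return 1.0 - passed / k

def easy_to_guess(
    prompt: str,
    ground_truth: str,
    generate: Callable[..., str],
    check_answer: Callable[[str, str], bool],
    n: int = 8,
) -> bool:
    """Filter rule: ask for a direct answer with no CoT; if any of n guesses
    is correct, the prompt is considered guessable and removed from the set."""
    no_cot_prompt = prompt + "\nAnswer directly without showing any reasoning."
    return any(
        check_answer(generate(no_cot_prompt, temperature=1.0), ground_truth)
        for _ in range(n)
    )
```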
Long-CoT SFT #
Based on the above rules and filtering, plus rejection sampling, the Kimi team produced some high-quality CoT datasets. These generated datasets aim to encapsulate several key reasoning processes:
- Planning ability: systematically planning the overall steps before producing a result
- Evaluation ability: critically evaluating intermediate steps
- Reflection ability: rethinking and refining a solution
- Exploration ability: considering alternative solutions
The resulting long-CoT fine-tuned model produces more detailed and logically coherent outputs, which improves accuracy on reasoning tasks.
Reinforcement Learning #
The preceding sections covered constructing long-CoT data and fine-tuning on it. Below are the strategies Kimi k1.5 introduced for long-context reinforcement learning:
- Length Penalty: After long-CoT fine-tuning, the model exhibited over-thinking. To address this, a length reward was introduced to suppress excessive token length: among correct answers, longer ones receive a smaller reward, and incorrect long answers are penalized as well. This length reward is added to the original reward with a weighting parameter (a rough sketch follows this list).
- Sampling Strategies:
  - Curriculum Sampling: Start by training on simple tasks and gradually move to more challenging ones. The collected data carries difficulty labels, allowing training to progress from easy to hard based on them.
  - Prioritized Sampling: Track the success rate of each problem and sample accordingly, giving higher sampling probability to problems with lower success rates so that the model focuses on its weak areas during RL.
- long2short reinforcement learning: After standard RL, the model that strikes the best balance between performance and token efficiency is selected as the base for a separate long2short RL phase, which again applies the length penalty to further suppress responses that are correct but overly long.
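A rough sketch of how such a length reward could be computed over a group of sampled answers to the same prompt, loosely following the description above (the linear form and the example values are assumptions; the weighting into the final reward is not shown):

```python
from typing import List

def length_rewards(lengths: List[int], corrects: List[bool]) -> List[float]:
    """Length penalty over a group of sampled answers to the same prompt.

    Shorter answers get a higher (less negative) reward; incorrect answers
    never receive a positive length bonus. The linear form used here is an
    approximation of the scheme described in the k1.5 paper.
    """
    min_len, max_len = min(lengths), max(lengths)
    span = max(max_len - min_len, 1)  # avoid division by zero
    rewards = []
    for length, correct in zip(lengths, corrects):
        lam = 0.5 - (length - min_len) / span
        rewards.append(lam if correct else min(0.0, lam))
    return rewards

# Example: four sampled answers with their token lengths and correctness.
print(length_rewards([120, 480, 300, 900], [True, True, False, True]))
# The resulting values would be added to the task reward with a weighting
# parameter before the RL update.
```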
References #
Fully open reproduction of DeepSeek-R1
A replication of DeepSeek-R1-Zero and DeepSeek-R1 training on small models with limited data
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via…