Hugging Face Releases the First Open-R1 Update: Progress and Challenges in the Open-Source Reproduction of DeepSeek-R1

### Background and the Goals of the Open-R1 Project

Since the release of DeepSeek-R1, its strong reasoning capability and remarkably low training cost have shaken the AI community. To democratize this technology, Hugging Face launched the "Open-R1" open-source project, which aims to reproduce the complete DeepSeek-R1 training pipeline in the open, including both the supervised fine-tuning (SFT) and reinforcement learning (RL) stages. This "Update #1" is the project's first official progress report since launch. It documents the team's technical approach, preliminary results, and the challenges encountered during the reproduction.

### Core Technical Progress

1. **Reinforcement Learning Framework and the GRPO Algorithm**: The Open-R1 team primarily trains with Hugging Face's own TRL (Transformer Reinforcement Learning) library, focusing on the **GRPO (Group Relative Policy Optimization)** algorithm. Unlike traditional PPO, GRPO does not require an additional Critic model. Instead, it samples multiple outputs for the same prompt and scores each one relative to the rest of the group. This substantially reduces VRAM requirements during training, making reinforcement learning for reasoning models feasible for small and medium-sized teams.
2. **Datasets and Training Recipes**: The key to reproducing reasoning models lies in high-quality math and code data. The Open-R1 team has organized and open-sourced relevant datasets (such as NuminaMath-CoT) and has released the first batch of training recipes in the GitHub repository. Developers can apply these recipes directly for GRPO training on top of open-source base models such as Llama-3.1-8B or Qwen-2.5-7B.
3. **Preliminary Model Performance and "Thinking" Behavior**: In early experiments, the team successfully elicited "thinking" behavior similar to DeepSeek-R1's from its models.
   When faced with math problems, models automatically insert `<thought>` and `</thought>` tags into their output and perform multi-step logical reasoning within those tags, providing the final answer only in the `<answer>` tag. While the current models still lag behind the full DeepSeek-R1 on complex tasks, the feasibility of the open-source reproduction path has been confirmed.

### Technical Challenges Encountered

In the report, the team also openly shared the difficult problems they are currently facing:

* **Reward Hacking**: During reinforcement learning, models sometimes find "shortcuts" to score highly. For instance, to earn a reward for "detailed thinking," a model might deliberately output an extremely verbose, repetitive, and meaningless chain of thought, a phenomenon known as "Length Bias." The team is experimenting with adjusting the reward function to penalize meaninglessly long outputs.
* **Format Consistency**: Ensuring that the model strictly adheres to the XML tag format (i.e., always producing complete `thought` and `answer` tags) under all circumstances is a major challenge. If the format breaks down, the evaluation system cannot correctly parse the answer. The team is currently guiding the model by adding a "Format Reward" within GRPO.

### Next Steps

The Open-R1 team stated that they will continue to scale up training and explore multi-stage combinations of "cold-start SFT" and RL. They also called on the global open-source community to participate, whether by contributing compute, improving datasets, or refining the GRPO implementation in TRL, to jointly advance the development of open-source reasoning models.
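
To make the critic-free scoring in GRPO and the "Format Reward" idea more concrete, here is a minimal, illustrative sketch in plain Python. It is not the Open-R1 or TRL implementation. The tag names follow this article's wording, and the function names, the equal weighting of the two rewards, the use of the population standard deviation, and the `eps` value are all assumptions made for the example.

```python
import re
import statistics

# Assumed tag names, taken from this article's wording; the actual recipes
# may use different tags.
FORMAT_PATTERN = re.compile(
    r"^<thought>(.+?)</thought>\s*<answer>(.+?)</answer>$", re.DOTALL
)


def format_reward(completion: str) -> float:
    """1.0 if the text is a thought block followed by an answer block, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the parsed answer equals the reference string, else 0.0.

    An unparseable completion earns nothing, which mirrors the point that a
    broken format prevents the evaluator from reading the answer.
    """
    match = FORMAT_PATTERN.match(completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0


def total_reward(completion: str, reference: str) -> float:
    # Hypothetical equal weighting of the two reward terms.
    return accuracy_reward(completion, reference) + format_reward(completion)


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the mean and spread of its own group.

    This replaces the learned critic of PPO: the baseline is simply the
    average reward of the other samples drawn for the same prompt.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an illustrative choice
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt ("What is 2 + 2?"), four sampled completions.
completions = [
    "<thought>2 + 2 = 4</thought><answer>4</answer>",  # correct and well formed
    "<thought>2 + 2 = 5</thought><answer>5</answer>",  # well formed but wrong
    "The answer is 4",                                 # no tags, cannot be parsed
    "<thought>hmm</thought><answer>4</answer>",        # correct and well formed
]
rewards = [total_reward(c, "4") for c in completions]  # [2.0, 1.0, 0.0, 2.0]
advantages = group_relative_advantages(rewards)
```

In this sketch, completions that score above the group average receive a positive advantage and are reinforced, while those below it are discouraged. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.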