What is Light-R1?
Light-R1 is 360 Zhinao's open-source AI model, focused on long chain-of-thought reasoning in mathematics; the flagship model is Light-R1-32B. It is based on Qwen2.5-32B-Instruct and was trained with 70,000 math examples using two-stage curriculum learning (SFT + DPO), starting from scratch (no prior long chain-of-thought capability) and ultimately surpassing DeepSeek-R1-Distill-Qwen-32B. On the AIME24 benchmark, Light-R1 scored 76.6, notably higher than DeepSeek-R1-Distill's 72.6. Training cost is low: about 6 hours on 12 H800 machines, roughly $1,000. The project is fully open source, including the model, datasets, training framework, and evaluation code, to support the open-source community and provide a reference for low-cost training of domain-specialized models.
Main functions of Light-R1
Efficient math problem solving: quickly and accurately solves complex math problems across algebra, geometry, probability, and other areas.
Improved reasoning: strong logical reasoning ability, with support for long chain-of-thought problems.
Generalization: shows transfer to other domains, such as logical reasoning and language comprehension.
Low-cost training and deployment: achieves high performance at very low cost, suitable for rapid deployment by users or enterprises with limited resources.
Technical principles of Light-R1
Base model and starting point: the model is built on Qwen2.5-32B-Instruct, improving from scratch to surpass DeepSeek-R1-Distill-Qwen-32B.
Curriculum learning:
SFT (Supervised Fine-Tuning): data is screened by difficulty and fine-tuned in two stages. Stage 1 uses the full 70,000 examples; stage 2 selects the 3,000 hardest problems for further fine-tuning.
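The stage-2 "hardest problems" selection can be sketched as a simple difficulty filter. This is a hypothetical illustration, not Light-R1's actual pipeline: here each problem carries a `pass_rate` (assumed to come from sampling a reference model on it), and the lowest pass rates are treated as the hardest.

```python
def curriculum_split(samples, hard_fraction=0.05):
    """Illustrative difficulty filter for a curriculum-SFT stage 2.
    Sorts by pass rate (ascending, so hardest problems come first)
    and keeps only the hardest slice of the dataset."""
    ranked = sorted(samples, key=lambda s: s["pass_rate"])
    k = max(1, int(len(samples) * hard_fraction))
    return ranked[:k]

# Toy data; real pass rates would be measured, not hand-written.
samples = [
    {"id": 1, "pass_rate": 0.9},
    {"id": 2, "pass_rate": 0.1},
    {"id": 3, "pass_rate": 0.4},
]
hardest = curriculum_split(samples, hard_fraction=0.34)
print([s["id"] for s in hardest])  # → [2]
```

In a real pipeline the fraction would be chosen so that roughly 3,000 of 70,000 examples survive (about 4%).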
DPO (Direct Preference Optimization): building on the SFT checkpoint, preference pairs are constructed from multiple sampled responses, and DPO is used to optimize the model's output quality.
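One common way to turn "multiple sampling" into DPO training data is to pair verified-correct responses (chosen) with incorrect ones (rejected). The sketch below is a generic illustration of that idea under assumed names (`build_preference_pairs`, a `(text, is_correct)` response format), not Light-R1's published code.

```python
def build_preference_pairs(problem, responses):
    """Assemble DPO-style (prompt, chosen, rejected) records.
    `responses` is a list of (text, is_correct) tuples, where correctness
    is assumed to come from checking the final answer against a reference."""
    chosen = [text for text, ok in responses if ok]
    rejected = [text for text, ok in responses if not ok]
    return [
        {"prompt": problem, "chosen": c, "rejected": r}
        for c in chosen
        for r in rejected
    ]

pairs = build_preference_pairs(
    "Compute 2 + 2.",
    [("The answer is 4.", True), ("The answer is 5.", False)],
)
print(len(pairs))  # → 1
```

Records in this shape are what DPO trainers (e.g. in common preference-optimization libraries) typically consume.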
Data processing and deduplication: the training data is drawn from multiple open-source math datasets (such as OpenR1-Math-220k and OpenThoughts-114k) and is strictly deduplicated against benchmark test sets to prevent test-data leakage from inflating model performance.
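Decontamination against benchmarks is often done by normalizing problem text and dropping training items that match a test item. A minimal sketch, assuming exact match after normalization (real pipelines frequently add n-gram overlap checks as well):

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting
    differences don't hide a duplicate."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def decontaminate(train_problems, test_problems):
    """Drop any training problem whose normalized text appears in a
    benchmark test set. Exact match shown for brevity."""
    test_set = {normalize(t) for t in test_problems}
    return [p for p in train_problems if normalize(p) not in test_set]

train = ["Solve  x + 1 = 2.", "Find the area of a unit circle."]
test = ["solve x + 1 = 2."]
print(decontaminate(train, test))  # → ['Find the area of a unit circle.']
```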
Model fusion: the final Light-R1-32B is obtained by merging the SFT stage-2 model, the DPO model, and a second DPO variant, further improving performance and stability.
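A common form of model fusion is a weighted average of checkpoint parameters. The sketch below illustrates that idea with plain floats standing in for weight tensors; it is an assumption about the general technique, not Light-R1's exact merging recipe.

```python
def merge_models(state_dicts, weights=None):
    """Linearly merge checkpoints by averaging each parameter.
    `state_dicts`: list of {param_name: value}; floats stand in for
    tensors here, but the arithmetic is the same elementwise."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n  # default: uniform average
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged

# Merging three checkpoints (e.g. SFT stage 2 + two DPO variants):
merged = merge_models([{"w": 1.0}, {"w": 2.0}, {"w": 6.0}])
print(merged["w"])  # → 3.0
```

Non-uniform `weights` let one checkpoint dominate the merge when it is known to be stronger.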
Training framework and optimization: training uses the 360-LLaMA-Factory framework, which supports sequence parallelism and efficient distributed training. With the optimized training process, Light-R1 completes training in about 6 hours on 12 H800 machines.
Light-R1 project address
GitHub repository: https://github.com/Qihoo360/Light-R1
HuggingFace model collection: https://huggingface.co/collections/qihoo360/light-r1
Application scenarios of Light-R1
Education: serves as a mathematics learning tool that helps students solve complex problems and provides solution steps and reasoning, suitable for math competitions and daily study.
Scientific research and academia: assists mathematical research and interdisciplinary problem solving, such as physics modeling and engineering optimization.
Enterprise applications: tackles complex problems such as data analysis, risk assessment, and supply chain optimization.
Software integration: can be integrated into smart assistants and mathematical software to strengthen reasoning and problem-solving features.
Open source and developers: supports customization and extension by developers, promoting the growth of the open-source community.