FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization
FIPO_32B, developed by Qwen Pilot (Alibaba Group), is a 32-billion-parameter model built upon Qwen2.5-32B-Base. It introduces a novel value-free Reinforcement Learning (RL) recipe called Future-KL Influenced Policy Optimization (FIPO) to enhance deep reasoning capabilities. Unlike standard RL methods that assign credit coarsely at the sequence level, FIPO employs a discounted Future-KL term to provide a more granular per-token signal that reflects how the rest of the trajectory evolves after each token.
Key Capabilities
- Deeper Reasoning: FIPO significantly extends the average chain-of-thought length from a typical 4,000 tokens to over 10,000 tokens, enabling more complex and sustained reasoning.
- Enhanced Performance: Achieves a peak AIME 2024 Pass@1 accuracy of 58.0%, improving on the 50.0% baseline and outperforming reproduced pure-RL baselines such as DAPO and DeepSeek-R1-Zero-32B, as well as o1-mini.
- Dense Advantage Formulation: Reweights each token's advantage by the discounted signed shift of its future trajectory, yielding a denser, more informative learning signal than uniform sequence-level credit.
- Pure RL Approach: Demonstrates strong performance with pure RL training alone, without relying on a learned value function.
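To make the dense advantage idea concrete, here is a minimal sketch of how a discounted Future-KL reweighting could look. This is an illustrative approximation, not the released implementation: the function names (`future_kl`, `fipo_advantages`), the `tanh`-based reweighting, and the mean baseline are all assumptions; the source only specifies that each token is reweighted by the discounted signed shift of its future trajectory.

```python
import numpy as np

def future_kl(token_kls, gamma=0.95):
    """Discounted Future-KL per token: F_t = sum_{k > t} gamma^(k-t-1) * KL_k.
    token_kls: per-token KL divergences between current policy and reference.
    Computed with a single backward pass (hypothetical formulation)."""
    T = len(token_kls)
    future = np.zeros(T)
    acc = 0.0
    for t in range(T - 1, -1, -1):
        future[t] = acc                  # KL accumulated strictly after token t
        acc = token_kls[t] + gamma * acc # fold token t in for earlier positions
    return future

def fipo_advantages(seq_advantage, token_kls, gamma=0.95):
    """Turn one sequence-level advantage into dense per-token advantages
    by the signed shift of each token's discounted future trajectory.
    The tanh squashing and mean baseline are illustrative choices."""
    f = future_kl(token_kls, gamma)
    shift = f - f.mean()                 # signed deviation from the trajectory average
    return seq_advantage * (1.0 + np.tanh(shift))
```

In this sketch, tokens whose future trajectory diverges more from the reference (relative to the trajectory's average) receive amplified credit, giving the optimizer a per-token signal instead of the uniform weighting used by methods like GRPO or DAPO.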
Good For
- Tasks requiring extended, multi-step reasoning and complex problem-solving.
- Applications where deep, coherent chain-of-thought generation is critical.
- Research and development in advanced RL techniques for language models.
- Benchmarks and evaluations focused on mathematical and logical reasoning, such as AIME 2024.