FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization
FIPO_32B, developed by Qwen Pilot (Alibaba Group), is a 32-billion-parameter model built upon Qwen2.5-32B-Base. It introduces a novel value-free Reinforcement Learning (RL) recipe called Future-KL Influenced Policy Optimization (FIPO) to enhance deep reasoning capabilities. Unlike standard RL methods that assign credit coarsely at the sequence level, FIPO employs a discounted Future-KL term to provide a more granular per-token signal that reflects how the rest of the trajectory evolves after each token.
Key Capabilities
- Deeper Reasoning: FIPO significantly extends the average chain-of-thought length from a typical 4,000 tokens to over 10,000 tokens, enabling more complex and sustained reasoning.
- Enhanced Performance: Achieves a peak AIME 2024 Pass@1 accuracy of 58.0%, improving on the 50.0% baseline and outperforming reproduced pure-RL baselines such as DAPO and DeepSeek-R1-Zero-32B, as well as o1-mini.
- Dense Advantage Formulation: Reweights each token's advantage by the discounted signed shift of its future trajectory, yielding a denser, more informative learning signal than uniform sequence-level credit.
- Pure RL Approach: Demonstrates strong performance with pure RL training alone, without relying on a learned value function.
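To make the dense advantage idea concrete, here is a minimal sketch of how a discounted Future-KL reweighting could look. This is an illustrative approximation, not the released implementation: the function names (`future_kl`, `fipo_advantages`), the `tanh`-based reweighting, and the mean baseline are all assumptions; the source only specifies that each token is reweighted by the discounted signed shift of its future trajectory.

```python
import numpy as np

def future_kl(token_kls, gamma=0.95):
    """Discounted Future-KL per token: F_t = sum_{k > t} gamma^(k-t-1) * KL_k.
    token_kls: per-token KL divergences between current policy and reference.
    Computed with a single backward pass (hypothetical formulation)."""
    T = len(token_kls)
    future = np.zeros(T)
    acc = 0.0
    for t in range(T - 1, -1, -1):
        future[t] = acc                  # KL accumulated strictly after token t
        acc = token_kls[t] + gamma * acc # fold token t in for earlier positions
    return future

def fipo_advantages(seq_advantage, token_kls, gamma=0.95):
    """Turn one sequence-level advantage into dense per-token advantages
    by the signed shift of each token's discounted future trajectory.
    The tanh squashing and mean baseline are illustrative choices."""
    f = future_kl(token_kls, gamma)
    shift = f - f.mean()                 # signed deviation from the trajectory average
    return seq_advantage * (1.0 + np.tanh(shift))
```

In this sketch, tokens whose future trajectory diverges more from the reference (relative to the trajectory's average) receive amplified credit, giving the optimizer a per-token signal instead of the uniform weighting used by methods like GRPO or DAPO.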
Good For
- Tasks requiring extended, multi-step reasoning and complex problem-solving.
- Applications where deep, coherent chain-of-thought generation is critical.
- Research and development in advanced RL techniques for language models.
- Benchmarks and evaluations focused on mathematical and logical reasoning, such as AIME 2024.