chiyum609/FIPO_32B
Text Generation | Concurrency Cost: 2 | Model Size: 32.8B | Quant: FP8 | Ctx Length: 32k | Published: Mar 20, 2026 | License: apache-2.0 | Architecture: Transformer

FIPO_32B, developed by Qwen Pilot (Alibaba Group), is a 32 billion parameter model based on Qwen2.5-32B-Base, designed to elicit deeper reasoning capabilities. It utilizes a novel value-free RL recipe called Future-KL Influenced Policy Optimization (FIPO) to provide more granular token credit assignment. This approach enables the model to break through typical length stagnation, extending average chain-of-thought responses from 4,000 to over 10,000 tokens, and improving AIME 2024 Pass@1 accuracy from 50.0% to 58.0%.


FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

FIPO_32B, developed by Qwen Pilot (Alibaba Group), is a 32 billion parameter model built upon Qwen2.5-32B-Base. It introduces a novel value-free Reinforcement Learning (RL) recipe called Future-KL Influenced Policy Optimization (FIPO) to enhance deep reasoning capabilities. Unlike standard RL methods that use coarse token credit assignment, FIPO employs a discounted Future-KL term to provide a more granular signal, reflecting how the rest of the trajectory evolves after each token.
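The discounted Future-KL idea can be sketched as a backward pass over a trajectory: each token's credit is scaled by a discounted accumulation of the KL shifts that occur after it. The exact FIPO formula is not given here, so the discount factor, the sign handling, and the overall weighting below are illustrative assumptions, not the published recipe:

```python
import numpy as np

def fipo_advantages(seq_reward, kl_per_token, gamma=0.95):
    """Sketch of a FIPO-style dense advantage (illustrative, not the official recipe).

    Each token's advantage is the sequence-level reward reweighted by a
    discounted sum of the per-token KL shifts that occur *after* that token,
    so earlier tokens are credited for how the rest of the trajectory evolves.
    `gamma` and the 1 + sign(r) * future_kl weighting are assumptions.
    """
    T = len(kl_per_token)
    future_kl = np.zeros(T)
    acc = 0.0
    # Backward pass: store the discounted KL mass strictly after token t,
    # then fold token t's own KL shift into the accumulator.
    for t in range(T - 1, -1, -1):
        future_kl[t] = acc
        acc = kl_per_token[t] + gamma * acc
    # Signed reweighting: tokens followed by larger trajectory shifts
    # receive proportionally more (or, for negative reward, less) credit.
    weights = 1.0 + np.sign(seq_reward) * future_kl
    return seq_reward * weights
```

Because the weight varies per token, the learning signal is denser than a single sequence-level advantage copied to every position, which is the coarse assignment the paragraph above contrasts against.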

Key Capabilities

  • Deeper Reasoning: FIPO significantly extends the average chain-of-thought length from a typical 4,000 tokens to over 10,000 tokens, enabling more complex and sustained reasoning.
  • Enhanced Performance: Achieves a peak AIME 2024 Pass@1 accuracy of 58.0%, up from a 50.0% baseline, outperforming reproduced pure-RL baselines such as DAPO and DeepSeek-R1-Zero-32B and surpassing o1-mini.
  • Dense Advantage Formulation: Reweights each token by the discounted, signed shift of its future trajectory, yielding a denser and more effective learning signal.
  • Pure RL Approach: Demonstrates strong performance using only pure RL training, without relying on value functions.
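The value-free aspect means advantages come from comparing sampled responses against each other rather than from a learned critic. A minimal sketch of such a baseline, in the GRPO/DAPO style that FIPO is compared against (the group-normalization form is an assumption; FIPO's own baseline may differ):

```python
import numpy as np

def group_relative_advantage(rewards):
    """Value-free baseline sketch: normalize each sampled response's reward
    by the group mean and std (GRPO/DAPO-style), so no learned value
    function is needed. Illustrative; not FIPO's exact normalization.
    """
    r = np.asarray(rewards, dtype=float)
    # Small epsilon guards against a zero-variance group (all rewards equal).
    return (r - r.mean()) / (r.std() + 1e-8)
```

A dense method like FIPO can then distribute such a sequence-level advantage across tokens instead of assigning the same value to every position.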

Good For

  • Tasks requiring extended, multi-step reasoning and complex problem-solving.
  • Applications where deep, coherent chain-of-thought generation is critical.
  • Research and development in advanced RL techniques for language models.
  • Benchmarks and evaluations focused on mathematical and logical reasoning, such as AIME 2024.