QwenPilot/FIPO_32B

Text Generation · Concurrency Cost: 2 · Model Size: 32.8B · Quant: FP8 · Ctx Length: 32k · Published: Mar 22, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

QwenPilot/FIPO_32B is a 32-billion-parameter language model developed by Qwen Pilot, Alibaba Group, built on the Qwen2.5-32B-Base architecture. It is trained with Future-KL Influenced Policy Optimization (FIPO), a value-free reinforcement learning method designed to elicit deeper reasoning. The model extends chain-of-thought reasoning length beyond 10,000 tokens and significantly improves performance on complex reasoning benchmarks such as AIME 2024.


Overview

QwenPilot/FIPO_32B is a 32-billion-parameter model developed by Qwen Pilot, Alibaba Group, focused on enhancing deep reasoning capabilities through a novel reinforcement learning approach. Built on the Qwen2.5-32B-Base architecture, FIPO (Future-KL Influenced Policy Optimization) introduces a dense advantage formulation that reweights each token's credit by the discounted signed shift of its future trajectory, replacing coarse outcome-level signals with a per-token signal.
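The per-token reweighting described above can be illustrated with a small sketch. This is a hypothetical reconstruction, not the published FIPO formulation: the function name, the shape of the inputs, and the way the outcome reward is combined with the discounted future shifts are all assumptions made for illustration.

```python
from typing import List

def fipo_token_advantages(
    outcome_reward: float,
    kl_shifts: List[float],
    gamma: float = 0.99,
) -> List[float]:
    """Illustrative sketch of a FIPO-style dense advantage.

    Instead of sharing one coarse outcome-level signal across all
    tokens, each token t is reweighted by the discounted sum of the
    signed KL shifts over the future of its trajectory (t .. T-1).
    `kl_shifts[k]` stands in for the signed policy shift measured at
    token k; both names are hypothetical.
    """
    T = len(kl_shifts)
    advantages = []
    for t in range(T):
        # Discounted signed shift of the future trajectory seen from token t.
        future_weight = sum(gamma ** (k - t) * kl_shifts[k] for k in range(t, T))
        advantages.append(outcome_reward * future_weight)
    return advantages
```

For example, with `outcome_reward=1.0`, `kl_shifts=[1.0, -0.5, 0.25]`, and `gamma=0.5`, the sketch assigns advantages `[0.8125, -0.375, 0.25]`: early tokens whose futures drift negatively are down-weighted even when the final outcome is rewarded.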

Key Capabilities & Differentiators

  • Pure RL Optimization: FIPO demonstrates superior performance compared to reproduced pure-RL baselines like DAPO and DeepSeek-R1-Zero-32B, and surpasses o1-mini on the AIME 2024 benchmark.
  • Extended Reasoning Depth: It effectively breaks the typical 4,000-token reasoning length plateau, extending average chain-of-thought reasoning to over 10,000 tokens.
  • Enhanced Performance: This extended reasoning directly translates to stronger performance, with AIME 2024 Pass@1 accuracy improving from 50.0% to a peak of 58.0%.
  • Future-KL Influenced Policy Optimization: The core innovation is a value-free RL recipe that uses a discounted Future-KL term to provide granular, per-token reinforcement signals, enabling the model to convert additional generation length into genuine reasoning depth.

Good For

  • Complex Reasoning Tasks: Ideal for applications requiring deep, multi-step reasoning, such as advanced mathematical problem-solving or scientific inquiry.
  • Long Chain-of-Thought Generation: Suitable for scenarios where extended, coherent, and logically sound reasoning chains are critical for accurate outputs.
  • Research in RL for LLMs: Offers a strong baseline and innovative approach for researchers exploring reinforcement learning techniques to improve language model reasoning.
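A minimal usage sketch with Hugging Face transformers, assuming the checkpoint is published under the repo id QwenPilot/FIPO_32B and follows the standard Qwen2.5 chat template; both assumptions should be checked against the actual repository before use.

```python
# Illustrative only: assumes the checkpoint is available under this repo id
# and exposes a standard chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "QwenPilot/FIPO_32B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Find all integer solutions of x^2 - 5y^2 = 4."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Leave generous headroom: the card reports chains of thought beyond 10k tokens.
outputs = model.generate(inputs, max_new_tokens=16384, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Because reasoning chains routinely exceed 10,000 tokens, `max_new_tokens` should be set well above typical chat defaults, within the model's 32k context window.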