xx18/Composition-RL-4B

Text Generation | Concurrency Cost: 1 | Model Size: 4B | Quant: BF16 | Ctx Length: 32k | Published: Feb 12, 2026 | Architecture: Transformer

xx18/Composition-RL-4B is a 4-billion-parameter language model fine-tuned from Qwen3-4B-Base with the Composition-RL framework. It targets reasoning in mathematical and scientific domains by training on automatically composed prompts that are more complex than their source problems yet remain verifiable. Training uses Reinforcement Learning with Verifiable Rewards (RLVR), and the composed prompts are intended to keep the training signal informative.

Overview of Composition-RL-4B

xx18/Composition-RL-4B is a 4-billion-parameter model fine-tuned from Qwen3-4B-Base. It uses the Composition-RL framework, a data-efficient Reinforcement Learning with Verifiable Rewards (RLVR) approach, to strengthen its reasoning abilities.

Key Capabilities and Training

The core idea of Composition-RL is to automatically compose multiple verifiable problems into a single, more complex prompt. This addresses the problem of "too-easy" prompts in RL training: prompts the model already solves reliably contribute little training signal, whereas composed prompts stay challenging and informative. The model was trained on the MATH-Composition-199K dataset.
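To make the idea of composing verifiable problems concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the Composition-RL implementation: the `Problem` class, the `compose` and `reward` functions, the prompt wording, and the "Answer:" output convention are all hypothetical. The sketch simply joins problems into one prompt and gives a binary reward only when every sub-answer is correct.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    """Hypothetical container: a question plus its checkable final answers."""
    question: str
    answers: tuple[str, ...]


def compose(problems: list[Problem]) -> Problem:
    """Merge several verifiable problems into one harder, still-verifiable one."""
    if len(problems) < 2:
        raise ValueError("need at least two problems to compose")
    instruction = (
        "Solve every problem below. Give the final answers in order, "
        "separated by commas, on a last line starting with 'Answer:'."
    )
    parts = [f"Problem {i}: {p.question}" for i, p in enumerate(problems, start=1)]
    # The composed ground truth is the ordered concatenation of all answers.
    answers = tuple(a for p in problems for a in p.answers)
    return Problem(question="\n".join([instruction, *parts]), answers=answers)


def reward(problem: Problem, response: str) -> float:
    """Binary verifiable reward: 1.0 only if every sub-answer matches exactly.

    Assumes answers contain no commas; a real verifier would need a more
    robust parser and mathematical equivalence checking.
    """
    lines = [ln.strip() for ln in response.strip().splitlines() if ln.strip()]
    if not lines or not lines[-1].startswith("Answer:"):
        return 0.0
    predicted = tuple(a.strip() for a in lines[-1][len("Answer:"):].split(","))
    return 1.0 if predicted == problem.answers else 0.0


if __name__ == "__main__":
    composed = compose([
        Problem("What is 2 + 3?", ("5",)),
        Problem("What is 4 * 6?", ("24",)),
    ])
    print(composed.question)
    print(reward(composed, "2 + 3 = 5 and 4 * 6 = 24.\nAnswer: 5, 24"))
```

Because the reward requires all sub-answers to be right, a composed prompt is at least as hard as its hardest component, while remaining automatically checkable.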

What Makes This Model Different?

Standard fine-tuning and RL methods can see diminishing returns once most training prompts become too easy for the model. Composition-RL counters this by increasing prompt complexity through composition, so training continues to provide useful signal. This leads to improved performance, particularly in:

  • Mathematical reasoning
  • Scientific problem-solving
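
One way to see why too-easy prompts stop being informative is to look at a group-normalized advantage, as used in GRPO-style RL. This is an assumption made for illustration: the text above does not say which RL algorithm or advantage estimator the model was trained with, and the function names below are hypothetical. When every sampled response to a prompt earns the same reward, all advantages are zero and the prompt contributes no gradient.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the group sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_informative(rewards: list[float]) -> bool:
    """A prompt carries signal only if its sampled rewards are not all equal."""
    return len(set(rewards)) > 1


if __name__ == "__main__":
    too_easy = [1.0, 1.0, 1.0, 1.0]   # every rollout already correct
    harder = [1.0, 0.0, 0.0, 0.0]     # e.g. a composed prompt, mixed outcomes
    print(is_informative(too_easy), group_advantages(too_easy))
    print(is_informative(harder), group_advantages(harder))
```

Under this assumption, a harder composed prompt that the model solves only some of the time yields nonzero advantages, which is the kind of informative signal the framework aims to preserve.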

This approach is detailed in the paper: Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models.