ThinkTwice-Qwen3-4B-Instruct Overview

This model, developed by Difan Jiao and collaborators, is a 4 billion parameter instruction-tuned language model based on the Qwen3-4B-Instruct architecture. Its core innovation lies in the ThinkTwice framework, a two-phase GRPO-based training approach that jointly optimizes the model for both solving reasoning problems and refining its own solutions. This framework utilizes a binary correctness reward without requiring explicit correctness signals or critique annotations.

Key Capabilities & Differentiators

Joint Reasoning and Self-Refinement: Uniquely trained to not only generate initial solutions but also to critically evaluate and improve them in a subsequent step.
Implicit Rectify-then-Fortify Curriculum: The ThinkTwice framework fosters a natural progression where the model first corrects errors and then learns to preserve already-correct solutions.
Enhanced Mathematical Reasoning: Achieves notable performance improvements on benchmarks like AIME, outperforming standard GRPO-trained Qwen3-4B by +5 percentage points before refinement and +11.5 percentage points after one self-refinement step (pass@4).
Two-Pass Usage: Designed for a two-step inference process: first, solve the problem; second, refine the initial solution.

Ideal Use Cases

Complex Problem Solving: Excellent for tasks requiring logical deduction and multi-step reasoning, particularly in mathematical domains.
Automated Solution Verification: Suitable for applications where iterative improvement and self-correction of generated answers are beneficial.
Educational Tools: Can be leveraged in systems that help users understand and correct their reasoning processes.

Overview

ThinkTwice-Qwen3-4B-Instruct Overview

Key Capabilities & Differentiators

Ideal Use Cases

Full Model Card (README)