HerrHruby/MR_midtrain_9B_v3
HerrHruby/MR_midtrain_9B_v3 is a 9-billion-parameter instruction-tuned model based on Qwen3.5-9B and designed for meta-reasoning tasks. It runs an MR (propose exploration directions) → E (execute directions, summarize) → FA (final answer) loop using custom special tokens. It reports results on SODA2026, IMO ProofBench, and physics_papers.
MR_midtrain_9B_v3: A Meta-Reasoning Powerhouse
MR_midtrain_9B_v3 is a 9-billion-parameter instruction-tuned model, built on the Qwen3.5-9B architecture and developed by HerrHruby for meta-reasoning. It runs a three-stage loop: MR (propose exploration directions), E (execute each direction and emit a summary), and FA (formulate the final answer). The stages are delimited by custom <direction> and <summary> special tokens.
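The three stages can be sketched as a small driver loop. This is a minimal illustration, not the author's implementation: it assumes that directions and summaries arrive wrapped in paired `<direction>…</direction>` and `<summary>…</summary>` tags (the card names the tokens but does not specify closing tags), and the prompt strings and the `generate` callable are hypothetical placeholders.

```python
import re
from typing import Callable, List, Optional

# Assumption: paired open/close tags. The model card only names the
# <direction> and <summary> tokens, so the closing forms are a guess.
DIRECTION_RE = re.compile(r"<direction>(.*?)</direction>", re.DOTALL)
SUMMARY_RE = re.compile(r"<summary>(.*?)</summary>", re.DOTALL)


def extract_directions(mr_output: str) -> List[str]:
    """Return every direction proposed in the MR step, whitespace-trimmed."""
    return [d.strip() for d in DIRECTION_RE.findall(mr_output)]


def extract_summary(e_output: str) -> Optional[str]:
    """Return the first summary emitted in an E step, or None if absent."""
    match = SUMMARY_RE.search(e_output)
    return match.group(1).strip() if match else None


def run_loop(problem: str, generate: Callable[[str], str]) -> str:
    """Drive MR -> E -> FA with any text-in, text-out `generate` callable.

    The prompt wording below is hypothetical; the real prompt format is
    not documented in the card.
    """
    # MR: ask for exploration directions.
    mr_output = generate(f"Problem: {problem}\nPropose exploration directions.")
    directions = extract_directions(mr_output)

    # E: execute each direction and keep only its summary.
    summaries = []
    for direction in directions:
        e_output = generate(f"Problem: {problem}\nExplore this direction: {direction}")
        summary = extract_summary(e_output)
        if summary is not None:
            summaries.append(summary)

    # FA: produce the final answer from the collected summaries.
    bullet_list = "\n".join(f"- {s}" for s in summaries)
    fa_prompt = (
        f"Problem: {problem}\nSummaries of explored directions:\n"
        f"{bullet_list}\nGive the final answer."
    )
    return generate(fa_prompt)
```

Passing `generate` as a callable keeps the loop independent of the serving backend, so the same driver could wrap a local model or a remote endpoint.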
Key Capabilities & Architecture
- Meta-Reasoning Loop: Runs an MR→E→FA loop in which the model proposes exploration directions, executes and summarizes each one, and then writes the final answer.
- Base Model: Utilizes Qwen/Qwen3.5-9B as its foundation.
- Architecture: Employs Qwen3_5ForConditionalGeneration, which is optimized for compatibility with both vLLM serving and verl Megatron (mbridge) RL environments, ensuring broad deployment flexibility.
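Since the architecture is described as vLLM-compatible, offline generation might look like the sketch below. It assumes an installed vLLM version that supports this architecture, which is not verified here. The `max_tokens` value and the helper names are illustrative placeholders; the temperature of 1.0 is the sampling setting the model card recommends.

```python
MODEL_ID = "HerrHruby/MR_midtrain_9B_v3"


def build_sampling_kwargs(max_tokens: int = 4096) -> dict:
    """Keyword arguments for vllm.SamplingParams.

    temperature=1.0 follows the model card's recommendation;
    max_tokens is an arbitrary placeholder, not a documented setting.
    """
    return {"temperature": 1.0, "max_tokens": max_tokens}


def generate_once(prompt: str) -> str:
    """Load the model with vLLM and return one completion for `prompt`."""
    # Imported lazily so the helper above works without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=MODEL_ID)
    params = SamplingParams(**build_sampling_kwargs())
    outputs = llm.generate([prompt], params)
    # Each RequestOutput holds one or more completions; take the first.
    return outputs[0].outputs[0].text
```

Because the model is instruction-tuned, raw prompts may need the tokenizer's chat template applied first; vLLM's `LLM.chat` method is one way to do that.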
Performance Highlights
MR_midtrain_9B_v3 reports the following scores on reasoning benchmarks:
- SODA2026: Achieves a mean score of 0.478.
- IMO ProofBench: Records a pass@1 score of 0.547 and a best@3 score of 0.678 (evaluated by an official Gemini-3.1-Pro judge).
- physics_papers: Attains a pass@1 score of 0.679.
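The card does not define pass@1 or best@3, so the sketch below uses one common reading as an assumption: each problem is attempted several times and graded on a 0 to 1 scale, pass@1 is the average score of a single attempt, and best@k is the average over problems of the highest score among the first k attempts. The function names and the toy scores are hypothetical.

```python
from statistics import mean
from typing import List


def pass_at_1(scores: List[List[float]]) -> float:
    """Expected score of one attempt: mean over problems of the mean attempt score."""
    return mean(mean(attempts) for attempts in scores)


def best_at_k(scores: List[List[float]], k: int = 3) -> float:
    """Mean over problems of the best score among the first k attempts."""
    return mean(max(attempts[:k]) for attempts in scores)


# Toy data: two problems, two graded attempts each (made-up values).
toy_scores = [[1.0, 0.0], [0.0, 0.0]]
print(pass_at_1(toy_scores), best_at_k(toy_scores, k=2))
```

Under this reading best@k can never fall below pass@1, which is consistent with the reported 0.678 versus 0.547 on IMO ProofBench.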
These results indicate that the v3 SFT checkpoint matches or surpasses the best v2 RL checkpoints even before any v3 RL fine-tuning. A sampling temperature of 1.0 is recommended; lower temperatures can lead to repetitive outputs in the E step.
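One plausible reason lower temperatures encourage repetition is that dividing the logits by a temperature below 1 concentrates probability on the top token. The sketch below illustrates this general sampling behaviour with made-up logits; it is not a measurement on this model.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate tokens.
example_logits = [2.0, 1.0, 0.5]
for t in (1.0, 0.5, 0.2):
    probs = softmax_with_temperature(example_logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

As the temperature drops, the highest-logit token takes a larger share of the probability mass, so sampling behaves more like greedy decoding.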