HerrHruby/MR_midtrain_9B_v3

VISIONConcurrency Cost:1Model Size:9BQuant:FP8Ctx Length:32kTool Calling:SupportedPublished:Jun 26, 2026License:apache-2.0Architecture:Transformer Open Weights Cold

HerrHruby/MR_midtrain_9B_v3 is a 9 billion parameter instruction-tuned model based on Qwen3.5-9B, specifically designed for meta-reasoning tasks. It implements a unique MR (propose exploration directions) → E (execute directions, summarize) → FA (final answer) loop using custom special tokens. This model excels at complex problem-solving, demonstrating strong performance on benchmarks like SODA2026, IMO ProofBench, and physics_papers.

Loading preview...

MR_midtrain_9B_v3: A Meta-Reasoning Powerhouse

MR_midtrain_9B_v3 is a 9 billion parameter instruction-tuned model, built upon the Qwen3.5-9B architecture, specifically engineered for advanced meta-reasoning. Developed by HerrHruby, this model integrates a unique three-stage meta-reasoning loop: MR (propose exploration directions), E (execute each direction and emit a summary), and FA (formulate the final answer). This process is facilitated by custom <direction> and <summary> special tokens.

Key Capabilities & Architecture

  • Meta-Reasoning Loop: Implements a sophisticated MR→E→FA loop for complex problem-solving, allowing the model to dynamically explore and synthesize information.
  • Base Model: Utilizes Qwen/Qwen3.5-9B as its foundation.
  • Architecture: Employs Qwen3_5ForConditionalGeneration, which is optimized for compatibility with both vLLM serving and verl Megatron (mbridge) RL environments, ensuring broad deployment flexibility.

Performance Highlights

MR_midtrain_9B_v3 demonstrates strong performance across challenging benchmarks, indicating its proficiency in reasoning tasks:

  • SODA2026: Achieves a mean score of 0.478.
  • IMO ProofBench: Records a pass@1 score of 0.547 and a best@3 score of 0.678 (evaluated by an official Gemini-3.1-Pro judge).
  • physics_papers: Attains a pass@1 score of 0.679.

These results indicate that v3 SFT matches or surpasses the best v2 RL checkpoints even before any v3 RL fine-tuning. Users should note that a temperature of 1.0 is recommended for optimal performance, as lower temperatures can lead to repetitive outputs in the 'E' step.