alibidaran/Qwen_COG_Thinker_Merged

TEXT GENERATIONConcurrency Cost:1Model Size:3.1BQuant:BF16Ctx Length:32kPublished:Mar 31, 2026License:apache-2.0Architecture:Transformer0.0K Open Weights Cold

alibidaran/Qwen_COG_Thinker_Merged is a fine-tuned Qwen2.5 model developed by alibidaran, specifically trained with Group Relative Policy Optimization (GRPO) to enforce structured reasoning. Unlike models that simulate reasoning via pattern matching, this model builds a verifiable cognitive path through mandatory planning, monitoring, and evaluation stages. It is designed for tasks requiring explicit, step-by-step logical deductions and self-verification, ensuring responses adhere to a strict reasoning protocol.

Loading preview...

Qwen_COG_Thinker_Merged: Structured Reasoning with GRPO

This model, developed by alibidaran, is a fine-tuned version of Qwen2.5 that leverages Group Relative Policy Optimization (GRPO) to enforce a unique structured reasoning process. Instead of merely pattern-matching, it constructs a "real cognitive path" for every response, ensuring verifiable, step-by-step logic.

Key Capabilities & Differentiators

  • Enforced Structured Reasoning: Responses are mandated to follow a three-stage protocol: <planning>, <monitoring>, and <evaluation>, baked in via RL, not just a bolted-on chain-of-thought.
  • Self-Verification: The model performs internal verification before committing to an answer, with invalid structures leading to rejected responses.
  • Strict Output Format: Adheres to a precise system prompt that dictates the structure, minimum reasoning lengths, and forbids generic phrases, ensuring explicit calculations and logical deductions.
  • Isolated Final Answer: The ultimate output is presented cleanly in an <output> section, separate from the detailed reasoning.

Performance Insights

Evaluated on a subset of MMLU, the model demonstrates varying accuracy across subjects, including 50% in College Mathematics, 67% in Medicine, and 83% in Psychology, reflecting its ability to apply structured reasoning to diverse academic and professional domains.

Ideal Use Cases

This model is particularly well-suited for applications where:

  • Verifiable Reasoning is Critical: Tasks requiring transparent, step-by-step logical deductions, calculations, or problem-solving.
  • Strict Output Adherence is Necessary: Scenarios where the response format must be rigorously controlled and validated.
  • Reduced Hallucinations from Pattern Matching: When a deeper, more explicit reasoning process is preferred over superficial pattern recognition.