Momix-44/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

  • Tags: Vision, Open Weights
  • Concurrency Cost: 1
  • Model Size: 9B
  • Quant: FP8
  • Ctx Length: 32k
  • Tool Calling: Supported
  • Published: Mar 31, 2026
  • License: apache-2.0
  • Architecture: Transformer

voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning is a 9 billion parameter causal language model developed by voidful and based on Qwen/Qwen3.5-9B. It is optimized for reasoning-oriented multiple-choice performance and reports improved scores on ARC-Challenge, ARC-Easy, and BoolQ. It aims to strengthen structured reasoning and calibrated answer selection while keeping strong general capabilities, which makes it suitable for applications that need robust zero-shot reasoning.

Model Overview

voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning is a 9 billion parameter causal language model developed by voidful and fine-tuned from the Qwen/Qwen3.5-9B base model. Its primary goal is to improve reasoning-oriented multiple-choice performance while preserving strong general capabilities.
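A minimal loading sketch follows, assuming the checkpoint can be loaded as a standard causal LM through the Hugging Face `transformers` auto classes. This is an assumption, not something the card confirms: the repo id below is taken from the card text (the page title lists a different namespace), and the helper names `load_model` and `generate` are hypothetical. `device_map="auto"` additionally assumes the `accelerate` package is installed.

```python
# Repo id as written in the card text; adjust if the weights live elsewhere.
MODEL_ID = "voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning"


def load_model(model_id: str = MODEL_ID):
    """Load tokenizer and model, assuming a standard causal-LM checkpoint."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",  # use the dtype stored in the checkpoint
        device_map="auto",   # place weights on available devices (needs accelerate)
    )
    return tokenizer, model


def generate(tokenizer, model, prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy-decode a continuation and return only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Slice off the prompt tokens so only the continuation is decoded.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `tokenizer, model = load_model()` downloads the weights on first use, so it is best run on a machine with enough memory for a 9B model.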

Key Capabilities & Performance

In zero-shot evaluation, this model achieves the best overall aggregate performance among the three models compared: itself and the two baselines, Qwen/Qwen3.5-9B and DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT.

  • Enhanced Reasoning: Shows clear gains on reasoning benchmarks such as ARC-Challenge (0.5836 acc_norm) and ARC-Easy (0.7950 acc_norm).
  • Strong BoolQ Performance: Outperforms DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT on BoolQ (0.8792 acc).
  • Overall Aggregate Score: Achieves a representative 7-task average score of 0.7133 and a macro average over 12 metrics of 0.6749, indicating strong overall performance in its class.
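As a rough illustration of how such aggregate figures are formed, the sketch below takes an unweighted (macro) mean over per-task metrics. Only the three scores quoted above are used; the card does not list the remaining tasks or metrics, so the result is not the 7-task or 12-metric average. The dictionary keys and the `macro_average` helper are hypothetical names.

```python
from statistics import mean

# Scores quoted in this card; the other tasks and metrics are not listed here.
reported = {
    "arc_challenge/acc_norm": 0.5836,
    "arc_easy/acc_norm": 0.7950,
    "boolq/acc": 0.8792,
}


def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean: every metric counts equally, regardless of dataset size."""
    if not scores:
        raise ValueError("no scores to average")
    return mean(scores.values())


subset_avg = macro_average(reported)
print(f"{subset_avg:.4f}")  # 0.7526 (mean of the three quoted scores only)
```

A macro average weights each metric equally, so a small benchmark influences the aggregate as much as a large one.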

Trade-offs & Limitations

Despite its reasoning gains, the model is slightly weaker than some baselines on HellaSwag and OpenBookQA. Its gains are more pronounced on structured reasoning and calibrated answer selection than on tasks that rely heavily on narrative continuation or broad commonsense completion. The evaluation covers only a limited set of zero-shot benchmarks; instruction following, long-context reasoning, coding, and multilingual performance still need to be assessed.

Ideal Use Cases

This model is a strong choice for developers focused on:

  • Reasoning-oriented zero-shot performance in a compact 9B model.
  • Applications requiring structured answer selection and multiple-choice QA.
  • Tasks involving science and reading-style benchmarks where calibrated responses are crucial.
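For multiple-choice QA, zero-shot answer selection is commonly done by scoring each candidate answer's log-likelihood and picking the best one. The sketch below assumes the `acc` and `acc_norm` metrics quoted in this card follow the usual convention from lm-evaluation-harness, where `acc_norm` divides each log-likelihood by the answer's byte length; the card does not say which harness was used. The `pick_answer` helper and the log-likelihood values are hypothetical.

```python
from typing import Sequence


def pick_answer(choices: Sequence[str], logliks: Sequence[float],
                normalize: bool = True) -> int:
    """Return the index of the best-scoring choice.

    normalize=False ranks by raw log-likelihood (acc-style);
    normalize=True divides by UTF-8 byte length (acc_norm-style).
    """
    if not choices or len(choices) != len(logliks):
        raise ValueError("choices and logliks must be non-empty and equal length")

    def score(i: int) -> float:
        if not normalize:
            return logliks[i]
        # Guard against empty strings to avoid division by zero.
        return logliks[i] / max(len(choices[i].encode("utf-8")), 1)

    return max(range(len(choices)), key=score)


# Hypothetical log-likelihoods for two candidate answers.
choices = ["gravity", "the electromagnetic force"]
logliks = [-6.0, -10.0]

raw_pick = pick_answer(choices, logliks, normalize=False)   # index 0
norm_pick = pick_answer(choices, logliks, normalize=True)   # index 1
```

Raw scoring favors the short answer here, while length normalization favors the longer one, which is why the two metrics can disagree on the same model.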