ManniX-ITA/Qwen3.5-4B-M1-Dare-Ties

Vision | Concurrency Cost: 1 | Model Size: 4.5B | Quant: BF16 | Ctx Length: 32k | Tool Calling: Supported | Published: Apr 30, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Cold

ManniX-ITA/Qwen3.5-4B-M1-Dare-Ties is a 4.5-billion-parameter language model based on the Qwen3.5-4B architecture, created by ManniX-ITA. It is a vanilla DARE-TIES merge of Qwen3.5-4B with two distilled fine-tunes and serves as the baseline in a comparative study on coding benchmarks. It has a 32,768-token context length and is part of an investigation into whether merge recipes and importance-signal weighting can improve coding performance.

Overview

ManniX-ITA/Qwen3.5-4B-M1-Dare-Ties is a 4.5 billion parameter model derived from the Qwen3.5-4B base, developed by ManniX-ITA. It represents the M1 variant in a series of models exploring different merging techniques and importance-signal weighting for coding tasks. This specific model is a vanilla DARE-TIES merge, combining two distinct distillation fine-tunes: Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled-v2 and Crownelius/Crow-4B-Opus-4.6-Distill-Heretic_Qwen3.5, with weights of 0.55 and 0.45 respectively.
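
To make the merge recipe concrete, the following is a minimal NumPy sketch of a DARE-TIES-style merge on a single toy tensor. It is not the author's merge script: the function name `dare_ties_merge`, the `density` value, and the `seed` are assumptions chosen for illustration, and only the 0.55 / 0.45 weights come from the description above. The sketch also skips the optional weight normalization that some merge tools apply after sign election.

```python
import numpy as np

def dare_ties_merge(base, finetuned, weights, density=0.5, seed=0):
    """Toy DARE-TIES merge of several fine-tuned tensors onto one base tensor.

    density is the fraction of each task vector that survives the random
    drop (an assumed value, not taken from the model card).
    """
    rng = np.random.default_rng(seed)
    weighted_deltas = []
    for tuned, w in zip(finetuned, weights):
        delta = tuned - base                           # task vector
        keep = rng.random(delta.shape) < density       # DARE: random drop mask
        delta = np.where(keep, delta / density, 0.0)   # rescale the survivors
        weighted_deltas.append(w * delta)

    stacked = np.stack(weighted_deltas)
    # TIES sign election: the sign with the larger summed weighted mass wins.
    elected = np.sign(stacked.sum(axis=0))
    agrees = (np.sign(stacked) == elected) & (stacked != 0)
    # Keep only entries that agree with the elected sign, then sum them.
    merged_delta = np.where(agrees, stacked, 0.0).sum(axis=0)
    return base + merged_delta

# Hypothetical stand-ins for one weight tensor from each checkpoint.
base = np.zeros(4)
source_a = np.array([1.0, 1.0, -1.0, 0.5])   # plays the role of Jackrong-v2
source_b = np.array([2.0, -1.0, -2.0, 0.5])  # plays the role of Crow-4B
merged = dare_ties_merge(base, [source_a, source_b], [0.55, 0.45], density=1.0)
print(merged)
```

With `density=1.0` nothing is dropped, so the example isolates the sign-election step: where the two sources disagree in sign, only the contribution with the larger weighted magnitude is kept.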

Performance on Coding Benchmarks

This model was evaluated against its base and source models using llama-server and lm_eval on HumanEval and MBPP benchmarks. Key findings include:

  • HumanEval pass@1: The M1 variant achieved 51.22%, which is 9.15 percentage points below the Qwen3.5-4B base model's 60.37%. The study noted that no merge in the comparison surpassed the base model on HumanEval.
  • MBPP pass@1: M1 scored 47.00%, which is 2.00 points above one source (Jackrong-v2 at 45.00%) and 1.20 points below the other (Crow-4B at 48.20%). Other merge variants (M4-v2, M5) scored higher on MBPP, indicating that merging can improve MBPP capability, often at the expense of HumanEval scores.
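
The comparisons in the list above reduce to simple differences in percentage points. The short sketch below recomputes them from the reported pass@1 scores; the dictionary layout and the helper name `delta_vs` are illustrative only, and no scores beyond those quoted above are included.

```python
# pass@1 scores in percent, exactly as reported above
humaneval = {"Qwen3.5-4B base": 60.37, "M1": 51.22}
mbpp = {"Jackrong-v2": 45.00, "Crow-4B": 48.20, "M1": 47.00}

def delta_vs(scores, model, reference):
    """Return model minus reference in percentage points, rounded to 2 places."""
    return round(scores[model] - scores[reference], 2)

print(delta_vs(humaneval, "M1", "Qwen3.5-4B base"))  # -9.15
print(delta_vs(mbpp, "M1", "Jackrong-v2"))           # 2.0
print(delta_vs(mbpp, "M1", "Crow-4B"))               # -1.2
```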

Context and Purpose

Qwen3.5-4B-M1-Dare-Ties is primarily a research artifact: it is the baseline for a broader study of how merge recipes and importance-signal weighting affect model performance, particularly on coding tasks. Its 32,768-token context length is the same across the evaluated variants.