Qwen/Qwen3.5-4B-Base
Qwen/Qwen3.5-4B-Base is a 4.5 billion parameter causal language model developed by Qwen, featuring a unified vision-language foundation and an efficient hybrid architecture. It supports a native context length of 262,144 tokens and is designed for fine-tuning, in-context learning, and research. It is reported to reach cross-generational parity with Qwen3 and to outperform Qwen3-VL models across reasoning, coding, agent, and visual understanding benchmarks.
Qwen3.5-4B-Base Overview
Qwen3.5-4B-Base is a 4.5 billion parameter causal language model developed by Qwen, built upon a unified vision-language foundation. It integrates advances in multimodal learning, architectural efficiency, and scalable reinforcement learning. As a base model, it is primarily intended for fine-tuning, in-context learning experiments, and other research or development purposes, rather than direct interactive use.
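Because a base model is prompted with plain text rather than a chat interface, in-context learning usually means laying out a few worked examples and leaving the last one open for the model to complete. The following is a minimal sketch of that idea; the function name `build_few_shot_prompt`, the "Input"/"Output" labels, and the translation examples are illustrative assumptions, not a format prescribed by the model card.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(
    examples: Sequence[Tuple[str, str]],
    query: str,
    input_label: str = "Input",    # hypothetical label, choose any consistent one
    output_label: str = "Output",  # hypothetical label, choose any consistent one
) -> str:
    """Format solved examples followed by an unsolved query as plain text."""
    # Each solved example becomes a two-line block.
    blocks = [f"{input_label}: {x}\n{output_label}: {y}" for x, y in examples]
    # The final block stops after the output label so the model continues from there.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)


# Illustrative English-to-French examples (made up for this sketch).
prompt = build_few_shot_prompt(
    examples=[("cheese", "fromage"), ("house", "maison")],
    query="book",
)
print(prompt)
```

The resulting string can be passed as-is to whatever text-generation interface is used to run the model; limiting the number of generated tokens or stopping at the first blank line keeps the completion to a single answer.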
Key Capabilities and Enhancements
- Unified Vision-Language Foundation: Achieves strong performance across reasoning, coding, agent tasks, and visual understanding benchmarks, demonstrating cross-generational parity with Qwen3 and surpassing Qwen3-VL models.
- Efficient Hybrid Architecture: Utilizes Gated Delta Networks combined with sparse Mixture-of-Experts for high-throughput inference with optimized latency and cost.
- Scalable RL Generalization: Features reinforcement learning scaled across millions of agent environments for robust real-world adaptability.
- Global Linguistic Coverage: Expanded support for 201 languages and dialects, facilitating inclusive worldwide deployment.
- Next-Generation Training Infrastructure: Achieves near-100% multimodal training efficiency compared to text-only training, supported by asynchronous RL frameworks.
Technical Specifications
This model has a native context length of 262,144 tokens, extensible up to 1,010,000 tokens. Its 32 layers combine Gated DeltaNet and Gated Attention mechanisms. The model's design allows for efficient LoRA-style PEFT without fine-tuning the embeddings, which is a significant saving given its larger vocabulary.
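To see why leaving the embeddings frozen matters, it helps to compare parameter counts. A LoRA adapter of rank r on a weight matrix of shape d_out x d_in adds r * (d_in + d_out) trainable parameters, while an embedding table has vocab_size * hidden_size. The sketch below runs that arithmetic; only the layer count of 32 comes from the text above. The hidden size, vocabulary size, rank, and number of adapted matrices per layer are placeholder assumptions chosen for illustration, not published figures for this model.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)


def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in a single token-embedding table."""
    return vocab_size * hidden_size


NUM_LAYERS = 32          # stated in the specifications above
HIDDEN_SIZE = 2560       # assumption: placeholder hidden dimension
VOCAB_SIZE = 250_000     # assumption: placeholder vocabulary size
RANK = 16                # assumption: a commonly used LoRA rank
ADAPTED_PER_LAYER = 4    # assumption: four square projections adapted per layer

adapter_total = NUM_LAYERS * ADAPTED_PER_LAYER * lora_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
embedding_total = embedding_params(VOCAB_SIZE, HIDDEN_SIZE)

print(f"LoRA adapters:   {adapter_total:,} trainable parameters")
print(f"Embedding table: {embedding_total:,} parameters")
print(f"Embedding table is {embedding_total / adapter_total:.1f}x larger than all adapters")
```

Under these placeholder numbers the embedding table alone dwarfs the full set of adapters, so training it would dominate the optimizer state and checkpoint size of an otherwise lightweight fine-tune.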
For more details, refer to the Qwen3.5 blog post.