lichangh20/qwen3-4b-instruct-sft-swegym-iter1
The Qwen3-4B-Instruct-2507 model by Qwen is a 4.0-billion-parameter causal language model with a native context length of 262,144 tokens. This instruction-tuned model, an update to the Qwen3-4B non-thinking mode, delivers significant improvements in instruction following, logical reasoning, mathematics, coding, and long-tail knowledge coverage across multiple languages. It excels at subjective and open-ended tasks, producing more helpful responses and higher-quality text, and is designed to operate without emitting internal 'thinking' blocks.
Overview
Qwen3-4B-Instruct-2507 is an updated 4.0 billion parameter instruction-tuned causal language model from Qwen, featuring a substantial native context length of 262,144 tokens. This iteration, building on the Qwen3-4B non-thinking mode, focuses on direct instruction following without generating internal thought processes. It has undergone significant enhancements across various domains, including general capabilities, long-tail knowledge coverage, and user alignment for subjective tasks.
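As a rough starting point, the snippet below shows one way to load the model with Hugging Face transformers and run a single chat turn. This is a minimal sketch, not an official quickstart: the checkpoint ID `Qwen/Qwen3-4B-Instruct-2507` refers to the base model on the Hub, so substitute this repository's ID to use the fine-tuned weights, and adjust the dtype and device settings to your hardware.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base checkpoint ID on the Hub; swap in this repo's ID for the fine-tuned weights.
model_name = "Qwen/Qwen3-4B-Instruct-2507"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # spread layers across available devices
)

messages = [{"role": "user", "content": "Explain the difference between a list and a tuple in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# No 'thinking' blocks are expected: the model answers directly.
output_ids = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```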
Key Capabilities
- General Capabilities: Demonstrates significant improvements in instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage.
- Long-Context Understanding: Features enhanced capabilities in processing and understanding long contexts up to 256K tokens.
- Multilingualism: Shows substantial gains in long-tail knowledge coverage across multiple languages.
- Subjective Task Alignment: Markedly better alignment with user preferences in subjective and open-ended tasks, leading to more helpful and higher-quality text generation.
- Agentic Use: Excels at tool calling, with Qwen-Agent recommended for optimal performance; a minimal usage sketch follows this list.
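The sketch below shows one way to wire the model into Qwen-Agent for tool calling. It assumes the model is already served behind an OpenAI-compatible endpoint at `http://localhost:8000/v1` (for example via vLLM or SGLang); the endpoint URL and the `code_interpreter` tool choice are illustrative, not prescriptive.

```python
from qwen_agent.agents import Assistant

# Assumed local OpenAI-compatible endpoint serving the model (e.g., vLLM).
llm_cfg = {
    "model": "Qwen3-4B-Instruct-2507",
    "model_server": "http://localhost:8000/v1",
    "api_key": "EMPTY",
}

# The built-in tool chosen here is illustrative; Qwen-Agent also accepts custom tools.
bot = Assistant(llm=llm_cfg, function_list=["code_interpreter"])

messages = [{"role": "user", "content": "Compute the 20th Fibonacci number with code."}]
responses = []
for responses in bot.run(messages=messages):
    pass  # bot.run streams incremental response batches; keep the final one
print(responses[-1]["content"])
```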
Performance Highlights
The model performs strongly across benchmarks, often outperforming its predecessor, Qwen3-4B Non-Thinking, and in some cases the larger Qwen3-30B-A3B Non-Thinking as well as GPT-4.1-nano-2025-04-14. Notable results include:
- Knowledge: Achieves 69.6 on MMLU-Pro and 84.2 on MMLU-Redux.
- Reasoning: Scores 47.4 on AIME25 and 80.2 on ZebraLogic.
- Coding: Reaches 35.1 on LiveCodeBench v6 and 76.8 on MultiPL-E.
- Alignment: Scores 83.5 on Creative Writing v3 and 83.4 on WritingBench.
Good For
- Applications requiring strong instruction following and logical reasoning.
- Tasks benefiting from extensive context understanding (up to 262,144 tokens; see the serving sketch after this list).
- Multilingual applications and tasks requiring broad knowledge coverage.
- Subjective and open-ended text generation where user preference alignment is crucial.
- Agentic workflows and tool-calling scenarios.
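To actually exercise the 262,144-token window in the long-context scenarios above, the model typically needs to be served with an explicit maximum length. Below is a hedged sketch using vLLM's offline Python API; note that `max_model_len=262144` reserves KV cache for the full window and may exceed a single GPU's memory, so reduce it to fit your hardware. The sampling values shown are illustrative defaults, not tuned recommendations.

```python
from vllm import LLM, SamplingParams

# Full 262,144-token window; lower max_model_len if the KV cache does not fit in GPU memory.
llm = LLM(model="Qwen/Qwen3-4B-Instruct-2507", max_model_len=262144)

params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=1024)
messages = [{"role": "user", "content": "Summarize the key findings of the report below.\n\n<long document here>"}]

outputs = llm.chat(messages, params)
print(outputs[0].outputs[0].text)
```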