clzoro/Qwen3.5-9B-Claude-Distill-v2
clzoro/Qwen3.5-9B-Claude-Distill-v2 is a 9 billion parameter language model, fine-tuned by clzoro, based on Qwen3.5-9B with a 32K context length. It leverages Claude-generated training data to significantly enhance instruction-following and reasoning capabilities, particularly excelling in mathematical and coding tasks. This model is optimized for complex problem-solving and precise instruction execution, making it suitable for applications requiring strong logical inference.
Loading preview...
Overview
clzoro/Qwen3.5-9B-Claude-Distill-v2 is a 9 billion parameter model built upon the Qwen3.5-9B base, fine-tuned using a comprehensive dataset of Claude-generated conversations. This full supervised fine-tuning (SFT) process, utilizing 125,175 conversation pairs, aims to imbue the model with advanced instruction-following and reasoning skills while retaining the original capabilities of its base. The training data is heavily weighted towards math (65.5%) and code (15.1%), indicating a strong specialization in these domains.
Key Capabilities
- Enhanced Instruction Following: Improved ability to understand and execute complex instructions due to distillation from Claude data.
- Strong Reasoning: Demonstrates advanced logical inference, particularly in mathematical and coding contexts.
- Default Thinking Mode: Operates with a "thinking mode" by default, generating intermediate thoughts before the final response, which can be disabled for direct answers.
- High Context Length: Supports a substantial context window of 32,768 tokens, beneficial for intricate problems.
Use Cases
This model is particularly well-suited for applications requiring robust performance in:
- Mathematical Problem Solving: Excels at complex calculations and proofs.
- Code Generation and Analysis: Proficient in generating and understanding code, especially Python.
- Complex Instruction Following: Ideal for tasks where precise adherence to multi-step instructions is critical.
Limitations
- Primarily trained on English and Chinese data, with limited performance in other languages.
- The heavy emphasis on math and code in training data may lead to varied performance in other domains.
- As a distilled model, it may inherit biases from the Claude-generated training data and has not undergone explicit safety alignment (e.g., RLHF).