Overview
This model, mohtani777/Qwen3_4B_SFT_DPOv3_agent_v0_LR1E7, is a 4-billion-parameter language model built on the Qwen/Qwen3-4B-Instruct-2507 base. It was fine-tuned by mohtani777 using Direct Preference Optimization (DPO) via the Unsloth library. Training ran for 5 epochs with a learning rate of 1e-07 and a DPO beta of 0.05, with the goal of aligning responses more closely with preferred outputs.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning abilities.
- Structured Responses: Focuses on generating higher-quality, well-structured outputs.
- DPO Fine-tuning: Leverages Direct Preference Optimization for better alignment with human preferences.
- Full-Merged Weights: Ships as fully merged 16-bit weights, so no separate adapter loading is required.
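Because the weights are fully merged, the model can be loaded directly with the Hugging Face `transformers` API, with no PEFT/LoRA step. A minimal sketch; choices such as `torch_dtype` and `device_map` are assumptions, not part of the release:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mohtani777/Qwen3_4B_SFT_DPOv3_agent_v0_LR1E7"

# Merged 16-bit weights: loads like any plain causal-LM checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed dtype; the card only states "16-bit"
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize chain-of-thought prompting in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Note that generation length plus prompt should stay within the 1024-token sequence length listed under Technical Details.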
Good For
- Applications requiring models with refined reasoning skills.
- Use cases where structured and high-quality conversational responses are critical.
- Developers looking for a Qwen3-based model with DPO-enhanced performance in agentic or instructional contexts.
Technical Details
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Optimization Method: DPO
- Max Sequence Length: 1024
- License: MIT (inherited from the dataset terms); use must also comply with the original base model's license.
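The reported hyperparameters can be expressed as a TRL `DPOConfig` fragment. This is a sketch under the assumption that training used TRL's `DPOTrainer` (which Unsloth builds on); only the values stated in this card are reproduced here, and everything else is left at library defaults:

```python
from trl import DPOConfig

# Hyperparameters stated in this card; all other settings are assumptions/defaults.
config = DPOConfig(
    num_train_epochs=5,
    learning_rate=1e-7,
    beta=0.05,        # DPO preference-strength coefficient
    max_length=1024,  # max sequence length from the card
)
```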