mohtani777/Qwen3_4B_SFT_DPO_agent_v0
mohtani777/Qwen3_4B_SFT_DPO_agent_v0 is a 4 billion parameter language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). This model is specifically optimized to enhance reasoning capabilities, particularly Chain-of-Thought, and improve the quality of structured responses. It is designed for tasks requiring aligned and coherent output based on preferred datasets.
Loading preview...
Model Overview
This model, mohtani777/Qwen3_4B_SFT_DPO_agent_v0, is a 4 billion parameter language model derived from the Qwen3-4B-Instruct-2507 base model. It has undergone fine-tuning using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged, eliminating the need for adapter loading.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning processes.
- Structured Response Quality: Focuses on generating higher quality and more structured outputs.
- Preference Alignment: Aligned with preferred outputs through DPO training over 5 epochs.
Training Details
The model was trained with a learning rate of 1e-06, a beta of 0.05, and a maximum sequence length of 1024. The training utilized the u-10bei/dpo-dataset-qwen-cot dataset, which is licensed under the MIT License. Users must also adhere to the original base model's license terms.
Use Cases
This model is particularly suitable for applications where precise reasoning and well-structured, aligned responses are critical, such as complex question-answering or agentic workflows requiring coherent thought processes.