Model Overview
sfutenma/dpo-qwen3_4b-cot-merged_v260302-093614 is a 4-billion-parameter language model based on the Qwen3 architecture. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library, starting from sfutenma/lora_structeval_t_qwen3_4b_v260228-172650. This release ships the fully merged 16-bit weights, so no LoRA adapter loading is required.
Key Capabilities
- Enhanced Reasoning: Optimized through DPO to improve Chain-of-Thought (CoT) reasoning abilities.
- Structured Response Quality: Specifically aligned to produce higher quality, structured outputs based on a preference dataset.
- Efficient Deployment: Provided as a fully merged model, ready for direct use with transformers without additional configuration.
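Because the weights are fully merged, the model loads like any standard causal LM. The sketch below is a minimal, hedged usage example (the generation parameters are illustrative defaults, not values recommended by this card):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "sfutenma/dpo-qwen3_4b-cot-merged_v260302-093614"

def build_messages(question: str) -> list[dict]:
    # Single-turn chat in the standard role/content message format.
    return [{"role": "user", "content": question}]

def generate(question: str, max_new_tokens: int = 512) -> str:
    # Load the merged 16-bit weights directly; no adapter attach step is needed.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    # Let the tokenizer's own chat template format the prompt.
    prompt = tokenizer.apply_chat_template(
        build_messages(question), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the echoed prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )
```

For example, `generate("Explain step by step why 17 is prime.")` returns the model's Chain-of-Thought style answer as a plain string.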
Training Details
The model was trained for 5 epochs with a learning rate of 1e-6 and a DPO beta of 0.1, using a maximum sequence length of 768 tokens during DPO training. The base model for this fine-tuning was unsloth/Qwen3-4B-Instruct-2507, and the preference data came from u-10bei/dpo-dataset-qwen-cot.
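For intuition about what the beta of 0.1 controls, the standard per-example DPO loss can be sketched in plain Python. The log-probability values in the usage note are made-up illustrative numbers, not outputs of this model:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Beta scales how strongly the policy is pushed away from the reference
    model; the 0.1 default here matches the value reported for this run.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # Numerically plain logistic; fine for illustration at these magnitudes.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy raises the chosen response's log-probability relative to the reference more than the rejected one's, the loss drops; e.g. `dpo_loss(-1, -5, -2, -4)` is smaller than `dpo_loss(-2, -4, -1, -5)`.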
Usage Considerations
This model is well suited to tasks where improved reasoning and structured, aligned responses are critical. Users should be aware that the model is released under the MIT License, in keeping with the dataset's terms, and that compliance with the original base model's license terms is also required.