Model Overview
ykawasaki/qwen3-4b-dpo-qwen-cot-merged-v7 is a 4-billion-parameter language model built on the Qwen/Qwen3-4B-Instruct-2507 base model. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library, specifically targeting improved Chain-of-Thought (CoT) reasoning and higher-quality structured responses.
Key Characteristics
- Base Model: Qwen/Qwen3-4B-Instruct-2507, a 4 billion parameter Qwen3 variant.
- Fine-tuning Method: Direct Preference Optimization (DPO) for aligning responses with preferred outputs.
- Adapter Integration: Merged with ykawasaki/qwen3-4b-structured-output-lora-v12 prior to DPO, which provides the structured-output behavior.
- Training Objective: Optimized on a preference dataset to improve Chain-of-Thought reasoning and the quality of structured responses.
- Configuration: Trained for 3 epochs with a learning rate of 1e-07, beta of 0.1, and a maximum sequence length of 1024. LoRA configuration includes r=16 and alpha=32.
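The reported hyperparameters map onto a TRL-style DPO setup roughly as follows. This is a minimal sketch, assuming the `trl` and `peft` packages; it is not the exact Unsloth training script used for this model, and the output path is a placeholder.

```python
from peft import LoraConfig
from trl import DPOConfig

# LoRA adapter settings reported for this model.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    task_type="CAUSAL_LM",
)

# DPO settings reported for this model; all other fields are TRL defaults.
training_args = DPOConfig(
    output_dir="qwen3-4b-dpo",  # placeholder path
    num_train_epochs=3,
    learning_rate=1e-7,
    beta=0.1,         # strength of the DPO KL regularization toward the reference model
    max_length=1024,  # maximum sequence length (prompt + completion)
)
```

These objects would then be passed to a `DPOTrainer` together with the base model and a preference dataset of chosen/rejected response pairs.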
Usage and Licensing
This model is provided as full-merged 16-bit weights, allowing direct use with the transformers library without requiring separate adapter loading. It is licensed under the MIT License, consistent with its training data, and users must also adhere to the original base model's license terms.
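Because the weights are fully merged, the model loads like any standard causal LM. The following is a minimal sketch using `transformers`; the prompt handling and generation settings are illustrative defaults, not values prescribed by this model card (`device_map="auto"` additionally requires the `accelerate` package).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "ykawasaki/qwen3-4b-dpo-qwen-cot-merged-v7"

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Load the merged 16-bit weights and generate a single response."""
    # No PeftModel / adapter loading is needed: the LoRA weights are already merged.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="auto",   # keep the shipped 16-bit precision
        device_map="auto",
    )
    # Apply the Qwen3 chat template carried by the tokenizer.
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens and decode only the newly generated text.
    new_tokens = output[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("Solve step by step: what is 17 * 24?")` would download the weights on first use and return the model's CoT-style answer.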
Ideal Use Cases
- Applications requiring improved reasoning and Chain-of-Thought capabilities.
- Tasks where structured and high-quality responses are critical.
- Scenarios benefiting from a DPO-tuned model for better alignment with desired outputs.