Model Overview
TakaYamamoto/dpo-qwen-cot-merged_biya is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library, and the tuned weights were merged back into the base model and saved in full 16-bit precision, so no adapter loading is required. The primary objective of this optimization was to align the model's responses with preferred outputs, specifically enhancing its Chain-of-Thought reasoning and the overall quality of structured responses.
Key Capabilities
- Enhanced Reasoning: Optimized for Chain-of-Thought (CoT) processes, leading to improved logical progression in responses.
- Structured Output Quality: Fine-tuned to produce higher quality and more structured outputs based on preference datasets.
- Direct Use: As a fully merged model, it can be used directly with transformers without additional adapter loading.
- Qwen3-4B Base: Leverages the robust architecture and capabilities of the Qwen3-4B-Instruct-2507 base model.
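Because the weights are fully merged, loading follows the standard transformers pattern. A minimal usage sketch (assuming transformers with Qwen3 support and torch are installed; the example question is illustrative):

```python
MODEL_ID = "TakaYamamoto/dpo-qwen-cot-merged_biya"

def build_messages(question: str) -> list[dict]:
    # Qwen3 instruct models use the standard chat-message format;
    # a system prompt is optional and omitted here.
    return [{"role": "user", "content": question}]

def generate(question: str, max_new_tokens: int = 512) -> str:
    # Imports kept local so the message helper above can be used
    # without the heavyweight dependencies installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # Fully merged 16-bit weights: no PEFT/adapter loading step needed.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    prompt = tokenizer.apply_chat_template(
        build_messages(question), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

For example, `generate("A train travels 60 km in 45 minutes. What is its average speed in km/h?")` should elicit step-by-step reasoning before the final answer.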
Training Details
The model underwent 3 epochs of DPO training with a learning rate of 1e-7, a DPO beta of 0.05, and a maximum sequence length of 4096 tokens. Training used the [u-10bei/dpo-dataset-qwen-cot] preference dataset.
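The card states that DPO was run via Unsloth, which builds on TRL's DPO implementation; as a hyperparameter sketch only, the stated settings map onto TRL's `DPOConfig` roughly as follows (the output directory name is an assumption, not from the card):

```python
# Hyperparameter sketch: not the exact training script, just the
# settings stated above expressed as a TRL DPOConfig.
from trl import DPOConfig

training_args = DPOConfig(
    output_dir="dpo-qwen-cot",   # assumed name, not from the card
    num_train_epochs=3,          # 3 epochs of DPO training
    learning_rate=1e-7,          # learning rate from the card
    beta=0.05,                   # DPO beta (strength of the KL-style penalty)
    max_length=4096,             # maximum sequence length in tokens
)
# A DPOTrainer would then pair this config with the base model
# Qwen/Qwen3-4B-Instruct-2507 and the u-10bei/dpo-dataset-qwen-cot
# preference dataset (prompt / chosen / rejected pairs).
```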
Good For
- Applications requiring improved reasoning and logical coherence.
- Tasks where structured and high-quality responses are critical.
- Developers seeking a Qwen3-4B variant with enhanced alignment to preferred outputs.