KazumaTsuboi/dpo-qwen-cot-merged_v10
KazumaTsuboi/dpo-qwen-cot-merged_v10 is a 4-billion-parameter causal language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). It is optimized to strengthen Chain-of-Thought (CoT) reasoning and to improve the quality of structured responses, making it suited to applications that require logical coherence and adherence to preferred output formats.
Model Overview
This model, dpo-qwen-cot-merged_v10, was developed by KazumaTsuboi as a fine-tune of Qwen/Qwen3-4B-Instruct-2507, trained with Direct Preference Optimization (DPO) via the Unsloth library.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, leading to more logical and structured outputs.
- Improved Response Quality: Fine-tuned to align responses with preferred outputs, yielding better-structured, higher-quality generations.
- Direct Preference Optimization (DPO): Leverages DPO for alignment, a method known for effectively incorporating human preferences into model behavior.
- Full-Merged Weights: The repository ships full-merged 16-bit weights, so no adapter loading is needed at deployment (see the loading sketch below).
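Because the weights are already merged, the model loads with plain transformers and no PEFT step. Below is a minimal loading sketch; the bfloat16 dtype and `device_map="auto"` are illustrative choices on my part, not settings specified by the card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KazumaTsuboi/dpo-qwen-cot-merged_v10"

# Merged 16-bit weights: no adapter/PEFT loading step is required.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # illustrative; the repo ships 16-bit weights
    device_map="auto",           # requires accelerate; places layers automatically
)
```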
Training Details
The model was trained for 1 epoch with a learning rate of 2e-7, a DPO beta of 0.05, and a maximum sequence length of 1024 tokens. A LoRA adapter (r=16, alpha=16) was used during training and subsequently merged into the base model.
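The sketch below reconstructs this recipe with TRL's DPOTrainer and PEFT as a stand-in for the Unsloth pipeline actually used. The hyperparameters are those reported above; the preference dataset is a placeholder (the card does not name one), and the exact TRL version and remaining arguments are assumptions.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "Qwen/Qwen3-4B-Instruct-2507"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Hyperparameters reported in the card.
args = DPOConfig(
    output_dir="dpo-qwen-cot",
    num_train_epochs=1,
    learning_rate=2e-7,
    beta=0.05,
    max_length=1024,
)

# LoRA configuration reported in the card; the adapter is merged after training.
peft_config = LoraConfig(r=16, lora_alpha=16, task_type="CAUSAL_LM")

# Placeholder dataset: any set with "prompt"/"chosen"/"rejected" columns works.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()

# Fold the LoRA adapter into the base weights, matching the "full-merged" release.
merged = trainer.model.merge_and_unload()
merged.save_pretrained("dpo-qwen-cot-merged")
```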
Good For
- Applications requiring models with strong reasoning abilities.
- Scenarios where structured and high-quality responses are critical.
- Tasks benefiting from models aligned with specific output preferences.
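For reasoning-oriented use, a step-by-step prompt exercises the CoT tuning. The example below is a hypothetical sketch: the prompt wording and generation settings are mine, not from the card; only the model ID and Qwen's chat-template usage are taken from the source.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KazumaTsuboi/dpo-qwen-cot-merged_v10"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Illustrative prompt that invites step-by-step reasoning.
messages = [
    {
        "role": "user",
        "content": "A train travels 120 km in 1.5 hours. "
                   "What is its average speed in km/h? Explain step by step.",
    }
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```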