Naoto-TAJIMA/dpo-qwen-cot-merged
The Naoto-TAJIMA/dpo-qwen-cot-merged model is a 4 billion parameter Qwen3-4B-Instruct-2507 base model fine-tuned using Direct Preference Optimization (DPO) via Unsloth. It is optimized to improve reasoning capabilities, specifically Chain-of-Thought (CoT), and structured response quality. This model provides full-merged 16-bit weights, making it suitable for direct use in applications requiring enhanced logical coherence and structured output.
Loading preview...
Model Overview
This model, dpo-qwen-cot-merged, is a 4 billion parameter language model based on Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) with the Unsloth library to enhance its performance.
Key Capabilities & Optimization
The primary objective of this DPO fine-tuning was to align the model's responses with preferred outputs, specifically focusing on:
- Improving reasoning abilities through Chain-of-Thought (CoT) processes.
- Enhancing the quality of structured responses based on a preference dataset.
Training Details
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Method: Direct Preference Optimization (DPO)
- Epochs: 1
- Learning Rate: 1e-07
- Max Sequence Length: 1024
- The model provides full-merged 16-bit weights, eliminating the need for adapter loading.
Usage
This model can be directly integrated and used with the transformers library, similar to other merged models. It is licensed under the MIT License, with compliance required for the original base model's license terms.