stemask2985/dpo-qwen-cot-merged
The stemask2985/dpo-qwen-cot-merged model is a 4-billion-parameter Qwen3-based language model, fine-tuned with Direct Preference Optimization (DPO) via Unsloth. It is optimized to improve Chain-of-Thought (CoT) reasoning and the quality of structured responses, and is intended for applications that require logical coherence and adherence to preferred output formats.
Model Overview
This model, stemask2985/dpo-qwen-cot-merged, is a 4-billion-parameter variant of the Qwen3-4B-Instruct-2507 base model. It was fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library, targeting enhanced reasoning and structured output generation.
Key Capabilities & Features
- Improved Reasoning (Chain-of-Thought): Optimized to produce more coherent and logical reasoning steps in its responses.
- Enhanced Structured Output: Fine-tuned to align responses with preferred formats, improving the quality of structured data generation.
- DPO Fine-tuning: Utilizes Direct Preference Optimization for better alignment with human preferences.
- Full-Merged Weights: Distributed as a 16-bit merged model, so no adapter loading is needed at deployment (see the loading sketch after this list).
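
Because the weights are already merged, the model should load with the standard transformers API alone, with no PEFT step. The following is a minimal sketch, assuming the usual Qwen chat template; the reasoning prompt and generation settings are illustrative, not taken from the model card.

```python
# Minimal inference sketch: merged 16-bit weights, so no adapter loading.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stemask2985/dpo-qwen-cot-merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # weights ship as a 16-bit merge
    device_map="auto",
)

# Hypothetical reasoning prompt to exercise the CoT behavior
messages = [
    {"role": "user", "content": "A train covers 60 km in 45 minutes. "
                                "What is its average speed in km/h? Think step by step."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```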
Training Details
The model was trained for 1 epoch with a learning rate of 1e-07, a DPO beta of 0.1, and a maximum sequence length of 1024 tokens. Training used the u-10bei/dpo-dataset-qwen-cot preference dataset, which focuses on Chain-of-Thought examples.
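
A minimal sketch of how such a run might look with Unsloth and TRL is below. Only the hyperparameters above (1 epoch, lr 1e-07, beta 0.1, max sequence length 1024, the dataset) come from this card; the LoRA configuration, batch size, 4-bit base loading, and merge call are assumptions, not the author's exact setup.

```python
# Illustrative DPO run with Unsloth + TRL. Only the hyperparameters named
# in the card are reproduced here; all other settings are assumptions.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=1024,   # matches the training setup above
    load_in_4bit=True,     # assumption: QLoRA-style training before the 16-bit merge
)
model = FastLanguageModel.get_peft_model(  # illustrative LoRA config
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        num_train_epochs=1,              # 1 epoch
        learning_rate=1e-7,              # learning rate from the card
        beta=0.1,                        # DPO beta from the card
        per_device_train_batch_size=2,   # assumption
        output_dir="outputs",
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Merge the LoRA adapters into 16-bit weights, matching the published model
model.save_pretrained_merged("dpo-qwen-cot-merged", tokenizer,
                             save_method="merged_16bit")
```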
Ideal Use Cases
- Applications requiring robust reasoning abilities.
- Scenarios where structured and high-quality responses are critical.
- Tasks benefiting from preference-aligned language generation.