dstaka/dpo-qwen-cot-merged
The dstaka/dpo-qwen-cot-merged model is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507. It uses Direct Preference Optimization (DPO) via Unsloth to strengthen Chain-of-Thought (CoT) reasoning and improve structured response quality, making it suitable for tasks that require coherent logical progression and well-organized answers.
Model Overview
The dstaka/dpo-qwen-cot-merged model is a 4-billion-parameter language model fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. It leverages Direct Preference Optimization (DPO), implemented with the Unsloth library, to align its responses with preferred outputs. The model ships as fully merged 16-bit weights, so no adapter loading is required.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, leading to more logical and structured outputs.
- Improved Response Quality: Fine-tuned to produce higher quality, aligned responses based on a preference dataset.
- Direct Use: As a merged model, it can be loaded directly with the `transformers` library without additional configuration.
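Because the weights are fully merged, a standard `transformers` loading flow is enough. The sketch below is illustrative: the generation settings and the example question are assumptions, not values from this card, and the `build_chat_prompt` helper is a hypothetical convenience wrapper around the tokenizer's chat template.

```python
# Hypothetical usage sketch for the merged model; only the model id
# comes from this card, everything else is an illustrative assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "dstaka/dpo-qwen-cot-merged"

def build_chat_prompt(tokenizer, question: str) -> str:
    """Render a single-turn user message into the model's chat format."""
    messages = [{"role": "user", "content": question}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

def generate_answer(question: str, max_new_tokens: int = 512) -> str:
    """Load the merged 16-bit weights and generate one response."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="bfloat16", device_map="auto"
    )
    prompt = build_chat_prompt(tokenizer, question)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Example:
# print(generate_answer("A train covers 60 km in 45 minutes. What is its speed in km/h?"))
```

Since no adapter is involved, there is no `peft` dependency and no call to merge or attach LoRA weights; the checkpoint behaves like any other causal LM on the Hub.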
Training Details
The model was trained for 1 epoch with a learning rate of 1e-5, a DPO beta of 0.3, and a maximum sequence length of 1024 tokens, using the u-10bei/dpo-dataset-qwen-cot preference dataset.
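For context, the beta value above is the temperature parameter in the standard DPO objective (Rafailov et al., 2023), which this training presumably follows; with chosen response $y_w$ and rejected response $y_l$, the loss is

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

where $\pi_\theta$ is the policy being fine-tuned, $\pi_{\text{ref}}$ is the frozen reference model, and $\beta = 0.3$ controls how strongly the policy is penalized for drifting from the reference.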
Good For
- Applications requiring improved reasoning and structured output generation.
- Tasks where response alignment and coherence are critical.
- Developers seeking a readily deployable, DPO-optimized Qwen3-4B variant.