bam2app/dpo-qwen-cot-merged_v3
The bam2app/dpo-qwen-cot-merged_v3 is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). It is optimized to strengthen Chain-of-Thought (CoT) reasoning and improve structured response quality, and is intended for tasks that demand logical coherence and adherence to preferred output formats.
Model Overview
The bam2app/dpo-qwen-cot-merged_v3 is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It was trained with Direct Preference Optimization (DPO) using the Unsloth library, and the result is a merged 16-bit weight checkpoint that requires no adapter loading.
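Because the weights are merged in 16-bit, the checkpoint can be loaded directly with the Hugging Face transformers library, with no PEFT adapter step. A minimal inference sketch (the prompt and generation settings are illustrative, not prescribed by this card):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bam2app/dpo-qwen-cot-merged_v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # picks up the merged 16-bit weights
    device_map="auto",
)

# Chat-style prompt; the instruct base model expects the chat template.
messages = [{"role": "user", "content": "Solve step by step: what is 17 * 23?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Note that the training context was capped at 1024 tokens, so very long prompts may fall outside the distribution the DPO stage saw.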
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, leading to more logical and coherent outputs.
- Structured Response Quality: Fine-tuned to align responses with preferred outputs, enhancing the quality and structure of generated text.
- DPO Alignment: Utilizes DPO to align model behavior with human preferences, focusing on specific response characteristics.
Training Details
The model was trained for 1 epoch with a learning rate of 3e-06, a DPO beta of 0.2, and a maximum sequence length of 1024 tokens, on the u-10bei/dpo-dataset-qwen-cot dataset.
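The beta value above controls how strongly the DPO objective penalizes drift from the reference model: the standard DPO loss is the negative log-sigmoid of beta times the difference in policy-vs-reference log-ratios between the chosen and rejected responses. A small self-contained sketch of that formula (sequence log-probabilities here are made-up placeholders, not values from this model):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.2):
    """Per-pair DPO loss.

    Each argument is a total sequence log-probability: pi_* from the policy
    being trained, ref_* from the frozen reference model. beta=0.2 matches
    the value reported in this card's training details.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log sigmoid(beta * margin); loss falls as the policy prefers
    # the chosen response more than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy identical to reference: margin is 0, loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # ~0.6931
# Policy widens the chosen/rejected gap: loss drops below log(2).
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

A higher beta sharpens the preference signal at the cost of allowing larger divergence from the reference policy; 0.2 is a moderate setting.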
Good For
- Applications requiring improved logical reasoning and step-by-step thought processes.
- Generating structured and high-quality responses that adhere to specific formats or preferences.
- Tasks where alignment with preferred outputs is critical for performance.