HidekiKawai/dpo-qwen-cot-merged
HidekiKawai/dpo-qwen-cot-merged is a fine-tuned Qwen-based language model, optimized using Direct Preference Optimization (DPO) via Unsloth. This model focuses on enhancing reasoning capabilities through Chain-of-Thought (CoT) and improving structured response quality. It is provided as a full-merged 16-bit model, ready for direct use in applications requiring aligned and coherent text generation.
Loading preview...
Overview
This model, HidekiKawai/dpo-qwen-cot-merged, is a fine-tuned version of HidekiKawai/sft-qwen-merged. It leverages Direct Preference Optimization (DPO) with the Unsloth library to align its responses with preferred outputs.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, leading to more structured and logical outputs.
- Improved Response Quality: Fine-tuned to produce higher quality, aligned responses based on a preference dataset.
- Direct Use: Provided as a full-merged 16-bit model, eliminating the need for adapter loading and simplifying deployment with
transformers.
Training Details
- Base Model:
HidekiKawai/sft-qwen-merged - Optimization Method: DPO (Direct Preference Optimization)
- Epochs: 3
- Learning Rate: 2e-05
- Max Sequence Length: 1024
- Training Data: Utilizes the u-10bei/dpo-dataset-qwen-cot dataset for preference alignment.
Usage
This model can be directly loaded and used with the transformers library for inference, as it contains the merged 16-bit weights. Users should ensure compliance with the MIT License and the original base model's license terms.