amu870/test-v2.1-dpo
The amu870/test-v2.1-dpo model is a 4 billion parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) via the Unsloth library. It is optimized to improve reasoning (Chain-of-Thought) and structured response quality based on a preference dataset. This model provides full-merged 16-bit weights, requiring no adapter loading, and is designed for direct use in applications requiring enhanced response alignment.
Loading preview...
Model Overview
amu870/test-v2.1-dpo is a 4 billion parameter instruction-tuned model, built upon the Qwen/Qwen3-4B-Instruct-2507 base model. It has been fine-tuned using Direct Preference Optimization (DPO), a method aimed at aligning model responses with preferred human outputs. The fine-tuning process leveraged the Unsloth library and resulted in a full-merged 16-bit model, simplifying deployment as no separate adapter loading is required.
Key Optimizations
This model's primary optimization focus is on enhancing:
- Reasoning capabilities: Specifically, improving Chain-of-Thought processes.
- Structured response quality: Ensuring outputs are well-organized and aligned with desired formats.
These improvements stem from its DPO training against a specific preference dataset, distinguishing it from models trained solely with supervised fine-tuning.
Usage and Integration
As a merged model, amu870/test-v2.1-dpo can be directly loaded and used with the Hugging Face transformers library. It supports standard inference workflows for causal language models, making it straightforward to integrate into existing Python environments. Users should be aware that the model's license terms follow those of the original base model and the training data used.