Model Overview
KS150/testDPO is a 4-billion-parameter language model built on the Qwen3-4B-Instruct-2507 base model. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs. The model ships as fully merged 16-bit weights, so no adapter loading is required.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning capabilities.
- Structured Response Quality: Focuses on delivering higher quality and more structured outputs.
- Direct Preference Optimization: Utilizes DPO for better alignment with desired response patterns.
Training Details
The model underwent 3 epochs of DPO training with a learning rate of 7e-04 and a beta value of 0.1, using a maximum sequence length of 256. Training used a LoRA configuration (r=8, alpha=16) whose adapters have since been merged into the base model. The training data comes from the u-10bei/dpo-dataset-qwen-cot dataset.
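For reference, the per-example DPO objective that this kind of training optimizes can be sketched in plain Python. This is an illustrative re-implementation of the standard DPO loss with the beta=0.1 used here, not the actual Unsloth training code; the function name and arguments are assumptions for the sketch.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Illustrative per-example DPO loss (not the actual training code).

    Inputs are summed token log-probabilities of the chosen and rejected
    responses under the policy model and the frozen reference model.
    The loss is -log(sigmoid(beta * margin)), where the margin compares
    how much the policy shifts toward the chosen response relative to
    the reference.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # Numerically stable -log(sigmoid(logits)) = log(1 + exp(-logits))
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy matches the reference on both responses the margin is zero and the loss is log(2); the loss falls below that as the policy learns to prefer the chosen response.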
Usage
As a merged model, KS150/testDPO can be used directly with the transformers library for inference, supporting torch.float16 and device_map="auto" for efficient deployment. The model is released under the MIT License; users must also comply with the base model's license terms.
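A minimal inference sketch along those lines is shown below. The prompt and generation settings are illustrative assumptions; the model load itself is kept under a main guard since it downloads several gigabytes of weights.

```python
# Illustrative generation settings (assumptions, tune for your use case)
GEN_KWARGS = {"max_new_tokens": 512, "temperature": 0.7, "do_sample": True}

def build_messages(user_prompt):
    # Chat-style input for the instruct-tuned model
    return [{"role": "user", "content": user_prompt}]

if __name__ == "__main__":
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "KS150/testDPO"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,   # merged 16-bit weights
        device_map="auto",
    )

    messages = build_messages("Explain chain-of-thought prompting briefly.")
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, **GEN_KWARGS)
    # Decode only the newly generated tokens
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```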