Model Overview
Umezaki/dpo-qwen-cot-merged is a 4-billion-parameter language model built on the Qwen/Qwen3-4B-Instruct-2507 base model. It was fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library to improve response quality and alignment.
Key Capabilities
- Improved Reasoning (Chain-of-Thought): The model's primary optimization target was to enhance its ability to generate logical, step-by-step reasoning processes.
- Structured Response Quality: DPO training focused on aligning the model's outputs with preferred formats and structures, based on a specific preference dataset.
- Full-Merged Weights: This repository provides the full 16-bit merged weights, eliminating the need for adapter loading and simplifying deployment.
Training Details
The model underwent 1 epoch of DPO training with a learning rate of 1e-07 and a beta value of 0.1. It utilized a maximum sequence length of 1024 during training. The LoRA configuration (r=8, alpha=16) was merged into the base model.
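To make the beta=0.1 hyperparameter concrete, here is a minimal illustration of the per-pair DPO loss on sequence log-probabilities. This is a sketch of the standard DPO objective, not the Unsloth/TRL implementation; the function name and example log-probability values are illustrative.

```python
import math

def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy still matches the reference model, the margin is zero
# and the loss is log(2) ~= 0.693; it shrinks as the policy assigns
# relatively more probability to the chosen response.
print(dpo_pair_loss(-10.0, -12.0, -10.0, -12.0))
```

A small beta like 0.1 keeps the margin term gentle, so the fine-tuned policy stays close to the reference model while still preferring the chosen responses.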
Good For
- Applications requiring enhanced logical reasoning and problem-solving.
- Generating structured outputs that adhere to specific formats.
- Tasks where response alignment and quality are critical.
Usage
Because the weights are fully merged, the model can be loaded directly with the transformers library for inference, with no adapter-loading step. The model is released under the MIT license, inherited from its training data; users must also comply with the license terms of the original base model.
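A minimal loading sketch with transformers is shown below. The model ID comes from this card; the prompt and generation parameters are illustrative, and running it requires downloading the full 16-bit weights.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Umezaki/dpo-qwen-cot-merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Qwen3-Instruct models are chat models, so format the prompt
# with the tokenizer's chat template.
messages = [{"role": "user", "content": "Solve step by step: what is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Decoding only the tokens after the prompt (`outputs[0][inputs.shape[-1]:]`) returns just the model's reply rather than echoing the conversation.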