The yusaaihara/llm_dpo model is a 4-billion-parameter language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). It is optimized to strengthen Chain-of-Thought reasoning and improve the quality of structured responses, and is intended for applications that require aligned, high-quality text generation.
Model Overview
yusaaihara/llm_dpo is a 4-billion-parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library, and its 16-bit weights were fully merged into the base model, so no adapter loading is required.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning processes.
- Structured Response Quality: Focuses on generating higher quality, more structured outputs.
- Preference Alignment: Aligned with preferred outputs through DPO training on the u-10bei/dpo-dataset-qwen-cot preference dataset.
Training Details
The model underwent 1 epoch of DPO training with a learning rate of 1e-7, a beta value of 0.1, and a maximum sequence length of 1,024 tokens. The training data was sourced from u-10bei/dpo-dataset-qwen-cot.
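The hyperparameters above map naturally onto TRL's `DPOConfig`/`DPOTrainer`. The original run used Unsloth and its exact training script is not published here, so the following is only an illustrative sketch of how the same settings could be reproduced with TRL:

```python
# Illustrative reconstruction of the DPO run described above using TRL.
# The actual training used Unsloth; argument names below follow TRL's API.

# Hyperparameters as reported on this card.
DPO_HPARAMS = {
    "num_train_epochs": 1,     # 1 epoch of DPO training
    "learning_rate": 1e-7,     # learning rate
    "beta": 0.1,               # DPO KL-penalty strength
    "max_length": 1024,        # maximum sequence length in tokens
}

BASE_MODEL = "Qwen/Qwen3-4B-Instruct-2507"
DATASET = "u-10bei/dpo-dataset-qwen-cot"


def main() -> None:
    # Imports deferred so the recipe can be read without trl installed;
    # requires `pip install trl transformers datasets` plus a GPU in practice.
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    dataset = load_dataset(DATASET, split="train")

    config = DPOConfig(output_dir="llm_dpo", **DPO_HPARAMS)
    trainer = DPOTrainer(
        model=model,
        args=config,
        train_dataset=dataset,
        processing_class=tokenizer,
    )
    trainer.train()


if __name__ == "__main__":
    main()
```

Note that at a learning rate of 1e-7 the single epoch makes only a gentle preference-alignment pass over the base model's weights rather than a full retraining.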
Usage Considerations
This model is ready for direct use with the transformers library. It is released under the MIT license, following the dataset's terms, and users must also comply with the license of the original base model.
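Because the DPO-tuned weights are merged into the checkpoint, the model loads like any standalone transformers model, with no PEFT/adapter step. A minimal sketch (the prompt and generation settings are illustrative, not recommendations):

```python
MODEL_ID = "yusaaihara/llm_dpo"


def build_inputs(tokenizer, question: str):
    """Format a single-turn chat prompt via the tokenizer's chat template."""
    messages = [{"role": "user", "content": question}]
    return tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )


if __name__ == "__main__":
    # Requires `pip install transformers torch`; downloads the model weights.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # No PeftModel / adapter loading needed: the 16-bit DPO weights are merged.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="auto",   # keep the merged 16-bit weights as stored
        device_map="auto",    # place layers on available GPU(s)/CPU
    )

    input_ids = build_inputs(tokenizer, "Explain step by step: what is 17 * 24?")
    output = model.generate(input_ids.to(model.device), max_new_tokens=256)
    # Decode only the newly generated tokens, skipping the prompt.
    print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Using `apply_chat_template` rather than a hand-written prompt keeps the input consistent with the Qwen instruction format the base model expects.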