Overview
This model, hiro7ka/dpo-qwen-cot-merged-ver3a, is a 4 billion parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, specifically targeting improvements in reasoning and structured response generation.
Key Optimizations
The primary objective of this DPO fine-tuning was to align the model's outputs with preferred responses, with a strong emphasis on:
- Chain-of-Thought (CoT) Reasoning: Enhancing the model's ability to generate logical, step-by-step reasoning processes.
- Structured Response Quality: Improving the precision and format of the model's outputs based on a preference dataset.
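The preference alignment above relies on the standard DPO objective, which can be sketched in plain Python. This is a simplified, illustrative implementation of the per-pair loss, not the actual Unsloth training code; the function name and example log-probabilities are ours:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the summed token log-probability of a full response
    under the policy or the frozen reference model. beta (0.1, as used
    for this model) scales how far the policy may drift from the reference.
    """
    # Implicit reward of each response, measured relative to the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Loss is -log sigmoid(beta * (chosen_margin - rejected_margin)).
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# When the policy already prefers the chosen response, the loss is small;
# when it prefers the rejected one, the loss grows.
low = dpo_loss(-10.0, -30.0, -20.0, -20.0)   # policy favors chosen
high = dpo_loss(-30.0, -10.0, -20.0, -20.0)  # policy favors rejected
```

Minimizing this loss pushes the policy to assign a higher implicit reward to the chosen response than to the rejected one, without a separate reward model.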
Technical Details
This repository provides the fully merged 16-bit weights, so no separate adapter loading is required. Training ran for 0.5 epochs with a learning rate of 1e-07, a DPO beta of 0.1, and a maximum sequence length of 1024. The model is released under the MIT License, consistent with its training data source; users must also adhere to the original base model's license terms.
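The reported hyperparameters map onto an Unsloth + TRL training setup roughly as follows. This is a hedged sketch: only the base model, 0.5 epochs, learning rate 1e-07, beta 0.1, and max sequence length 1024 come from this card; `preference_dataset`, the output directory, and all other arguments are illustrative placeholders, not the author's actual script:

```python
from unsloth import FastLanguageModel
from trl import DPOConfig, DPOTrainer

# Base model and sequence length stated in this card.
model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=1024,
)

config = DPOConfig(
    beta=0.1,               # stated in this card
    learning_rate=1e-7,     # stated in this card
    num_train_epochs=0.5,   # stated in this card
    max_length=1024,
    output_dir="dpo-qwen-cot",  # placeholder
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=preference_dataset,  # placeholder: chosen/rejected pairs
    processing_class=tokenizer,
)
trainer.train()
```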
Usage Considerations
Because the weights are fully merged, the model can be loaded and run directly with the transformers library for inference. It is particularly well suited to tasks where robust reasoning and well-structured, high-quality outputs are critical.
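A minimal inference sketch with transformers follows. The prompt and generation settings are illustrative choices of ours, not recommendations from the model author:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hiro7ka/dpo-qwen-cot-merged-ver3a"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # merged weights are 16-bit
    device_map="auto",
)

messages = [
    {"role": "user", "content": "A train travels 120 km in 1.5 hours. "
                                "What is its average speed? Think step by step."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Applying the chat template is important here: the base model is an instruct model, and the DPO preference data targets structured, step-by-step answers to conversational prompts.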