The amu870/PiG-v0-dpo model is a 4 billion parameter instruction-tuned causal language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) via Unsloth. The fine-tuning targets stronger reasoning, particularly Chain-of-Thought, and better-structured responses. It is intended for applications that need outputs aligned with preferred response patterns.
Overview
amu870/PiG-v0-dpo is a 4 billion parameter language model, fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. It leverages Direct Preference Optimization (DPO), implemented with the Unsloth library, to align its responses with preferred outputs. This model is provided with full-merged 16-bit weights, eliminating the need for adapter loading.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning, leading to more structured and logical outputs.
- Improved Response Quality: Fine-tuned using a preference dataset to enhance the overall quality and alignment of generated responses.
- Direct Preference Optimization (DPO): Utilizes DPO for effective alignment, focusing on preferred output patterns.
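To make the alignment mechanism concrete, here is a minimal sketch of the per-pair DPO loss: the negative log-sigmoid of the beta-scaled difference between the policy's and reference model's log-probability margins on chosen vs. rejected responses. This is the standard DPO objective, not code from this model's training run; the function name and scalar inputs are illustrative.

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss.

    Inputs are summed log-probabilities of the chosen/rejected responses
    under the policy (pi_*) and the frozen reference model (ref_*).
    Lower loss means the policy prefers the chosen response more strongly
    than the reference does.
    """
    # beta scales how far the policy may drift from the reference.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)), written with log1p for numerical stability.
    return math.log1p(math.exp(-margin))
```

With equal log-probabilities everywhere the margin is zero and the loss is log(2); as the policy favors the chosen response relative to the reference, the loss falls below that.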
Training Details
The model was trained for 1 epoch with a learning rate of 1e-07 and a beta value of 0.1. It used a maximum sequence length of 1024 during training. The training data source is the u-10bei/dpo-dataset-qwen-cot dataset.
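The reported hyperparameters map naturally onto a TRL-style DPO configuration. The sketch below is an assumption about the wiring, not the actual training script: the hyperparameter values, dataset id, and base-model id come from this card, while the trainer invocation (and the `build_trainer` helper) are illustrative.

```python
# Hyperparameters reported on the card.
dpo_hyperparams = {
    "num_train_epochs": 1,    # 1 epoch
    "learning_rate": 1e-7,
    "beta": 0.1,              # DPO preference-strength / KL-penalty coefficient
    "max_length": 1024,       # maximum sequence length during training
}
DATASET_ID = "u-10bei/dpo-dataset-qwen-cot"
BASE_MODEL_ID = "Qwen/Qwen3-4B-Instruct-2507"

def build_trainer(model, ref_model, tokenizer, train_dataset):
    """Assumed wiring via TRL's DPOTrainer; requires `pip install trl`
    and is deliberately not executed here."""
    from trl import DPOConfig, DPOTrainer
    config = DPOConfig(output_dir="pig-v0-dpo", **dpo_hyperparams)
    return DPOTrainer(
        model=model,
        ref_model=ref_model,
        args=config,
        train_dataset=train_dataset,
        processing_class=tokenizer,
    )
```

The very low learning rate (1e-07) and single epoch are typical for DPO fine-tuning, where the goal is a gentle preference-driven adjustment rather than a large shift away from the instruction-tuned base.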
Usage
Because the 16-bit weights are fully merged, amu870/PiG-v0-dpo can be used directly with the transformers library for inference, with no adapter loading step. Users should adhere to the MIT License of the training data and to the original base model's license terms.
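A minimal inference sketch with transformers is shown below. The model id comes from this card; the `chat` helper and generation settings are illustrative, and the first call downloads roughly 8 GB of 16-bit weights, so a GPU with sufficient memory (or `device_map="auto"` offloading) is assumed.

```python
def chat(prompt: str,
         model_id: str = "amu870/PiG-v0-dpo",
         max_new_tokens: int = 256) -> str:
    """Generate a response from the merged model (illustrative helper)."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",   # loads the merged 16-bit weights as stored
        device_map="auto",
    )
    # Format the prompt with the model's chat template.
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens.
    return tokenizer.decode(
        output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True
    )
```

For example, `chat("Explain step by step why 17 is prime.")` should produce a Chain-of-Thought style answer, reflecting the reasoning focus of the DPO fine-tune.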