Model Overview
ogwata/exp7-dpo-baseline is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507. Fine-tuning used Direct Preference Optimization (DPO) with the Unsloth library to align the model's responses with the preferred outputs in a preference dataset.
Key Capabilities & Optimization
This model's primary optimization targets include:
- Improved Reasoning: Enhanced ability to generate Chain-of-Thought (CoT) reasoning, leading to more logical and step-by-step problem-solving.
- Structured Response Quality: Optimized for producing high-quality, structured outputs based on the preference dataset used during training.
- DPO Alignment: Leverages DPO to better align model behavior with human preferences, reducing undesirable outputs and increasing helpfulness.
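Since the model is tuned to produce Chain-of-Thought reasoning, prompts that explicitly invite step-by-step thinking tend to play to its strengths. The sketch below hand-builds a prompt in the ChatML-style format used by the Qwen family; the exact template string is an assumption here, and in practice you should rely on the tokenizer's built-in chat template instead.

```python
def build_chatml_prompt(system: str, user: str) -> str:
    """Format a single-turn prompt in the ChatML style used by Qwen models.

    Note: this template is an illustrative assumption; prefer
    tokenizer.apply_chat_template() for the authoritative format.
    """
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"  # generation continues from here
    )

prompt = build_chatml_prompt(
    "You are a helpful assistant. Think step by step before answering.",
    "A train travels 120 km in 1.5 hours. What is its average speed?",
)
```

Asking for step-by-step reasoning in the system message is one simple way to elicit the CoT behavior the DPO training reinforced.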
Training Details
The model underwent one epoch of DPO training with a learning rate of 5e-6, a beta value of 0.1, and a maximum sequence length of 1,024 tokens. Training used LoRA adapters (r=8, alpha=16), which were then merged into the base model, yielding full 16-bit weights that load without any adapter files.
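To make the role of the beta value concrete, the sketch below computes the standard per-example DPO objective, -log σ(β·((log π_chosen − log π_ref,chosen) − (log π_rejected − log π_ref,rejected))), in plain Python. The log-probability values are illustrative placeholders, not taken from this model's training run.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (margin_chosen - margin_rejected))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid(logits)

# Equal margins express no preference, giving the chance-level loss log 2 ≈ 0.693.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
```

A small beta such as 0.1 softens the implicit reward margin, keeping the policy close to the reference model while still pushing chosen completions above rejected ones.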
Ideal Use Cases
This model is particularly well-suited for applications where:
- Reasoning tasks are critical, especially those benefiting from explicit Chain-of-Thought generation.
- Structured and aligned outputs are preferred, such as in data extraction, summarization, or controlled generation scenarios.
- A 4B-parameter model is desired for efficient deployment while still offering specialized performance in reasoning and response quality.
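Because the LoRA adapters were merged into the base model, the checkpoint loads like any standard causal LM. A minimal loading sketch, assuming the transformers library is installed; the dtype and device settings are illustrative choices, not requirements.

```python
def load_model(model_id: str = "ogwata/exp7-dpo-baseline"):
    """Load the merged 16-bit weights; no adapter files are required."""
    # Imported lazily so the sketch itself has no hard dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # merged weights are 16-bit
        device_map="auto",           # place layers across available devices
    )
    return tokenizer, model
```

Loading with bfloat16 keeps the 4B model's memory footprint around 8 GB, which is what makes it practical for the efficient-deployment scenarios above.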