ogwata/exp7-dpo-baseline

Text Generation | Concurrency Cost: 1 | Model Size: 4B | Quant: BF16 | Ctx Length: 32k | Published: Feb 13, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights | Warm

ogwata/exp7-dpo-baseline is a 4-billion-parameter, Qwen3-based causal language model fine-tuned with Direct Preference Optimization (DPO) using Unsloth. It is optimized to improve reasoning, particularly Chain-of-Thought, and the quality of structured responses. It targets tasks that need aligned, coherent outputs learned from preference data, and is offered as a specialized alternative to general-purpose LLMs.


Model Overview

ogwata/exp7-dpo-baseline is a 4-billion-parameter language model fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. Fine-tuning used Direct Preference Optimization (DPO) with the Unsloth library, so that the model favors the preferred response over the rejected one in each training pair.

Key Capabilities & Optimization

This model's primary optimization targets include:

  • Improved Reasoning: Tuned to generate Chain-of-Thought (CoT) reasoning, so that problems are worked through in explicit, logical steps.
  • Structured Response Quality: Optimized to produce well-structured outputs that follow the preference dataset used during training.
  • DPO Alignment: Uses DPO to align model behavior with the preferences encoded in the training data, with the aim of reducing undesirable outputs and increasing helpfulness.
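
The DPO objective mentioned above can be sketched for a single preference pair as follows. This is a minimal illustration of the standard DPO loss, not the Unsloth training code used for this model. The function names and the example log-probabilities are hypothetical, and beta defaults to the 0.1 reported under Training Details.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more (or less) likely each response became relative to the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the policy favors the chosen response more strongly.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# Hypothetical log-probabilities, for illustration only.
loss_at_start = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # policy identical to reference
loss_after_shift = dpo_loss(-8.0, -12.0, -10.0, -10.0)  # policy now prefers "chosen"
```

When the policy equals the reference, the margin is zero and the loss is log 2. The loss falls below that value once the policy assigns relatively more probability to the chosen response than the reference does.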

Training Details

The model was trained with DPO for 1 epoch at a learning rate of 5e-06 and a beta of 0.1, with a maximum sequence length of 1024. The LoRA adapter (r=8, alpha=16) was merged into the base model, so the released checkpoint contains full 16-bit weights and loads without a separate adapter.
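
To illustrate why a merged checkpoint needs no adapter at load time, the sketch below folds a LoRA update into a weight matrix using the conventional scaling alpha / r, which is 16 / 8 = 2 for the values above. It assumes the standard LoRA formulation (W' = W + (alpha / r) · B · A). The card does not state whether a variant such as rank-stabilized scaling was used, and the toy matrices and helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, r, alpha):
    """Return W + (alpha / r) * (B @ A), the merged weight matrix."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


R, ALPHA = 8, 16  # values reported in Training Details

# Toy 2x2 layer, for illustration only; real layers are far larger.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[0.5, 0.0] for _ in range(R)]  # shape (r x in)  = 8 x 2
lora_b = [[0.25] * R, [0.0] * R]         # shape (out x r) = 2 x 8

merged = merge_lora(weight, lora_a, lora_b, R, ALPHA)

# The merged layer gives the same output as base layer plus adapter path.
x = [[1.0], [2.0]]  # column vector
y_merged = matmul(merged, x)
base_out = matmul(weight, x)
adapter_out = matmul(lora_b, matmul(lora_a, x))
y_adapter = [[b[0] + (ALPHA / R) * a[0]] for b, a in zip(base_out, adapter_out)]
```

Because `y_merged` and `y_adapter` agree, inference with the merged weights reproduces the adapter's effect without the adapter being loaded.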

Ideal Use Cases

This model is particularly well-suited for applications where:

  • Reasoning tasks are critical, especially those benefiting from explicit Chain-of-Thought generation.
  • Structured and aligned outputs are preferred, such as in data extraction, summarization, or controlled generation scenarios.
  • A 4B-parameter model is desired for efficient deployment while still offering specialized performance in reasoning and response quality.
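
For deployment planning, a rough lower bound on memory for the weights alone can be worked out from the listed size and precision. The sketch below assumes the nominal "4B" parameter count from the listing (the exact count is not stated) and 2 bytes per parameter for BF16. It ignores the KV cache, activations, and runtime overhead, so actual usage will be higher. The function name is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed to hold the weights, in GiB."""
    return num_params * bytes_per_param / 2**30


NOMINAL_PARAMS = 4e9  # "4B" as listed; an assumption, not an exact count
BF16_BYTES = 2        # bfloat16 stores each parameter in 16 bits

estimate = weight_memory_gib(NOMINAL_PARAMS, BF16_BYTES)
print(f"Weights only: about {estimate:.2f} GiB")  # prints "Weights only: about 7.45 GiB"
```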