ogwata/exp11-sft-dpo-beta02

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 18, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

ogwata/exp11-sft-dpo-beta02 is a 4 billion parameter instruction-tuned causal language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). This model is specifically optimized to improve reasoning capabilities (Chain-of-Thought) and generate high-quality structured responses. It is suitable for applications requiring enhanced logical coherence and precise output formatting.

Loading preview...

Model Overview

ogwata/exp11-sft-dpo-beta02 is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, resulting in a merged 16-bit weight model that requires no adapter loading.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning, enabling more logical and coherent response generation.
  • Structured Output Quality: Fine-tuned to produce higher quality structured responses, making it suitable for tasks requiring specific output formats.
  • Direct Use: As a full-merged model, it can be used directly with the transformers library for inference.

Training Details

The model was trained for 1 epoch with a learning rate of 5e-07 and a beta value of 0.2, using a maximum sequence length of 1024. The DPO training utilized the u-10bei/dpo-dataset-qwen-cot dataset, which focuses on preference alignment for reasoning and structured responses.

Licensing

The model operates under the MIT License, consistent with the terms of its training dataset. Users must also adhere to the original base model's license terms.