KS150/testDPO

Text Generation · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Feb 27, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

KS150/testDPO is a 4-billion-parameter, Qwen3-based, instruction-tuned language model fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library. It is optimized to strengthen Chain-of-Thought (CoT) reasoning and structured response quality, and is intended for applications that require aligned, high-quality outputs on reasoning tasks.


Model Overview

KS150/testDPO is a 4-billion-parameter language model built on the Qwen3-4B-Instruct-2507 base model. It has been fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs. The release provides fully merged 16-bit weights, so no adapter loading is required.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning capabilities.
  • Structured Response Quality: Focuses on delivering higher quality and more structured outputs.
  • Direct Preference Optimization: Utilizes DPO for better alignment with desired response patterns.

Training Details

The model was trained for 3 epochs of DPO with a learning rate of 7e-04 and a beta value of 0.1. Training used a maximum sequence length of 256 tokens and a LoRA configuration (r=8, alpha=16) whose adapters were subsequently merged into the base model. The training data is u-10bei/dpo-dataset-qwen-cot.
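To make the beta parameter concrete, here is a minimal sketch of the DPO objective for a single preference pair, in plain Python. This is an illustration of the standard DPO loss, not code from this model's training run; the log-probability values below are hypothetical.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of the chosen or
    rejected response under the policy or the frozen reference model.
    """
    # Implicit reward margins relative to the reference model
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), computed stably as softplus(-logits)
    return math.log1p(math.exp(-logits)) if logits > -30 else -logits

# Hypothetical pair: the policy prefers the chosen response
# more strongly than the reference model does, so loss < log(2).
loss = dpo_loss(-10.0, -14.0, -11.0, -12.0, beta=0.1)
```

A small beta (here 0.1) makes the loss less sensitive to the margin, keeping the policy close to the reference model during training.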

Usage

Because the adapters are merged, KS150/testDPO can be used directly with the transformers library for inference; it supports torch.float16 and device_map="auto" for efficient deployment. The model is released under the MIT License, and users must also comply with the original base model's license terms.
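A minimal inference sketch using the standard transformers chat API, per the loading options above; the prompt and generation parameters are illustrative, not taken from the model card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KS150/testDPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # merged 16-bit weights, as noted above
    device_map="auto",          # place layers on available devices
)

# Illustrative reasoning-style prompt
messages = [{"role": "user", "content": "Explain step by step: what is 17 * 23?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

No PEFT or adapter loading step is needed, since the LoRA weights are already merged into the checkpoint.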