XKilin/DPO_v1_20260207

Text Generation · Model Size: 4B · Quant: BF16 · Context Length: 32k · Published: Feb 7, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

XKilin/DPO_v1_20260207 is a 4-billion-parameter, Qwen3-based causal language model fine-tuned by XKilin with Direct Preference Optimization (DPO). It supports a 40960-token context length and is optimized to improve Chain-of-Thought reasoning and structured response quality, targeting applications that require aligned, high-quality text generation.


Model Overview

XKilin/DPO_v1_20260207 is a 4-billion-parameter language model developed by XKilin on top of Qwen3-4B-Instruct-2507. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library and is distributed as a fully merged 16-bit checkpoint, so no adapter loading is required.

Key Capabilities

  • Enhanced Reasoning: Optimized through DPO to improve Chain-of-Thought reasoning abilities.
  • Structured Response Quality: Aligned to produce higher quality and more structured outputs based on preference datasets.
  • Direct Use: Distributed as a merged checkpoint, ready for direct use with transformers, as shown in the loading sketch below.
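
Because the checkpoint is fully merged, it loads like any standard causal LM. Below is a minimal inference sketch, assuming the repository id above and a recent transformers release with Qwen3 support; the prompt text is illustrative only:

```python
# Minimal inference sketch (assumes a recent transformers release with Qwen3 support).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "XKilin/DPO_v1_20260207"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the card lists BF16 weights
    device_map="auto",
)

# Qwen3-style chat formatting via the tokenizer's built-in chat template.
messages = [{"role": "user", "content": "Explain step by step why 17 is prime."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```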

Training Details

The model was trained for 2 epochs with a learning rate of 1e-4 and a DPO beta of 0.1, using a maximum sequence length of 1024 tokens during DPO training. Training used the u-10bei/dpo-dataset-qwen-cot preference dataset. The model is released under the MIT License; users must also comply with the license terms of the original base model.
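
For reference, a comparable run could be reproduced with trl's DPOTrainer using the hyperparameters listed above. This is a hedged approximation, not the author's exact recipe: the card states training was done via Unsloth, the dataset column layout (prompt/chosen/rejected) is assumed, and the output path is a placeholder.

```python
# Hypothetical reproduction sketch with trl (the card's actual run used Unsloth).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "Qwen/Qwen3-4B-Instruct-2507"  # base model named in the card
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Preference dataset named in the card; prompt/chosen/rejected columns assumed.
dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

config = DPOConfig(
    output_dir="dpo_v1",    # placeholder path
    beta=0.1,               # DPO beta from the card
    learning_rate=1e-4,     # from the card
    num_train_epochs=2,     # from the card
    max_length=1024,        # max sequence length during DPO, from the card
)

trainer = DPOTrainer(
    model=model,            # with no ref_model given, trl clones one internally
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
trainer.save_model("dpo_v1")  # saved merged, as the checkpoint is distributed
```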