kenzrx/dpo-ori-qwen-cot-merged

Text Generation · Concurrency Cost: 1 · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: Feb 11, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights · Warm

The kenzrx/dpo-ori-qwen-cot-merged model is a 4-billion-parameter, Qwen3-based, instruction-tuned language model fine-tuned in two stages: Supervised Fine-Tuning (SFT) on high-quality reference answers, followed by Direct Preference Optimization (DPO) to align outputs with preferred responses. By optimizing for chosen outputs over rejected ones, the model excels at generating structured, aligned responses and is well suited to tasks that require precise formatting and adherence to a desired output structure.


Model Overview

The kenzrx/dpo-ori-qwen-cot-merged model is a 4 billion parameter language model built upon the Qwen3-4B-Instruct-2507 base. It has undergone a two-stage fine-tuning process to enhance its response quality and alignment.

Training Stages

  1. Supervised Fine-Tuning (SFT): The model was first fine-tuned on the structured_data_with_cot_dataset_v2 dataset to learn high-quality reference answers and the required output formatting.
  2. Direct Preference Optimization (DPO): After SFT, the model was further optimized with DPO, using the same structured_data_with_cot_dataset_v2 as a preference dataset. This stage trains the model to prefer "chosen" outputs over "rejected" ones for a given prompt, improving response alignment and structural quality (see the training sketch after this list).
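The DPO stage maps naturally onto TRL's DPOTrainer. The sketch below is illustrative only: the dataset path, hyperparameters, and training settings are assumptions for demonstration, not values published with this model.

```python
# Illustrative sketch of the DPO stage with TRL's DPOTrainer.
# Dataset path and hyperparameters below are assumptions, not
# values published with this model.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Start from the intermediate SFT checkpoint named in the lineage.
model_name = "kenzrx/qwen3-4b-sft-merged"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference data needs "prompt", "chosen", and "rejected" columns;
# the local JSONL path here is an assumption.
dataset = load_dataset(
    "json", data_files="structured_data_with_cot_dataset_v2.jsonl"
)["train"]

config = DPOConfig(
    output_dir="dpo-ori-qwen-cot",
    beta=0.1,                        # preference-penalty strength (assumed)
    per_device_train_batch_size=2,
    learning_rate=5e-6,
)

trainer = DPOTrainer(
    model=model,                     # ref_model defaults to a frozen copy
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,      # older TRL versions use tokenizer=
)
trainer.train()
```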

Key Characteristics

  • Full-merged 16-bit weights: No adapter loading is required, which simplifies deployment (see the loading sketch after this list).
  • DPO Alignment: Optimized to produce responses that are aligned with preferred examples, making it suitable for tasks where output structure and quality are critical.
  • Lineage: Derived from Qwen/Qwen3-4B-Instruct-2507, with an intermediate SFT stage (kenzrx/qwen3-4b-sft-merged).
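Because the repository ships fully merged 16-bit weights, loading is a single from_pretrained call with no PEFT step. A minimal sketch; device_map="auto" assumes an accelerator is available:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kenzrx/dpo-ori-qwen-cot-merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Weights are already merged at BF16, so no PeftModel/adapter step is needed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```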

Usage

This model is designed for direct use with the transformers library, supporting standard causal language model inference workflows. Its DPO training makes it particularly effective for generating structured and high-quality text outputs.
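A minimal end-to-end inference sketch using the tokenizer's chat template; the prompt and sampling parameters are illustrative, not values recommended by the card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kenzrx/dpo-ori-qwen-cot-merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Example prompt targeting the model's structured-output strengths.
messages = [
    {
        "role": "user",
        "content": "Extract the order below as JSON with keys item, qty, "
                   "and price: 'Two lattes at $4.50 each.'",
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,   # sampling settings are illustrative
        top_p=0.9,
    )

# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```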