sfutenma/dpo-qwen3_4b-cot-merged_v260301-220140

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Mar 1, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

The sfutenma/dpo-qwen3_4b-cot-merged_v260301-220140 model is a 4 billion parameter language model, fine-tuned from sfutenma/lora_structeval_t_qwen3_4b_v260228-172650 using Direct Preference Optimization (DPO) with the Unsloth library. This model is specifically optimized for improving reasoning capabilities through Chain-of-Thought (CoT) and generating high-quality structured responses. It is designed for applications requiring aligned and coherent outputs based on preferred data, supporting a 32768 token context length.

Loading preview...

Model Overview

This model, sfutenma/dpo-qwen3_4b-cot-merged_v260301-220140, is a 4 billion parameter language model derived from sfutenma/lora_structeval_t_qwen3_4b_v260228-172650. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, specifically targeting enhanced reasoning and structured response generation.

Key Capabilities

  • Improved Reasoning: Optimized for Chain-of-Thought (CoT) reasoning, aligning responses with preferred outputs.
  • Structured Response Quality: Enhanced ability to produce high-quality, structured answers based on preference datasets.
  • DPO Fine-tuning: Leverages DPO for better alignment and coherence in generated text.
  • Merged Weights: Provides full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment with transformers.

Training Details

The model was trained for 5 epochs with a learning rate of 2e-05 and a beta value of 0.03. It utilized a maximum sequence length of 768 during training and incorporated LoRA with r=8 and alpha=16, which has been merged into the base model. The training data used was u-10bei/dpo-dataset-qwen-cot.

Good For

  • Applications requiring models with strong reasoning capabilities.
  • Generating structured and aligned text outputs.
  • Use cases where direct preference optimization leads to desired response quality.

License

The model is released under the MIT License, consistent with its training dataset. Users must also adhere to the original base model's license terms.