TSerizawa/llm-lecture-2025_sft-dpo-qwen-cot-merged-model

Text generation · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Feb 4, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

TSerizawa/llm-lecture-2025_sft-dpo-qwen-cot-merged-model is a 4-billion-parameter, Qwen3-based, instruction-tuned language model fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library. The tuning specifically targets stronger Chain-of-Thought (CoT) reasoning and higher-quality structured responses. The model has a 40,960-token context length and ships fully merged for direct use without adapter loading, making it suitable for applications that require improved logical coherence and structured output.


Model Overview

The TSerizawa/llm-lecture-2025_sft-dpo-qwen-cot-merged-model is a 4-billion-parameter language model built on the Qwen/Qwen3-4B-Instruct-2507 base. It was fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with the trained LoRA adapter weights merged directly into the base model. As a result, the model can be loaded with transformers immediately, without a separate adapter-loading step.
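A minimal loading-and-generation sketch with transformers, assuming a BF16-capable device; the generation settings are illustrative defaults, not recommendations from this card:

```python
# Sketch: load the merged model directly -- no PEFT/adapter step needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TSerizawa/llm-lecture-2025_sft-dpo-qwen-cot-merged-model"

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # the card lists BF16 weights
        device_map="auto",
    )
    # The Qwen3 chat template inserts role tokens and the generation prompt.
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
```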

Key Capabilities

  • Enhanced Reasoning: Optimized specifically to improve Chain-of-Thought (CoT) reasoning, leading to more logical and coherent responses.
  • Structured Output Quality: Focuses on generating higher quality structured responses, aligning with preferred output formats.
  • Direct Use: Shipped as a fully merged 16-bit (BF16) model, simplifying deployment and inference.

Training Details

The model underwent one epoch of DPO training with a learning rate of 1e-7, a beta of 0.1, and a maximum sequence length of 1024 tokens. The preference pairs used for alignment come from the u-10bei/dpo-dataset-qwen-cot dataset.
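For intuition about the beta of 0.1: the DPO objective penalizes the policy when its log-probability margin for the chosen response over the rejected one does not exceed the reference model's margin, with beta scaling how sharply that margin is rewarded. A self-contained sketch of the per-pair loss (the log-probabilities below are illustrative, not values from this model's training run):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - logp_rejected) - (ref_logp_chosen - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers the chosen response more than the reference does -> lower loss.
better = dpo_loss(-10.0, -14.0, -11.0, -12.0)   # policy margin 4, reference margin 1
neutral = dpo_loss(-11.0, -12.0, -11.0, -12.0)  # equal margins -> loss = log(2)
```

With equal margins the loss sits at log(2) ≈ 0.693; widening the policy's preference for the chosen response beyond the reference's pushes it toward zero.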

Good For

This model is particularly well-suited for applications where improved reasoning, logical consistency, and structured output are critical. Its DPO fine-tuning makes it a strong candidate for tasks requiring nuanced understanding and generation of complex responses, especially those benefiting from Chain-of-Thought prompting.
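A small sketch of what Chain-of-Thought prompting looks like in the chat-message format this model consumes: an explicit request for intermediate steps in the user turn is a common way to elicit the CoT behavior the DPO tuning targets. The wording and the `cot_messages` helper are illustrative, not a prompt prescribed by the card:

```python
def cot_messages(question: str) -> list[dict]:
    """Build a chat-format message list that asks for step-by-step reasoning.

    The instruction wording is illustrative; any explicit request for
    intermediate reasoning steps serves the same purpose.
    """
    return [
        {
            "role": "user",
            "content": (
                f"{question}\n\n"
                "Think through the problem step by step, then state the "
                "final answer on its own line."
            ),
        }
    ]

messages = cot_messages(
    "A train travels 120 km in 1.5 hours. What is its average speed?"
)
```

The resulting list can be passed straight to the tokenizer's `apply_chat_template` for generation.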