koguma-ai/sft-dpo-qwen-cot-merged0207

Text Generation · Concurrency Cost: 1 · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: Feb 7, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

koguma-ai/sft-dpo-qwen-cot-merged0207 is a 4-billion-parameter, Qwen3-based, instruction-tuned causal language model developed by koguma-ai and fine-tuned with a two-stage SFT-then-DPO pipeline. It is optimized for structured output generation and Chain-of-Thought (CoT) reasoning, with a 40,960-token context length. The model learns to produce coherent, step-by-step responses from datasets built for structured data and preference alignment, and ships as fully merged 16-bit weights for direct use with the Hugging Face Transformers library.


Overview

koguma-ai/sft-dpo-qwen-cot-merged0207 is a 4-billion-parameter language model fine-tuned from Qwen3-4B-Instruct-2507. To improve its reasoning and output quality, koguma-ai applied a specialized two-stage training pipeline: Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO).

Key Capabilities

  • Structured Output Generation: The SFT stage specifically trains the model to produce outputs in a structured format.
  • Chain-of-Thought (CoT) Reasoning: Fine-tuned to generate step-by-step reasoning, improving the transparency and accuracy of its responses.
  • Preference Alignment: DPO training further refines the model's outputs based on preferred responses, leading to more aligned and high-quality generations.
  • Direct Usage: Provided as full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment with the transformers library.
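Because the weights are published fully merged, no adapter loading is needed: the checkpoint works with the standard Transformers `from_pretrained` pattern. The sketch below loads the model and runs a chat-style generation; the repo id comes from this card, while the prompt, sampling settings, and hardware mapping are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "koguma-ai/sft-dpo-qwen-cot-merged0207"

# Merged BF16 weights: plain from_pretrained, no PEFT/adapter step required.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",   # picks up the published BF16 weights
    device_map="auto",    # requires accelerate; or move the model manually
)

# The chat template converts a message list into the model's expected prompt format.
messages = [
    {
        "role": "user",
        "content": "Extract the fields from: 'Order #123, 2 units, $40'. "
                   "Think step by step.",
    }
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Greedy decoding (`do_sample=False`) is used here only to make the structured output reproducible; sampling parameters can be tuned per task.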

Training Details

The model's training involved:

  • SFT Stage: Utilized the u-10bei/structured_data_with_cot_dataset_512_v2 dataset with a LoRA configuration (r=64, alpha=128) and an assistant-only loss strategy with CoT masking.
  • DPO Stage: Applied a new LoRA adapter (r=8, alpha=16) and trained on the u-10bei/dpo-dataset-qwen-cot dataset to align with preferred outputs.
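The two LoRA configurations described above could be expressed with the PEFT library roughly as follows. This is a sketch only: the rank and alpha values come from this card, but the target modules and everything else are assumptions the card does not state.

```python
from peft import LoraConfig

# SFT stage: r=64, alpha=128, as stated in the training details.
sft_lora = LoraConfig(
    r=64,
    lora_alpha=128,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not from the card
)

# DPO stage: a fresh, smaller adapter (r=8, alpha=16) trained on top of the
# merged SFT weights, then merged again into the final checkpoint.
dpo_lora = LoraConfig(
    r=8,
    lora_alpha=16,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not from the card
)
```

Using a new, lower-rank adapter for DPO is a common pattern: the preference stage makes a smaller correction to an already fine-tuned model, so it needs less capacity than the SFT stage.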

Good For

  • Applications requiring structured data extraction or generation.
  • Tasks benefiting from explicit reasoning steps (Chain-of-Thought).
  • Scenarios where high-quality, preference-aligned responses are crucial.