Hi-Satoh/sft-base4-dpo-e2-qwen-cot-merged

Hugging Face
Text Generation · Concurrency Cost: 1 · Model Size: 4B · Quant: BF16 · Context Length: 32k · Published: Feb 7, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

Hi-Satoh/sft-base4-dpo-e2-qwen-cot-merged is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507. Developed by Hi-Satoh, it was refined with Direct Preference Optimization (DPO) via Unsloth to enhance its reasoning and alignment, building on domain-specific knowledge acquired during its Supervised Fine-Tuning (SFT) phase.


Overview

Hi-Satoh/sft-base4-dpo-e2-qwen-cot-merged is a 4-billion-parameter language model fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. Training proceeded in stages: Supervised Fine-Tuning (SFT) first, then Direct Preference Optimization (DPO) using the Unsloth library. The primary objective of this fine-tuning was to improve the model's reasoning and alignment capabilities.

Key Capabilities

  • Enhanced Reasoning: The model's DPO phase specifically targeted improvements in reasoning, building on domain-specific knowledge from its SFT phase.
  • Optimized Alignment: DPO training contributes to better alignment with desired outputs and user preferences.
  • Full-Merged Weights: This repository provides the full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment.
  • Qwen3-4B-Instruct-2507 Base: Leverages the robust architecture and initial capabilities of the Qwen3-4B-Instruct-2507 model.
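Because the repository ships full-merged 16-bit weights rather than LoRA adapters, the model should load like any standard causal LM. A minimal sketch using the Hugging Face `transformers` library (the exact generation settings here are illustrative, not taken from the model card):

```python
# Minimal loading sketch: full-merged BF16 weights mean no PEFT/adapter
# loading step is needed -- plain AutoModelForCausalLM suffices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Hi-Satoh/sft-base4-dpo-e2-qwen-cot-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # weights are published in BF16
    device_map="auto",
)

# Qwen3 instruct models ship a chat template; apply it when prompting.
messages = [{"role": "user", "content": "Briefly explain what DPO is."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Since the context length is listed as 32k, long prompts can be passed directly without any adapter-merging or rope-scaling configuration.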

Training Details

The model's training involved:

  • SFT Phase: Initial training on the u-10bei/structured_data_with_cot_dataset_512_v5 dataset to acquire domain-specific knowledge.
  • DPO Phase: Applied Direct Preference Optimization for 2 epochs with a learning rate of 1e-07 and a beta of 0.1, using u-10bei/dpo-dataset-qwen-cot as the training data. The maximum sequence length for DPO was 2048 tokens.
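To make the DPO phase concrete, the sketch below implements the standard DPO loss for a single preference pair, using the beta = 0.1 reported above. The log-probability values are made-up inputs for illustration; in real training these come from the policy and a frozen reference model.

```python
# Illustrative sketch of the DPO objective (not the author's training code).
# For a chosen response y_w and rejected response y_l, the per-pair loss is:
#   L = -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))
import math

def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Direct Preference Optimization loss for one preference pair."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), split by sign for numerical stability
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Example: the policy favors the chosen response more than the reference
# does, so the margin is positive (0.1 * (2 - (-2)) = 0.4) and loss is low.
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))  # ~0.513
```

Minimizing this loss pushes the policy to assign relatively higher probability to preferred responses than the reference model does, while the small beta (0.1) and tiny learning rate (1e-07) keep the policy close to the SFT starting point.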

Good For

  • Applications requiring a 4B parameter model with improved reasoning and alignment.
  • Developers looking for a readily deployable Qwen3-4B variant with enhanced performance characteristics.
  • Use cases benefiting from a model fine-tuned with DPO for better preference alignment.