Hi-Satoh/adv_sft3J_dpo_merged

Text generation · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Feb 22, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

Hi-Satoh/adv_sft3J_dpo_merged is a 4-billion-parameter instruction-tuned causal language model developed by Hi-Satoh and fine-tuned from Qwen/Qwen3-4B-Instruct-2507. It uses Direct Preference Optimization (DPO) to strengthen reasoning, particularly Chain-of-Thought (CoT) processes, and to improve the quality of structured responses. Because it is trained to align with the preferred outputs in its preference data, it suits tasks that demand clear logical flow and well-structured answers.


Model Overview

This model, Hi-Satoh/adv_sft3J_dpo_merged, is a 4-billion-parameter language model developed by Hi-Satoh. It is a fine-tuned version of the Qwen/Qwen3-4B-Instruct-2507 base model, optimized with Direct Preference Optimization (DPO) via the Unsloth library. The repository provides the fully merged 16-bit (BF16) weights, so no adapter loading is required.
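
Because the merged BF16 weights are shipped directly, the model can be loaded with the standard Hugging Face transformers API. The following is a minimal sketch; the prompt and generation settings are illustrative choices, not values published in this card.

```python
# Minimal loading/generation sketch using the standard transformers API.
# The prompt and max_new_tokens are illustrative, not values from the card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Hi-Satoh/adv_sft3J_dpo_merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the repository ships merged BF16 weights
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain, step by step, why the sky is blue."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```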

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, leading to more logical and step-by-step responses.
  • Improved Structured Output: Produces higher-quality structured responses by aligning generations with the preferred outputs in the training dataset.
  • DPO Fine-tuning: Trained with DPO using a beta of 0.05, a learning rate of 1e-6, 2 epochs, and a maximum sequence length of 4096 (see the configuration sketch after this list).
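
The sketch below shows how the reported hyperparameters map onto a DPO training run using Hugging Face TRL. The author fine-tuned via Unsloth, so the exact code differs; only the hyperparameters (beta, learning rate, epochs, maximum sequence length) come from this card, and the toy dataset and output directory are hypothetical.

```python
# Hedged sketch of the reported DPO setup using Hugging Face TRL (>= 0.12 API).
# Only beta, learning_rate, num_train_epochs, and max_length come from the card;
# the dataset and output directory are hypothetical stand-ins.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "Qwen/Qwen3-4B-Instruct-2507"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Hypothetical toy preference pairs; the actual training data is not published here.
dataset = Dataset.from_dict({
    "prompt": ["What is 2 + 2?"],
    "chosen": ["Step 1: add the two numbers. 2 + 2 = 4."],
    "rejected": ["5."],
})

config = DPOConfig(
    output_dir="adv_sft3J_dpo",   # hypothetical output directory
    beta=0.05,                    # preference temperature reported in the card
    learning_rate=1e-6,
    num_train_epochs=2,
    max_length=4096,              # maximum sequence length reported in the card
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,   # tokenizer argument name in recent TRL versions
)
trainer.train()
```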

Good For

  • Applications requiring models with strong reasoning abilities.
  • Tasks where structured and coherent output is crucial.
  • Developers looking for a Qwen3-4B variant with enhanced alignment and response quality through DPO.

Licensing

The model is released under the MIT License, following the terms of its training dataset. Users must also comply with the license of the base model, Qwen/Qwen3-4B-Instruct-2507 (Apache 2.0).