Hi-Satoh/adv_MoE_ALF_sft3_merged

Hosted on Hugging Face · Text Generation

  • Model Size: 4B
  • Quantization: BF16
  • Context Length: 32k
  • Published: Feb 24, 2026
  • License: apache-2.0
  • Architecture: Transformer (open weights)
  • Concurrency Cost: 1

Hi-Satoh/adv_MoE_ALF_sft3_merged is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) via Unsloth. It is optimized to strengthen reasoning, particularly Chain-of-Thought, and to improve the quality of structured responses, making it suited to applications that require outputs aligned with preferred response patterns.


Overview

This model, Hi-Satoh/adv_MoE_ALF_sft3_merged, is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It was fine-tuned with Direct Preference Optimization (DPO), implemented with the Unsloth library, to align its outputs with preferred response patterns.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, making it suitable for tasks requiring logical progression and structured thinking.
  • Improved Response Quality: Focuses on generating higher-quality, more aligned structured responses based on preference datasets.
  • Full-Merged Weights: Provided as full-merged 16-bit weights, eliminating the need for adapter loading.
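Because the checkpoint ships as full-merged 16-bit weights, it can be loaded like any standalone causal LM. A minimal sketch, assuming the Hugging Face `transformers` library; the helper name and settings are illustrative, only the repo id comes from this card:

```python
# Minimal loading sketch for the full-merged checkpoint.
# No PEFT adapter attachment step is needed.
model_id = "Hi-Satoh/adv_MoE_ALF_sft3_merged"  # repo id from this card


def load_model(device_map: str = "auto"):
    """Load tokenizer and model directly from the merged weights."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # matches the card's BF16 precision
        device_map=device_map,
    )
    return tokenizer, model
```

Since the LoRA weights are already merged, there is no separate `peft` loading step; the model behaves as a plain Qwen3-4B-class checkpoint.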

Training Details

The model was trained for 2 epochs with a learning rate of 1e-6 and a DPO beta of 0.05, using a maximum sequence length of 4096 tokens. A LoRA adapter (r=8, alpha=16) was trained and then merged into the base model.
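For reference, the card's reported hyperparameters can be collected in one place. The values below are taken from this card; the key names follow common TRL/Unsloth argument conventions and are illustrative, not a reproduction of the actual training script:

```python
# Reported DPO training hyperparameters (values from this card;
# key names are illustrative, modeled on TRL/Unsloth conventions).
dpo_hparams = {
    "num_train_epochs": 2,
    "learning_rate": 1e-6,
    "beta": 0.05,            # DPO preference-strength coefficient
    "max_seq_length": 4096,  # maximum tokens per training sequence
}

lora_config = {
    "r": 8,            # LoRA rank
    "lora_alpha": 16,  # LoRA scaling numerator
}

# Effective LoRA scaling applied before the merge: alpha / r
lora_scaling = lora_config["lora_alpha"] / lora_config["r"]  # 2.0
```

A low learning rate and small beta are typical for DPO fine-tunes that aim to nudge, rather than overwrite, the base model's behavior.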

Good For

  • Applications requiring models with improved reasoning abilities.
  • Scenarios where structured and aligned responses are critical.
  • Developers looking for a Qwen3-4B variant with DPO-enhanced performance.