Name: Hi-Satoh/adv_MoE_sft3_dpo_merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Hi-Satoh

Overview

Hi-Satoh/adv_MoE_sft3_dpo_merged is a 4 billion parameter language model, fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. This model distinguishes itself through its application of Direct Preference Optimization (DPO), implemented using the Unsloth library, to align its responses with preferred outputs. It is provided with full-merged 16-bit weights, meaning no adapter loading is required for deployment.

Key Capabilities

Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning processes.
Structured Response Quality: Focuses on delivering higher quality, more structured outputs.
Preference Alignment: Fine-tuned using DPO to align model behavior with specific preference datasets.

Training Details

The model underwent 4 epochs of DPO training with a learning rate of 1e-05 and a beta value of 0.2. The maximum sequence length used during training was 4096 tokens. The LoRA configuration (r=8, alpha=16) was merged into the base model, resulting in the provided full-merged weights.

Good for

Applications requiring models with improved reasoning capabilities.
Tasks where structured and coherent responses are critical.
Scenarios benefiting from a model fine-tuned for specific output preferences.

Overview

Overview

Key Capabilities

Training Details

Good for

Full Model Card (README)