Name: Hi-Satoh/adv_sft_dpo_final_4_merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Hi-Satoh

Model Overview

Hi-Satoh/adv_sft_dpo_final_4_merged is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO), leveraging the Unsloth library to align its responses with preferred outputs. This repository provides the full-merged 16-bit weights, eliminating the need for adapter loading.

Key Capabilities

Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning processes.
Structured Response Quality: Focuses on generating higher quality and more structured outputs.
DPO Alignment: Benefits from Direct Preference Optimization for better alignment with desired response characteristics.

Training Details

The model underwent 1 epoch of DPO training with a learning rate of 7e-08 and a beta value of 0.5. It utilized a maximum sequence length of 4096 tokens. The LoRA configuration (r=8, alpha=16) was merged into the base model during the fine-tuning process.

Good For

Applications requiring improved reasoning abilities.
Scenarios where structured and high-quality responses are critical.
Use cases benefiting from models aligned through preference-based learning.

Overview

Model Overview

Key Capabilities

Training Details

Good For

Full Model Card (README)