Name: Hi-Satoh/adv_sft_dpo_w_merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Hi-Satoh

Model Overview

Hi-Satoh/adv_sft_dpo_w_merged is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its full-merged 16-bit weights provided for direct use without adapter loading.

Key Capabilities

Enhanced Reasoning: Optimized through DPO to improve Chain-of-Thought reasoning.
Structured Response Quality: Focuses on generating higher quality, structured outputs based on preference datasets.
Efficient Fine-tuning: Utilizes DPO with a specific configuration (1 epoch, 5e-07 learning rate, beta 0.5, max sequence length 4096) to achieve its alignment goals.

Training Details

Base Model: Qwen/Qwen3-4B-Instruct-2507
Methodology: Direct Preference Optimization (DPO)
Training Data: Utilized the Hi-Satoh/test_dpo_dataset for preference alignment.

Usage Considerations

This model is designed for tasks where improved reasoning and structured, preferred responses are critical. Users should be aware that the model's license follows the MIT License, and compliance with the original base model's license terms is required.

Overview

Model Overview

Key Capabilities

Training Details

Usage Considerations

Full Model Card (README)