Name: Hi-Satoh/adv_sft_dpo_final_1_merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Hi-Satoh

Overview

Hi-Satoh/adv_sft_dpo_final_1_merged is a 4 billion parameter language model, fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. It leverages Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs, focusing on quality improvements.

Key Capabilities

Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning abilities.
Structured Response Quality: Designed to produce more coherent and structured outputs.
DPO Fine-tuning: Utilizes DPO with a specific preference dataset (Hi-Satoh/test_dpo_dataset) for better alignment.
Full-merged Weights: Contains full-merged 16-bit weights, eliminating the need for adapter loading.

Training Configuration Highlights

Method: Direct Preference Optimization (DPO)
Epochs: 1
Learning Rate: 5e-07
Max Sequence Length: 4096 tokens

Usage Considerations

This model is suitable for applications where improved reasoning and structured, aligned responses are critical. Users should be aware that the model's license follows the MIT License, as per the training data, and compliance with the original base model's license terms is required.