Name: Hi-Satoh/adv_sft5_dpo3_merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Hi-Satoh

Model Overview

Hi-Satoh/adv_sft5_dpo3_merged is a 4 billion parameter language model, fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. This model leverages Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs. It is provided as full-merged 16-bit weights, eliminating the need for adapter loading.

Key Capabilities

Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning processes.
Structured Response Quality: Focuses on generating higher quality and more structured outputs.
DPO Alignment: Benefits from DPO training to align model behavior with specific preferences.

Training Details

The model was trained for 2 epochs with a learning rate of 1e-06 and a beta value of 0.05. The maximum sequence length used during training was 4096 tokens. The LoRA configuration (r=8, alpha=16) was merged into the base model. The training data used was [Hi-Satoh/test_dpo_dataset].

Usage Considerations

This model is suitable for tasks where improved reasoning and structured, aligned responses are critical. Users should adhere to the MIT License, as per the dataset terms, and comply with the original base model's license terms.

Overview

Model Overview

Key Capabilities

Training Details

Usage Considerations

Full Model Card (README)