Name: Hi-Satoh/adv_sft_dpo_final_7_merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Hi-Satoh

Model Overview

Hi-Satoh/adv_sft_dpo_final_7_merged is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged into the base model.

Key Optimizations

This model's primary objective during training was to enhance its ability to generate reasoned responses (Chain-of-Thought) and produce high-quality structured outputs. This was achieved by aligning the model's behavior with preferred examples through DPO, utilizing a specific preference dataset.

Training Details

Base Model: Qwen/Qwen3-4B-Instruct-2507
Methodology: Direct Preference Optimization (DPO)
Epochs: 1
Learning Rate: 1e-06
Beta: 0.1
Maximum Sequence Length: 4096 tokens
LoRA Configuration: r=8, alpha=16 (merged)

Intended Use Cases

This model is particularly well-suited for applications where improved reasoning, coherent thought processes, and structured output generation are critical. Its DPO-based fine-tuning aims to provide more aligned and preferred responses compared to its base model, making it valuable for tasks requiring nuanced and well-organized text generation.

Overview

Model Overview

Key Optimizations

Training Details

Intended Use Cases

Full Model Card (README)