Name: Hi-Satoh/adv_sft_dpo_final_13_merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Hi-Satoh

Model Overview

This model, developed by Hi-Satoh, is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It was fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library, and the fine-tuned adapter was then merged into the base model to produce full 16-bit weights.

Key Capabilities

Enhanced Reasoning: Tuned to improve Chain-of-Thought reasoning, with the aim of more logical, step-by-step responses.
Structured Output Quality: Trained on preference data to favor higher-quality, better-structured responses.
DPO Alignment: Uses DPO to steer model outputs toward the preferred examples in the training data and away from the rejected ones.
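
For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-example DPO loss, not the author's training code. It assumes the usual formulation in which the loss depends on sequence log-probabilities of a chosen and a rejected response under the trained policy and a frozen reference model. The function name dpo_loss and the example log-probabilities are illustrative; the default beta of 0.1 matches the value listed under Training Details.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Inputs are summed token log-probabilities of each full response.
    """
    # How much more (or less) likely each response became relative to the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Hypothetical log-probabilities, chosen only to show the direction of the loss.
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # policy equals reference
loss_improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)   # chosen up, rejected down
```

When the policy equals the reference model the loss is log(2), and it falls as the policy raises the chosen response's likelihood relative to the rejected one. A small beta such as 0.1 makes the loss respond gently to that margin, which keeps the policy close to the reference.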

Training Details

The model was trained with DPO for 1 epoch at a learning rate of 1e-07 and a beta value of 0.1, with a maximum sequence length of 4096 tokens. Training used a LoRA configuration (r=8, alpha=16), and the resulting adapter was subsequently merged into the base weights. The training dataset is Hi-Satoh/test_dpo_dataset.
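
To illustrate what merging a LoRA adapter means, here is a small pure-Python sketch under the conventional LoRA assumption that the merged weight is W + (alpha / r) * B @ A. With the listed r=8 and alpha=16 the scaling factor is 2.0. The helper names (matmul, merge_lora) and the toy 2x2 matrices are hypothetical and are not taken from the author's pipeline or from Unsloth.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora(weight, lora_a, lora_b, r=8, alpha=16):
    """Fold a LoRA update into a base weight: W + (alpha / r) * B @ A.

    Shapes: weight is (out, in), lora_a is (r, in), lora_b is (out, r).
    """
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy example with rank 8 and a 2x2 base weight (values are made up).
base = [[1.0, 0.0],
        [0.0, 1.0]]
lora_a = [[1.0, 0.0]] + [[0.0, 0.0] for _ in range(7)]  # shape (8, 2)
lora_b = [[0.5] + [0.0] * 7,
          [0.0] * 8]                                     # shape (2, 8)

merged = merge_lora(base, lora_a, lora_b)
```

After merging, the adapter matrices are no longer needed at inference time: the model is a single set of weights with the same shape as the base model.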

Licensing

This model is released under the MIT License, consistent with the dataset terms. Users must also adhere to the original base model's license terms.
