Name: Hi-Satoh/adv_sft_dpo_final_9_merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Hi-Satoh

Model Overview

Hi-Satoh/adv_sft_dpo_final_9_merged is a 4-billion-parameter language model, fine-tuned by Hi-Satoh from Qwen/Qwen3-4B-Instruct-2507. It was trained with Direct Preference Optimization (DPO), implemented via the Unsloth library, to align its responses with preferred outputs.
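
As a rough illustration of what DPO optimizes, the sketch below computes the standard DPO loss for a single preference pair from summed log-probabilities. It is not the author's training code: the function name and the example numbers are hypothetical, and the default beta of 0.1 is taken from the value reported under Training Details.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(x)), written as softplus(-x) for numerical stability.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Hypothetical numbers: the policy has shifted probability toward the chosen
# response relative to the reference, so the loss falls below log(2).
loss = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

When the policy and the reference assign identical log-probabilities, the loss equals log(2); it decreases as the policy prefers the chosen response more strongly than the reference does.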

Key Optimizations

The DPO fine-tuning targeted two areas:

Reasoning (Chain-of-Thought): The model has been optimized to produce more coherent, logical step-by-step reasoning.
Structured Response Quality: It aims to generate higher-quality, well-organized responses, particularly when structured output is requested.

Training Details

The model was trained for 1 epoch of DPO with a learning rate of 1e-07 and a beta value of 0.1. The maximum sequence length during training was 4096 tokens. The LoRA adapter (r=8, alpha=16) was merged into the base model, so the release contains fully merged 16-bit weights and requires no separate adapter loading.
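
To make the merge step concrete, the following sketch folds a low-rank update into a weight matrix using plain Python lists. It assumes the conventional LoRA scaling of alpha / r (2.0 for r=8, alpha=16); the source does not state which scaling variant was used, and the helper names and the toy matrices are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(weight, lora_a, lora_b, r=8, alpha=16):
    """Return weight + (alpha / r) * (lora_b @ lora_a).

    weight: d_out x d_in, lora_b: d_out x r, lora_a: r x d_in.
    Assumes the conventional alpha / r scaling.
    """
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 layer with rank-8 adapter factors (illustrative values only).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[0.5] * 8, [0.0] * 8]           # 2 x 8
lora_a = [[0.25, 0.0] for _ in range(8)]  # 8 x 2
merged = merge_lora(weight, lora_a, lora_b)
```

After merging, inference uses only the combined matrix, which is why a merged release can be loaded like any ordinary checkpoint without the adapter factors.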

Licensing

This model is released under the MIT License, consistent with the terms of its training dataset. Users must also adhere to the original base model's license terms.
