Name: Hi-Satoh/adv_sft_dpo_final_11_merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Hi-Satoh

Model Overview

Hi-Satoh/adv_sft_dpo_final_11_merged is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its LoRA configuration (r=8, alpha=16) fully merged into the base model.

Key Capabilities

Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning, enabling more structured and logical response generation.
Improved Response Quality: DPO training aligns the model's outputs with preferred responses, leading to higher quality and more coherent interactions.
Full Merged Weights: The repository provides full-merged 16-bit weights, simplifying deployment as no adapter loading is required.

Training Details

The model underwent 1 epoch of DPO training with a learning rate of 3e-07 and a beta value of 0.1. It utilized a maximum sequence length of 4096 during training. The training data used was [Hi-Satoh/test_dpo_dataset].

Licensing

This model operates under the MIT License, as per the dataset terms. Users must also adhere to the original base model's license terms.

Overview

Model Overview

Key Capabilities

Training Details

Licensing

Full Model Card (README)