Name: Hi-Satoh/adv_sft_dpo_final_12_merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Hi-Satoh

Overview

Hi-Satoh/adv_sft_dpo_final_12_merged is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It was fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library to improve response alignment and quality.

Key Capabilities

Enhanced Reasoning: Tuned to improve Chain-of-Thought reasoning.
Structured Response Quality: Aims to generate better-structured outputs that align with the preferred responses.
DPO Fine-tuning: Leverages DPO with a beta of 0.1 and a learning rate of 2e-07 over 1 epoch.
Full-Merged Weights: This repository provides the fully merged 16-bit weights, so no adapter loading is needed.
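
As a rough illustration of what the beta value controls, the standard DPO objective for a single preference pair can be sketched as below. This is the textbook formulation, not the training code used for this model; the function name and the sample log-probabilities are hypothetical.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities under the policy being
    trained and under the frozen reference model. beta=0.1 matches the
    value stated in this listing.
    """
    # How much more (or less) likely each response became relative to the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities, for illustration only.
loss_when_chosen_favoured = dpo_loss(-10.0, -12.0, -11.0, -11.0)
loss_when_rejected_favoured = dpo_loss(-12.0, -10.0, -11.0, -11.0)
print(loss_when_chosen_favoured < loss_when_rejected_favoured)
```

A smaller beta flattens the margin, so the policy is penalized less sharply for staying close to the reference model.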

Training Details

The model was trained with a maximum sequence length of 4096 tokens. The LoRA adapter used during training (r=8, alpha=16) has been merged into the base model. The training dataset is Hi-Satoh/test_dpo_dataset.
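
In the standard LoRA formulation, merging adds the low-rank update, scaled by alpha / r (16 / 8 = 2 for the values above), to each adapted weight matrix. The sketch below assumes that standard scaling and uses hypothetical toy matrices; it is not the Unsloth merge routine, and the helper names are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def merge_lora(weight, lora_a, lora_b, r=8, alpha=16):
    """Return weight + (alpha / r) * (lora_b @ lora_a).

    Shapes: weight is (out, in), lora_b is (out, r), lora_a is (r, in).
    """
    scale = alpha / r  # 2.0 for r=8, alpha=16
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy 2x2 weight with a rank-8 adapter (values are illustrative only).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 0.0] for _ in range(8)]   # shape (8, 2)
lora_b = [[1.0] * 8, [0.0] * 8]           # shape (2, 8)
merged = merge_lora(weight, lora_a, lora_b)
print(merged)
```

Once the update is folded into the weights this way, inference needs only the merged matrices, which is why no separate adapter has to be loaded.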

Licensing

This model is released under the MIT License, consistent with the terms of its training dataset. Users must also adhere to the original base model's license terms.
