Name: Hi-Satoh/adv_sft_dpo_final_8_merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Hi-Satoh

Model Overview

Hi-Satoh/adv_sft_dpo_final_8_merged is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) with the Unsloth library, integrating the full-merged 16-bit weights directly, eliminating the need for adapter loading.

Key Optimizations

This model's primary objective was to enhance its ability to produce preferred outputs, specifically focusing on:

Improved Reasoning: Optimized for Chain-of-Thought (CoT) capabilities.
Structured Response Quality: Enhanced generation of well-structured and aligned text based on preference datasets.

Training Details

The DPO training involved:

Base Model: Qwen/Qwen3-4B-Instruct-2507
Method: Direct Preference Optimization (DPO)
Epochs: 1
Learning Rate: 5e-07
Beta: 0.1
Max Sequence Length: 4096
LoRA Configuration: r=8, alpha=16 (weights merged into the base model)

Usage and Licensing

The model can be loaded using the transformers library with torch.float16 for efficient inference. It was trained on the Hi-Satoh/test_dpo_dataset and is released under the MIT License, with users also required to comply with the original base model's license terms.

Overview

Model Overview

Key Optimizations

Training Details

Usage and Licensing

Full Model Card (README)