Name: Hi-Satoh/adv_sft_dpo_final_10_merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Hi-Satoh

Model Overview

Hi-Satoh/adv_sft_dpo_final_10_merged is a 4 billion parameter language model developed by Hi-Satoh. It is a fine-tuned version of the Qwen/Qwen3-4B-Instruct-2507 base model, enhanced through Direct Preference Optimization (DPO) using the Unsloth library. This model provides full-merged 16-bit weights, eliminating the need for adapter loading.

Key Capabilities

Improved Reasoning: Optimized to enhance Chain-of-Thought reasoning abilities.
Structured Response Quality: Focuses on generating higher quality, more structured outputs.
Preference Alignment: Aligned with preferred outputs based on a specific preference dataset.

Training Details

The model was trained for 1 epoch with a learning rate of 7e-07 and a beta value of 0.1. The maximum sequence length used during training was 4096 tokens. The LoRA configuration (r=8, alpha=16) was merged into the base model.

Usage Considerations

This model is licensed under the MIT License, as per its training data. Users must also adhere to the original base model's license terms. The training data used for DPO is sourced from Hi-Satoh/test_dpo_dataset.

Good for

Applications requiring enhanced reasoning capabilities.
Generating structured and high-quality text responses.
Use cases where alignment with specific output preferences is crucial.

Overview

Model Overview

Key Capabilities

Training Details

Usage Considerations

Good for

Full Model Card (README)