HillaryMori/qwen3-sft-dpo-combined_exp1 is a fine-tuned version of the Qwen/Qwen3-4B-Instruct-2507 model, optimized using Direct Preference Optimization (DPO) via Unsloth. The fine-tuning targets stronger Chain-of-Thought reasoning and higher-quality structured responses. The model ships as fully merged 16-bit weights, so no adapter loading is required, and it is an experimental result from an LLM fine-tuning competition.
Overview
HillaryMori/qwen3-sft-dpo-combined_exp1 is an experimental fine-tuned language model based on Qwen/Qwen3-4B-Instruct-2507. It was trained with Direct Preference Optimization (DPO) using the Unsloth library. The model is distributed as fully merged 16-bit weights, which simplifies deployment by eliminating the need for adapter loading.
Key Capabilities
- Improved Reasoning: Optimized to enhance Chain-of-Thought reasoning abilities.
- Structured Responses: Focuses on generating higher-quality structured outputs.
- DPO Fine-tuning: Utilizes Direct Preference Optimization for alignment with preferred response patterns.
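Because the weights are fully merged, the model can be loaded like any standard causal LM. Below is a minimal inference sketch using the Hugging Face `transformers` library; the generation settings and example prompt are illustrative, not recommendations from this card.

```python
# Minimal inference sketch; assumes `transformers` and `torch` are installed.
# The repo ships fully merged 16-bit weights, so no adapter loading is needed.

MODEL_ID = "HillaryMori/qwen3-sft-dpo-combined_exp1"

def build_messages(question: str) -> list:
    # Chat-style input; the tokenizer's chat template handles the formatting.
    return [{"role": "user", "content": question}]

def generate(question: str, max_new_tokens: int = 512) -> str:
    # Imported lazily so the helper above is usable without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer.apply_chat_template(
        build_messages(question), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

# Example:
# print(generate("Solve step by step: what is 17 * 24?"))
```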
Training Details
The model was trained for 0.5 epochs with a learning rate of 1e-7, a DPO beta of 0.5, and a maximum sequence length of 1024. The LoRA configuration (r=8, alpha=16) was merged into the base model after training. The DPO preference data came from the u-10bei/dpo-dataset-qwen-cot dataset.
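For context, the DPO objective that the beta hyperparameter above feeds into can be sketched in plain Python. The log-probability values in the example are made up for illustration; only beta = 0.5 comes from the training setup described here.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.5):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen margin - rejected margin))).

    Each margin is the policy's log-probability of a response minus the
    reference model's log-probability of the same response; beta scales
    how strongly the policy is pushed away from the reference.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Illustrative numbers: the policy favors the chosen response more than the
# reference does (margin +1.0) and favors the rejected one less (margin -0.5),
# so the loss is small; if those preferences weaken, the loss grows.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.5)
```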
Usage Considerations
This repository contains experimental results from an LLM fine-tuning competition and should be treated accordingly. The model is released under the MIT License, and users must also comply with the license terms of the base model, Qwen/Qwen3-4B-Instruct-2507.