Name: TaHiTaHiTa/dpo-qwen-cot-merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: TaHiTaHiTa

Model Overview

TaHiTaHiTa/dpo-qwen-cot-merged is a 4 billion parameter language model built upon the Qwen/Qwen3-4B-Instruct-2507 base model. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged, eliminating the need for adapter loading.

Key Capabilities

Enhanced Reasoning: Optimized through DPO to improve Chain-of-Thought (CoT) reasoning, leading to more logical and coherent outputs.
Structured Response Quality: Aligned with preferred outputs to produce higher quality, structured responses.
Efficient Fine-tuning: Utilizes DPO with a specific training configuration (1 epoch, 5e-06 learning rate, beta 0.1, max sequence length 768) on a specialized preference dataset.

Training Details

The model was trained on the u-10bei/dpo-dataset-qwen-cot dataset. The training focused on aligning the model's responses with human preferences, particularly for reasoning and structured output tasks. The LoRA configuration (r=8, alpha=16) was merged into the base model during the fine-tuning process.

Licensing

This model operates under the MIT License, as per the terms of its training dataset. Users are also required to comply with the original base model's license terms.

Overview

Model Overview

Key Capabilities

Training Details

Licensing

Full Model Card (README)