yumiyumi/dpo-qwen-cot-merged

Text Generation · 4B parameters · BF16 · 32k context length · Published: Feb 8, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

The yumiyumi/dpo-qwen-cot-merged model is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) with the Unsloth library. It is optimized to strengthen Chain-of-Thought (CoT) reasoning and improve structured response quality, making it suited to applications that require logical coherence and preference-aligned, well-structured output.


Model Overview

The yumiyumi/dpo-qwen-cot-merged model is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It was fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with the fine-tuned weights merged into the base model at 16-bit precision, so no separate adapter loading is required.

Key Capabilities & Optimization

  • Enhanced Reasoning: The model's primary optimization objective was to improve its reasoning abilities, particularly through Chain-of-Thought (CoT) processes.
  • Structured Response Quality: DPO training focused on aligning the model's outputs with preferred responses, leading to better structured and higher-quality generations.
  • Efficient Fine-tuning: Utilized the Unsloth library for DPO, with a single epoch of training and a low learning rate (1e-07), indicating a targeted and efficient optimization process.
  • Direct Usage: As a fully merged model, it can be used directly with the transformers library without additional configuration for LoRA adapters.
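Because the weights are fully merged, the model loads like any other causal LM checkpoint. A minimal sketch using the standard transformers text-generation API; the repo id comes from this card, while the prompt and generation settings are illustrative assumptions:

```python
# Minimal usage sketch, assuming the standard transformers API and the
# chat template inherited from the Qwen3 base model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yumiyumi/dpo-qwen-cot-merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# A reasoning-style prompt; the model is tuned for step-by-step answers.
messages = [{"role": "user", "content": "Solve step by step: what is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

No PEFT or LoRA configuration is needed at inference time, since the DPO updates are already baked into the checkpoint.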

Training Details

The model was trained on the u-10bei/dpo-dataset-qwen-cot dataset, specifically chosen for preference alignment in reasoning tasks. The training employed a maximum sequence length of 1024 tokens and a DPO beta value of 0.1. The base model's license terms apply, and the training data is under an MIT License.
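To make the beta hyperparameter concrete: DPO trains on pairs of chosen and rejected responses, and beta scales how strongly the policy's log-probability margin over the reference model is pushed through a sigmoid loss. A self-contained toy sketch of the per-example DPO loss (the function name and sample log-probabilities are illustrative, not from the training code):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * margin), where the margin is
    how much more the policy prefers the chosen response over the rejected one,
    relative to the frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# With no margin over the reference, the loss sits at log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0, beta=0.1)

# Raising the chosen response's log-probability lowers the loss.
improved = dpo_loss(-9.0, -12.0, -10.0, -12.0, beta=0.1)
print(baseline, improved)
```

A small beta like 0.1 keeps the policy close to the reference model, which fits the card's conservative setup (one epoch, 1e-07 learning rate): the goal is to nudge preferences, not to retrain the base model.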

Good For

  • Applications requiring improved logical reasoning and step-by-step thought processes.
  • Scenarios where structured and high-quality responses are critical.
  • Developers looking for a Qwen3-4B variant with enhanced preference alignment for specific output styles.