Name: duong942001/dpo-qwen-cot-merged1 API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: duong942001

Model Overview

This model, duong942001/dpo-qwen-cot-merged1, is a 4 billion parameter language model based on Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned by duong942001 using Direct Preference Optimization (DPO) via the Unsloth library. The fine-tuning process focused on aligning the model's responses with preferred outputs, specifically targeting improvements in reasoning (Chain-of-Thought) and the quality of structured responses.

Key Capabilities

Enhanced Reasoning: Optimized for Chain-of-Thought (CoT) reasoning, aiming for more logical and structured thought processes in its outputs.
Improved Response Quality: Fine-tuned to produce higher quality, preferred outputs, particularly for structured response generation.
Direct Preference Optimization (DPO): Utilizes DPO for alignment, leveraging a preference dataset to guide its learning.
Merged Weights: Provided as a full-merged 16-bit model, eliminating the need for adapter loading and simplifying deployment with transformers.

Training Details

The model underwent 1 epoch of DPO training with a learning rate of 1e-07 and a beta value of 0.1. It was trained with a maximum sequence length of 1024, using a LoRA configuration (r=8, alpha=16) that has since been merged into the base model. The training data used was [u-10bei/dpo-dataset-qwen-cot].

Good For

Applications requiring strong reasoning capabilities.
Generating structured and high-quality responses based on user preferences.
Developers looking for a ready-to-use, merged Qwen3-based model for inference without complex setup.

Overview

Model Overview

Key Capabilities

Training Details

Good For

Full Model Card (README)