poko75/dpo-qwen-cot-merged
poko75/dpo-qwen-cot-merged is a 4-billion-parameter, Qwen3-based, instruction-tuned causal language model, fine-tuned with Direct Preference Optimization (DPO) to enhance Chain-of-Thought reasoning and structured response quality. Derived from Qwen/Qwen3-4B-Instruct-2507, it is optimized for generating aligned, coherent outputs that reflect preference data. It offers a 40,960-token context length and suits tasks requiring improved logical flow and structured answers.
Model Overview
poko75/dpo-qwen-cot-merged is a 4-billion-parameter language model based on the Qwen3 architecture, fine-tuned from Qwen/Qwen3-4B-Instruct-2507. It was trained with Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs, with a focus on improving reasoning capabilities and structured response quality.
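For background, DPO fine-tunes the policy directly on preference pairs, without a separate reward model. A sketch of the standard DPO objective (notation from the original DPO formulation, not from this model card): given a prompt $x$, a preferred response $y_w$, and a rejected response $y_l$,

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l)}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

where $\pi_{\mathrm{ref}}$ is the frozen base model (here Qwen/Qwen3-4B-Instruct-2507) and $\beta$ controls how far the policy may drift from it.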
Key Capabilities
- Enhanced Reasoning: Optimized for Chain-of-Thought (CoT) reasoning, leading to more logical and step-by-step problem-solving.
- Improved Structured Responses: Fine-tuned to generate higher quality, more coherent, and structured outputs based on preference data.
- Full-Merged Weights: Provided as full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment.
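Because the weights are fully merged, the model can be loaded like any standard causal LM, with no PEFT or adapter step. A minimal inference sketch using the Hugging Face `transformers` API (the prompt and generation settings below are illustrative, not prescribed by the card):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "poko75/dpo-qwen-cot-merged"  # merged 16-bit weights: no adapter loading needed


def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Generate a response using the model's chat template."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    messages = [{"role": "user", "content": prompt}]
    # Build input ids with the generation prompt appended
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate("Explain step by step why 97 is prime."))
```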
Training Details
The model was trained for 1 epoch with a learning rate of 1e-7, a DPO beta of 0.1, and a maximum sequence length of 1024 tokens, using the u-10bei/dpo-dataset-qwen-cot dataset. The base model's license terms and the dataset's MIT License apply.
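The stated hyperparameters can be collected as follows. This is a hypothetical sketch for reproducing a similar run; the actual training used Unsloth, whose exact API and defaults may differ:

```python
# Hyperparameters as stated on this card; field names are illustrative
# and would map onto whatever trainer config (e.g. TRL's DPOConfig) is used.
dpo_hyperparams = {
    "num_train_epochs": 1,
    "learning_rate": 1e-7,       # very low LR, typical for DPO fine-tuning
    "beta": 0.1,                 # KL-regularization strength of the DPO loss
    "max_seq_length": 1024,      # maximum sequence length during training
    "dataset": "u-10bei/dpo-dataset-qwen-cot",
    "base_model": "Qwen/Qwen3-4B-Instruct-2507",
}
```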
Good For
- Applications requiring strong reasoning and logical deduction.
- Tasks where structured and high-quality responses are critical.
- Developers looking for a Qwen3-based model with enhanced alignment and CoT capabilities.