Overview
kennaka1112/dpo-qwen-cot-merged is a 4-billion-parameter model derived from Qwen/Qwen3-4B-Instruct-2507. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library, and its LoRA adapters were merged into the base model for direct use. Training focused on aligning the model's responses with preferred outputs, specifically targeting improvements in reasoning and structured response generation.
Key Capabilities
- Enhanced Reasoning: Optimized for Chain-of-Thought (CoT) reasoning, enabling more logical and step-by-step problem-solving.
- Improved Structured Responses: Designed to produce higher quality and more structured outputs based on preference datasets.
- Direct Use: Provided as fully merged 16-bit weights, eliminating the need to load LoRA adapters and simplifying deployment with transformers.
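Because the adapters are merged, the checkpoint loads like any other causal LM. A minimal sketch, assuming the repository name above resolves on the Hugging Face Hub; the system prompt and generation parameters are illustrative choices, not recommendations from the model authors:

```python
MODEL_ID = "kennaka1112/dpo-qwen-cot-merged"

def build_chat(question: str) -> list:
    # Standard chat-format messages; the system prompt here is an
    # illustrative assumption, not part of the model card.
    return [
        {"role": "system", "content": "You are a helpful assistant. Reason step by step."},
        {"role": "user", "content": question},
    ]

def generate_answer(question: str, max_new_tokens: int = 512) -> str:
    # Requires `transformers` and `torch`; imported inside the function so
    # the pure-Python helper above stays usable without them.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer.apply_chat_template(
        build_chat(question), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
```

For example, `generate_answer("What is 17 * 24?")` returns the model's step-by-step reply as a string.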
Training Details
The model underwent 1 epoch of DPO training with a learning rate of 1e-05 and a beta value of 0.2. A maximum sequence length of 1024 was used during training. The base model's license terms (MIT License) apply to this fine-tuned version.
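The stated recipe can be summarized in code. The hyperparameter values below are the ones from the card; the `to_dpo_example` helper and its field names are illustrative assumptions about how preference pairs are typically formatted for a DPO trainer (e.g. TRL's `DPOTrainer`, which Unsloth wraps), not the author's actual training script:

```python
# Hyperparameters stated in the model card.
DPO_CONFIG = {
    "num_train_epochs": 1,
    "learning_rate": 1e-5,
    "beta": 0.2,            # DPO temperature: higher beta keeps the policy closer to the reference model
    "max_seq_length": 1024, # maximum sequence length used during training
}

def to_dpo_example(prompt: str, chosen: str, rejected: str) -> dict:
    # DPO trains on preference pairs: for each prompt, a preferred
    # ("chosen") and a dispreferred ("rejected") completion. These field
    # names follow the common TRL convention (an assumption here).
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

A dataset of such dicts, together with these hyperparameters, is what a DPO trainer consumes to nudge the model toward the "chosen" responses.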
Good For
- Applications requiring strong logical reasoning and Chain-of-Thought capabilities.
- Scenarios where structured and high-quality responses are critical.
- Developers looking for a readily deployable Qwen3-based model with enhanced reasoning.