dormouse2/dpo-qwen-cot-merged: DPO-Optimized Qwen3-4B for Enhanced Reasoning
This model is a 4-billion-parameter variant of the Qwen3-4B-Instruct-2507 base model, fine-tuned by dormouse2 using Direct Preference Optimization (DPO) via the Unsloth library. The fine-tuning aligns the model's responses with preferred outputs, strengthening its Chain-of-Thought (CoT) reasoning and the overall quality of its structured responses.
Key Capabilities & Features
- Enhanced Reasoning: Specifically optimized for Chain-of-Thought (CoT) reasoning, making it suitable for complex problem-solving.
- Improved Response Quality: DPO fine-tuning aligns outputs with preferred examples, leading to more coherent and structured answers.
- Full-Merged Weights: The repository provides the full-merged 16-bit weights, so the model loads directly without any adapter step (see the loading sketch after this list).
- Efficient Training: Fine-tuned with DPO for a single epoch at a learning rate of 1e-07, a beta of 0.1, and a maximum sequence length of 1024 (a reproduction sketch appears at the end of this card).
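Because the weights are fully merged, the model loads like any standard Hugging Face checkpoint. Below is a minimal loading and inference sketch, assuming the transformers and accelerate libraries; the step-by-step prompt is an illustrative example, not taken from this card.

```python
# Minimal loading/inference sketch; assumes transformers and accelerate are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dormouse2/dpo-qwen-cot-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the full-merged 16-bit weights
    device_map="auto",           # requires accelerate
)

# Chat-style prompt nudging the model toward step-by-step (CoT) reasoning.
messages = [
    {"role": "user",
     "content": "A train covers 120 km in 1.5 hours. What is its average speed? Think step by step."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```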
Good For
- Applications requiring strong reasoning and logical deduction.
- Generating structured and high-quality responses based on preference data.
- Tasks where alignment with specific output styles is crucial.
The DPO training used the u-10bei/dpo-dataset-qwen-cot dataset. The model is released under the MIT License; users must also comply with the original base model's license terms.
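The card reports only the headline hyperparameters, so the following is a hypothetical reproduction sketch using the standard Unsloth + TRL DPO workflow. The learning rate (1e-07), beta (0.1), epoch count (1), sequence length (1024), base model, and dataset come from this card; the LoRA configuration, batch size, and output paths are illustrative assumptions.

```python
# Hypothetical DPO training sketch (Unsloth + TRL). Values marked "stated"
# come from this card; everything else is an assumption.
from unsloth import FastLanguageModel, PatchDPOTrainer
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

PatchDPOTrainer()  # patch TRL's DPOTrainer to use Unsloth's fast kernels

model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507",  # stated base model
    max_seq_length=1024,            # stated
    load_in_4bit=True,              # assumption: typical Unsloth memory setting
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                           # assumption: LoRA rank is not documented
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,                  # assumption
)

# Stated preference dataset; DPOTrainer expects prompt/chosen/rejected columns.
dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        beta=0.1,                       # stated
        learning_rate=1e-7,             # stated
        num_train_epochs=1,             # stated
        max_length=1024,                # stated
        per_device_train_batch_size=2,  # assumption
        output_dir="dpo-qwen-cot",      # assumption
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Merge the LoRA adapter into 16-bit weights, matching the release format.
model.save_pretrained_merged("dpo-qwen-cot-merged", tokenizer,
                             save_method="merged_16bit")
```

In DPO, beta controls how far the fine-tuned policy may drift from the reference model, so the stated beta of 0.1 combined with the very low learning rate suggests a deliberately conservative alignment run.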