Name: taka104/qwen3-4b-dpo-qwen-cot-merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: taka104

taka104/qwen3-4b-dpo-qwen-cot-merged Overview

This model is a fine-tuned variant of the Qwen/Qwen3-4B-Instruct-2507 base model, developed by taka104. It leverages Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs, focusing on improving reasoning and structured response quality.

Key Capabilities

Enhanced Reasoning: Optimized for Chain-of-Thought (CoT) reasoning, allowing for more logical and step-by-step problem-solving.
Improved Response Quality: DPO fine-tuning aims to produce higher quality and more aligned outputs based on preference datasets.
Direct Use: Provided as full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment with transformers.

Training Details

The model was trained for 1 epoch with a learning rate of 1e-07 and a beta value of 0.1, using a maximum sequence length of 1024. The training data utilized was u-10bei/dpo-dataset-qwen-cot. The model is released under the MIT License, with compliance required for the original base model's license terms.

Overview

taka104/qwen3-4b-dpo-qwen-cot-merged Overview

Key Capabilities

Training Details

Full Model Card (README)