Name: Takashi-0000/dpo-qwen-cot-merged0 API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Takashi-0000

Model Overview

This model, Takashi-0000/dpo-qwen-cot-merged0, is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged into the base model.

Key Capabilities

Enhanced Reasoning: Optimized specifically to improve Chain-of-Thought (CoT) reasoning, leading to more logical and structured responses.
Improved Response Quality: DPO training aligns the model's outputs with preferred examples, enhancing overall response coherence and quality.
Direct Use: As a full-merged model, it can be used directly with the transformers library without requiring adapter loading.

Training Details

The model was trained for 1 epoch with a learning rate of 1e-07 and a beta value of 0.1, using a maximum sequence length of 1024. The training data, u-10bei/dpo-dataset-qwen-cot, focused on preference alignment for reasoning and structured outputs. The model operates under an MIT License, with compliance required for the original base model's license terms.

Overview

Model Overview

Key Capabilities

Training Details

Full Model Card (README)