Name: Okada0311/dpo-qwen-cot-merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Okada0311

Model Overview

Okada0311/dpo-qwen-cot-merged is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) with the Unsloth library, integrating the full 16-bit weights directly into the base model, eliminating the need for adapter loading.

Key Capabilities

Enhanced Reasoning: Optimized through DPO to improve Chain-of-Thought (CoT) reasoning, enabling more logical and step-by-step problem-solving.
Improved Response Quality: Focuses on generating higher quality, more structured, and aligned outputs based on a preference dataset.
Direct Use: As a fully merged model, it can be used directly with the transformers library without additional configuration for LoRA adapters.

Training Details

The model underwent 1 epoch of DPO training with a learning rate of 5e-06 and a beta value of 0.1. It utilized a maximum sequence length of 512 tokens and was trained on the u-10bei/dpo-dataset-qwen-cot dataset. The license for this model follows the MIT License, with users also required to comply with the original base model's license terms.

Good For

Applications requiring strong reasoning and logical inference.
Generating structured and coherent text responses.
Tasks where response quality and alignment with preferred outputs are critical.

Overview

Model Overview

Key Capabilities

Training Details

Good For

Full Model Card (README)