Name: KazumaTsuboi/dpo-qwen-cot-merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: KazumaTsuboi

Model Overview

KazumaTsuboi/dpo-qwen-cot-merged is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its LoRA adapters merged into the base model for direct use.

Key Optimizations

This model's primary optimization focused on improving two critical areas:

Reasoning (Chain-of-Thought): Enhanced ability to generate logical, step-by-step thought processes.
Structured Response Quality: Improved coherence and formatting of outputs based on preference datasets.

Training Details

The DPO training was conducted for 1 epoch with a learning rate of 1e-06 and a beta value of 0.05. The maximum sequence length used during training was 1024 tokens. The model is provided as full-merged 16-bit weights, eliminating the need for separate adapter loading.

Intended Use

This model is suitable for applications where robust reasoning and high-quality, structured outputs are paramount, particularly in tasks benefiting from Chain-of-Thought prompting.

Overview

Model Overview

Key Optimizations

Training Details

Intended Use

Full Model Card (README)