Name: rmbrain/dpo-qwen-cot-merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: rmbrain

Model Overview

rmbrain/dpo-qwen-cot-merged is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its LoRA adapters merged into the base model for direct use.

Key Capabilities

Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, enabling more logical and step-by-step problem-solving.
Structured Response Quality: DPO training specifically targeted at aligning responses with preferred outputs, leading to higher quality and more structured generated text.
Efficient Deployment: Provided as a full-merged 16-bit model, eliminating the need for adapter loading and simplifying integration with the transformers library.

Training Details

The model underwent 1 epoch of DPO training with a learning rate of 1e-07 and a beta value of 0.1. It utilized a maximum sequence length of 1024. The training data was sourced from the u-10bei/dpo-dataset-qwen-cot dataset. The model operates under an MIT License, with users also required to comply with the original base model's license terms.

Overview

Model Overview

Key Capabilities

Training Details

Full Model Card (README)