Name: Bunemon/dpo-qwen-cot-merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Bunemon

Model Overview

Bunemon/dpo-qwen-cot-merged is a specialized language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO), leveraging the Unsloth library to align its responses with preferred outputs. This model is provided as a full-merged 16-bit weights package, eliminating the need for adapter loading.

Key Capabilities

Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, making it suitable for complex problem-solving tasks.
Improved Response Quality: Focuses on generating more structured and aligned responses based on the preference dataset used during DPO training.
Direct Usage: As a merged model, it can be directly integrated and used with the transformers library without additional configuration.

Training Details

The model underwent 1 epoch of DPO training with a learning rate of 2e-06 and a beta value of 0.1. It utilized a maximum sequence length of 2048 and a LoRA configuration (r=8, alpha=16) which has been merged into the base model. The training data was sourced from u-10bei/dpo-dataset-qwen-cot.

Licensing

The model operates under the MIT License, consistent with the terms of its training dataset. Users are also required to comply with the original base model's license terms.

Overview

Model Overview

Key Capabilities

Training Details

Licensing

Full Model Card (README)