Name: KhaledScience/dpo-qwen-cot-merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: KhaledScience

Model Overview

KhaledScience/dpo-qwen-cot-merged is a 4 billion parameter language model, fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. It leverages Direct Preference Optimization (DPO) via the Unsloth library to enhance its response quality and alignment.

Key Capabilities

Improved Reasoning: Specifically optimized to enhance Chain-of-Thought (CoT) reasoning, leading to more structured and logical outputs.
Aligned Responses: DPO training aligns the model's outputs with preferred examples, improving overall response quality.
Direct Use: Provided as a full-merged 16-bit model, requiring no adapter loading for direct integration with transformers.

Training Details

Methodology: Utilizes DPO with a beta of 0.1 and a learning rate of 1e-07 over 1 epoch.
Dataset: Trained on the u-10bei/dpo-dataset-qwen-cot preference dataset.
Context Length: Supports a maximum sequence length of 1024 tokens during training.

Licensing

This model is released under the MIT License, consistent with the terms of its training data. Users must also adhere to the original base model's license terms.

Overview

Model Overview

Key Capabilities

Training Details

Licensing

Full Model Card (README)