Name: tmaoshima/dpo-qwen-cot-merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: tmaoshima

Model Overview

This model, tmaoshima/dpo-qwen-cot-merged, is a 4 billion parameter language model based on Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged into the base model.

Key Capabilities & Optimization

Enhanced Reasoning: Optimized specifically to improve Chain-of-Thought (CoT) reasoning, leading to more logical and structured responses.
Improved Response Quality: DPO training aligns the model's outputs with preferred examples, enhancing the overall quality and coherence of generated text.
Direct Usage: As a fully merged model, it can be used directly with the transformers library without requiring adapter loading.

Training Details

The model underwent 1 epoch of DPO training with a learning rate of 2e-07 and a beta value of 0.02. It utilized a maximum sequence length of 512 tokens during training. The training data, u-10bei/dpo-dataset-qwen-cot, was instrumental in guiding the preference optimization process.

Licensing

Users must adhere to the MIT License as per the dataset terms and comply with the original base model's license terms.

Overview

Model Overview

Key Capabilities & Optimization

Training Details

Licensing

Full Model Card (README)