OkamotoJP/dpo-qwen-cot-merged
OkamotoJP/dpo-qwen-cot-merged is a 4 billion parameter Qwen3-based causal language model fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library. It specializes in Chain-of-Thought (CoT) reasoning and structured response generation, and is optimized for tasks requiring aligned, high-quality outputs learned from a preference dataset. It ships as a merged 16-bit weight model ready for direct use.
Model Overview
This model, OkamotoJP/dpo-qwen-cot-merged, is a 4 billion parameter language model based on Unsloth/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) with the Unsloth library to enhance its response quality and alignment.
Key Capabilities
- Improved Reasoning: Optimized to generate better Chain-of-Thought (CoT) reasoning, leading to more logical and structured outputs.
- Preference Alignment: Trained with DPO to align its responses with preferred examples, resulting in higher quality and more desirable outputs.
- Direct Use: Provided as a fully merged 16-bit weight model, eliminating the need for adapter loading and simplifying deployment with transformers.
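Because the weights are merged, the model can be loaded like any standard causal LM. The following is a minimal sketch using the Hugging Face transformers chat-template API; the prompt and generation settings are illustrative assumptions, not values prescribed by the model card:

```python
# Minimal inference sketch for the merged model via transformers.
# Prompt text and generation settings below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OkamotoJP/dpo-qwen-cot-merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # loads the merged 16-bit weights as stored
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain step by step why the sky is blue."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)
```

No adapter or PEFT loading step is required, since the LoRA weights have already been merged into the checkpoint.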
Training Details
The model underwent 1 epoch of DPO training with a learning rate of 1e-07 and a beta value of 0.1. It utilized a maximum sequence length of 1024 and incorporated LoRA configurations (r=8, alpha=16) which have been merged into the base model. The training data used was u-10bei/dpo-dataset-qwen-cot.
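Since the exact training script is not published, the configuration above can only be sketched. The following uses Unsloth with TRL's DPOTrainer and plugs in the hyperparameters stated in this card; every other name and default here (output paths, split, omitted trainer arguments) is an assumption:

```python
# Hedged sketch of the stated DPO setup with Unsloth + TRL.
# Hyperparameters (lr=1e-7, beta=0.1, max_length=1024, r=8, alpha=16,
# 1 epoch) come from the card; everything else is an assumption.
from unsloth import FastLanguageModel
from trl import DPOConfig, DPOTrainer
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Instruct-2507",
    max_seq_length=1024,
)
# LoRA adapters matching the card: r=8, alpha=16 (merged after training).
model = FastLanguageModel.get_peft_model(model, r=8, lora_alpha=16)

dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        beta=0.1,             # DPO temperature from the card
        learning_rate=1e-7,
        num_train_epochs=1,
        max_length=1024,
        output_dir="dpo-qwen-cot",   # assumed path
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Merge the LoRA weights and save the full 16-bit checkpoint,
# producing a model that loads without adapters.
model.save_pretrained_merged(
    "dpo-qwen-cot-merged", tokenizer, save_method="merged_16bit"
)
```

The final merge step is what allows the published checkpoint to be used directly with transformers, as described under Key Capabilities.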
Licensing
This model is released under the MIT License, consistent with its training dataset. Users must also adhere to the original base model's license terms.