nisiwaki/dpo-qwen-cot-merged_01
The nisiwaki/dpo-qwen-cot-merged_01 model is a 4-billion-parameter variant of Qwen3-4B-Instruct-2507, fine-tuned by nisiwaki using Direct Preference Optimization (DPO) via Unsloth. It is optimized to strengthen Chain-of-Thought (CoT) reasoning and improve structured response quality. The model supports a 40,960-token context length and is intended for direct use with the transformers library, with no adapter loading required.
Overview
This model, nisiwaki/dpo-qwen-cot-merged_01, is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, with its 16-bit weights fully merged into the base model for direct deployment.
Key Capabilities
- Enhanced Reasoning: Specifically optimized to improve Chain-of-Thought (CoT) reasoning, making it suitable for tasks requiring logical progression and structured thinking.
- Improved Response Quality: DPO fine-tuning aligns the model's outputs with preferred responses, leading to higher quality and more aligned generations.
- Direct Use: As a fully merged model, it can be used directly with the transformers library without separate adapter loading.
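Since the DPO weights are already merged, the model loads like any standard transformers checkpoint. A minimal usage sketch follows; the prompt and generation settings are illustrative choices, not the author's published recommendations.

```python
# Minimal inference sketch for the merged model using the standard
# transformers chat workflow. max_new_tokens and the example prompt
# are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nisiwaki/dpo-qwen-cot-merged_01"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain step by step: what is 17 * 24?"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    outputs[0][inputs.shape[-1]:], skip_special_tokens=True
)
print(response)
```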
Training Details
The model was trained with an SFT + DPO pipeline, running 3 epochs per stage with a learning rate of 1e-05 and a DPO beta of 0.1. The maximum sequence length during training was 1024 tokens. Training used the u-10bei/dpo-dataset-qwen-cot dataset. The model is released under the MIT License; users must also comply with the base model's license terms.
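The DPO stage described above can be sketched with Unsloth and TRL using the stated hyperparameters (learning rate 1e-05, beta 0.1, 3 epochs, max length 1024). The author's exact script is not published, so this is a configuration sketch under assumptions: the LoRA rank, 4-bit loading, and the dataset's "prompt"/"chosen"/"rejected" column layout (TRL's expected DPO format) are all guesses.

```python
# Configuration sketch of the DPO stage; hyperparameters match the card,
# everything else (LoRA rank, quantization, dataset columns) is assumed.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=1024,       # stated training sequence length
    load_in_4bit=True,         # assumption: typical Unsloth memory setting
)
model = FastLanguageModel.get_peft_model(model, r=16)  # LoRA rank assumed

# Assumes TRL's standard preference format: prompt / chosen / rejected.
dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        beta=0.1,              # stated DPO beta
        learning_rate=1e-5,    # stated learning rate
        num_train_epochs=3,    # stated epochs for the DPO stage
        max_length=1024,
        output_dir="dpo-qwen-cot",
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Merge the LoRA adapters into 16-bit weights for direct deployment,
# matching the fully merged checkpoint this card describes.
model.save_pretrained_merged(
    "dpo-qwen-cot-merged", tokenizer, save_method="merged_16bit"
)
```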