Model Overview
The rk611/dpo-qwen-cot-merged model is a 4-billion-parameter language model based on the Qwen/Qwen3-4B-Instruct-2507 architecture. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library, and its 16-bit weights are fully merged into the base model, so it loads as a standalone checkpoint with no separate adapter-loading step.
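Because the weights are merged, the checkpoint can be loaded like any standalone transformers model. The sketch below assumes the standard transformers chat-template API; the generation settings are illustrative, not part of the release.

```python
# Hypothetical usage sketch: load the merged checkpoint directly --
# no PEFT adapter step is needed. Generation settings are illustrative.
def generate_response(prompt: str, model_id: str = "rk611/dpo-qwen-cot-merged") -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )

    # Qwen3 instruct models expect chat-formatted input.
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    output_ids = model.generate(input_ids, max_new_tokens=1024)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
    )
```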
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, enabling more logical and structured problem-solving.
- Aligned Responses: DPO training aligns the model's outputs with preferred responses, leading to higher quality and more relevant generations.
- Structured Output: Focuses on generating well-structured and coherent responses based on the preference dataset used during training.
Training Details
The model underwent 1 epoch of DPO training with a learning rate of 1e-07 and a beta value of 0.1, using a maximum sequence length of 1024 tokens. Training used the u-10bei/dpo-dataset-qwen-cot preference dataset. The model is released under the MIT License, consistent with the base model's license terms.
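For reference, the DPO objective that this training procedure optimizes can be sketched in plain PyTorch. The beta value of 0.1 matches the configuration above; the helper name and the dummy log-probabilities are illustrative, not taken from the actual training run.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective: -log sigmoid(beta * (policy log-ratio - reference log-ratio)).

    Each argument is the summed log-probability of a chosen/rejected
    response under the policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Dummy log-probabilities: the policy prefers the chosen response
# slightly more than the reference model does, so the loss drops
# below the no-preference baseline of log(2) ~= 0.693.
pc, pr = torch.tensor([-10.0]), torch.tensor([-12.0])
rc, rr = torch.tensor([-11.0]), torch.tensor([-11.0])
loss = dpo_loss(pc, pr, rc, rr, beta=0.1)
```

A small beta, as used here, keeps the policy close to the reference model while still rewarding the preferred responses.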
Good For
- Applications requiring improved logical reasoning and step-by-step thought processes.
- Use cases where response quality and alignment with specific preferences are critical.
- Generating structured and coherent text outputs.