KazumaTsuboi/dpo-qwen-cot-merged_v10
KazumaTsuboi/dpo-qwen-cot-merged_v10 is a 4-billion-parameter causal language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). It is optimized to strengthen Chain-of-Thought (CoT) reasoning and to improve the quality of structured responses, making it suited to applications that require logical coherence and adherence to preferred output formats.
Model Overview
This model, dpo-qwen-cot-merged_v10, was developed by KazumaTsuboi as a fine-tune of Qwen/Qwen3-4B-Instruct-2507, trained with Direct Preference Optimization (DPO) via the Unsloth library.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, leading to more logical and structured outputs.
- Improved Response Quality: Fine-tuned to align responses with preferred outputs, yielding better-structured, higher-quality generations.
- Direct Preference Optimization (DPO): Leverages DPO for alignment, a method known for effectively incorporating human preferences into model behavior.
- Full-Merged Weights: The repository ships full-merged 16-bit weights, so no adapter loading is needed at deployment (see the loading sketch below).
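Because the weights are already merged, the model loads with plain transformers and no PEFT step. Below is a minimal loading sketch; the bfloat16 dtype and `device_map="auto"` are illustrative choices on my part, not settings specified by the card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KazumaTsuboi/dpo-qwen-cot-merged_v10"

# Merged 16-bit weights: no adapter/PEFT loading step is required.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # illustrative; the repo ships 16-bit weights
    device_map="auto",           # requires accelerate; places layers automatically
)
```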
Training Details
The model was trained for 1 epoch with a learning rate of 2e-7, a DPO beta of 0.05, and a maximum sequence length of 1024 tokens. A LoRA adapter (r=16, alpha=16) was used during training and subsequently merged into the base model.
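The sketch below reconstructs this recipe with TRL's DPOTrainer and PEFT as a stand-in for the Unsloth pipeline actually used. The hyperparameters are those reported above; the preference dataset is a placeholder (the card does not name one), and the exact TRL version and remaining arguments are assumptions.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "Qwen/Qwen3-4B-Instruct-2507"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Hyperparameters reported in the card.
args = DPOConfig(
    output_dir="dpo-qwen-cot",
    num_train_epochs=1,
    learning_rate=2e-7,
    beta=0.05,
    max_length=1024,
)

# LoRA configuration reported in the card; the adapter is merged after training.
peft_config = LoraConfig(r=16, lora_alpha=16, task_type="CAUSAL_LM")

# Placeholder dataset: any set with "prompt"/"chosen"/"rejected" columns works.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()

# Fold the LoRA adapter into the base weights, matching the "full-merged" release.
merged = trainer.model.merge_and_unload()
merged.save_pretrained("dpo-qwen-cot-merged")
```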
Good For
- Applications requiring models with strong reasoning abilities.
- Scenarios where structured and high-quality responses are critical.
- Tasks benefiting from models aligned with specific output preferences.
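For reasoning-oriented use, a step-by-step prompt exercises the CoT tuning. The example below is a hypothetical sketch: the prompt wording and generation settings are mine, not from the card; only the model ID and Qwen's chat-template usage are taken from the source.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KazumaTsuboi/dpo-qwen-cot-merged_v10"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Illustrative prompt that invites step-by-step reasoning.
messages = [
    {
        "role": "user",
        "content": "A train travels 120 km in 1.5 hours. "
                   "What is its average speed in km/h? Explain step by step.",
    }
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```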