Overview
This model, q-hisa/dpo-qwen-cot-merged-v5, is a 4-billion-parameter language model built on the Qwen3-4B-Instruct-2507 base. It was fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library to align its responses with preferred outputs. Training consisted of an initial Supervised Fine-Tuning (SFT) phase with a LoRA adapter, followed by DPO to further refine the model's behavior.
Key Capabilities
- Enhanced Reasoning: Optimized for Chain-of-Thought (CoT) reasoning, enabling more structured and logical problem-solving.
- Improved Structured Responses: Produces more coherent, higher-quality outputs aligned with the preference dataset.
- Direct Preference Optimization (DPO): Aligns model behavior with human preferences, yielding more helpful responses.
- Full-Merged Weights: The repository provides full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment.
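Because the weights are fully merged, the model can be loaded and run with plain Hugging Face Transformers, with no PEFT or adapter step. Below is a minimal inference sketch; the chat-template usage and generation settings are assumptions, not an official example from this repository.

```python
# Minimal inference sketch for the merged model (assumed usage, not an
# official example). No PEFT/adapter loading is needed because the
# repository ships full-merged 16-bit weights.

MODEL_ID = "q-hisa/dpo-qwen-cot-merged-v5"

def build_messages(question: str) -> list[dict]:
    """Wrap a user question in the chat-message format expected by
    tokenizer.apply_chat_template."""
    return [{"role": "user", "content": question}]

def main() -> None:
    # Imports are local to keep the prompt-building helper importable
    # without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    messages = build_messages(
        "A train travels 120 km in 1.5 hours. What is its average speed?"
    )
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=512)
    print(tokenizer.decode(output[0][inputs.shape[-1]:],
                           skip_special_tokens=True))

if __name__ == "__main__":
    main()
```

For Chain-of-Thought tasks, phrasing the prompt as a step-by-step question (as above) plays to the model's tuned strengths.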
Training Details
The model was trained with DPO for 1 epoch at a learning rate of 1e-07, a beta of 0.05, and a maximum sequence length of 1024 tokens, using the u-10bei/dpo-dataset-qwen-cot preference dataset. The model is released under the MIT License; users must also comply with the original base model's license terms.
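The card reports the hyperparameters but not the exact training script, so the following is only a hypothetical sketch of how the DPO stage could be reproduced with TRL's `DPOTrainer`. The dataset column names and everything not stated above (output directory, batch sizes, use of TRL rather than the original Unsloth pipeline) are assumptions.

```python
# Hypothetical reproduction sketch of the DPO stage using TRL.
# Only the hyperparameters marked "reported" come from the model card;
# all other settings are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "Qwen/Qwen3-4B-Instruct-2507"  # base model named in the card
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Preference dataset named in the card; assumed to expose the standard
# "prompt" / "chosen" / "rejected" columns that DPOTrainer expects.
dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

config = DPOConfig(
    output_dir="dpo-qwen-cot",  # assumed
    num_train_epochs=1,         # reported: 1 epoch
    learning_rate=1e-7,         # reported: 1e-07
    beta=0.05,                  # reported: DPO beta
    max_length=1024,            # reported: max sequence length
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

When no explicit reference model is passed, `DPOTrainer` creates a frozen copy of the policy model to serve as the DPO reference.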
Good For
- Applications requiring robust reasoning capabilities.
- Generating structured and high-quality text responses.
- Tasks where alignment with preferred outputs is crucial.