Name: hiro7ka/dpo-qwen-cot-merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: hiro7ka

Overview

This model, hiro7ka/dpo-qwen-cot-merged, is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) with the Unsloth library to enhance its response quality and alignment. The model incorporates full-merged 16-bit weights, eliminating the need for adapter loading.

Key Capabilities

Enhanced Reasoning: Optimized specifically to improve Chain-of-Thought (CoT) reasoning abilities.
Structured Response Quality: Fine-tuned to generate more coherent and well-structured outputs.
DPO Alignment: Utilizes Direct Preference Optimization to align model responses with preferred human outputs.
Efficient Deployment: Provided as a merged model, allowing direct use with the transformers library without additional LoRA adapter loading.

Training Details

The model underwent 1 epoch of DPO training with a learning rate of 2e-07 and a beta value of 0.08. It was trained with a maximum sequence length of 1024 tokens, using the u-10bei/dpo-dataset-qwen-cot dataset. The base model's license terms (MIT License as per the dataset) apply.

Use Cases

This model is particularly well-suited for applications requiring:

Improved logical reasoning in responses.
Generation of structured and high-quality text.
Tasks where alignment with preferred outputs is critical.

Overview

Overview

Key Capabilities

Training Details

Use Cases

Full Model Card (README)