Model Overview
The hifill/dpo-qwen-cot-merged model is a 4-billion-parameter language model built on the Qwen3-4B-Instruct-2507 base model. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library, and its 16-bit weights are fully merged, so it can be used directly without loading adapters.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, enabling more logical and step-by-step problem-solving.
- Structured Response Generation: Aligned to produce higher quality, more structured outputs based on preferred response patterns.
- DPO Fine-tuning: Leverages DPO to align model behavior with human preferences, focusing on specific response characteristics.
Training Details
The model underwent one epoch of DPO training with a learning rate of 1e-07 and a beta of 0.1, using a maximum sequence length of 1024. Training used a LoRA configuration (r=8, alpha=16) whose adapter weights were subsequently merged into the base model. The DPO preference data came from the u-10bei/dpo-dataset-qwen-cot dataset.
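To make the role of beta concrete, the DPO objective can be sketched in plain Python. This is an illustrative per-example form of the standard DPO loss, not code from this model's training run; the function name and arguments are assumptions:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * reward margin).

    beta (0.1 here, matching this model's training) scales how strongly
    the policy is rewarded for preferring the chosen response relative
    to the frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the reference model does, versus the rejected one.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # Logistic loss on the scaled margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

With a zero margin the loss is -log(0.5) ≈ 0.693; as the policy's preference for the chosen response grows, the loss falls toward zero. The small beta of 0.1 keeps the fine-tuned model close to the reference policy.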
When to Use This Model
This model is particularly well-suited for applications where improved reasoning, logical coherence, and structured output are critical. It can be beneficial for tasks requiring detailed explanations, step-by-step problem-solving, or generating responses that adhere to specific formats. Note that the model is released under the MIT License, per the dataset terms, and users must also comply with the original base model's license.
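Because the LoRA weights are already merged, the model can be loaded as an ordinary Hugging Face checkpoint with no PEFT or adapter step. A minimal loading sketch (the helper function is hypothetical; only the model id comes from this card):

```python
def load_model(model_id: str = "hifill/dpo-qwen-cot-merged"):
    """Load the merged checkpoint directly -- no adapter loading required."""
    # Deferred import so the sketch can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",   # the merged 16-bit weights load as-is
        device_map="auto",    # place layers on available devices
    )
    return tokenizer, model
```

From there, standard chat-style generation with the tokenizer's chat template applies, as with any Qwen3 instruct model.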