The sokosokobe/dpo-qwen-cot-merged model is a 4-billion-parameter Qwen3-based causal language model fine-tuned with Direct Preference Optimization (DPO) to strengthen Chain-of-Thought (CoT) reasoning and structured response quality. It is designed for tasks that require clear logical progression and coherent, well-organized outputs.
## Model Overview
The sokosokobe/dpo-qwen-cot-merged model is a 4 billion parameter language model built upon the Qwen/Qwen3-4B-Instruct-2507 base architecture. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its full 16-bit weights merged for direct use without adapters.
## Key Optimizations
This model's primary optimization focuses on enhancing:
- Reasoning (Chain-of-Thought): Improved ability to generate logical, step-by-step reasoning processes.
- Structured Response Quality: Better coherence and organization in generated outputs, aligning with preferred response formats.
## Training Details
The DPO fine-tuning process involved:
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Method: Direct Preference Optimization (DPO)
- Epochs: 1
- Learning Rate: 1e-07
- Max Sequence Length: 1024
- Training Data: the u-10bei/dpo-dataset-qwen-cot dataset, used for preference alignment.
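The recipe above can be sketched with TRL's DPOTrainer. This is an assumption-laden outline, not the card's actual training script: the card says Unsloth was used, and details such as the dataset split name and output directory below are placeholders.

```python
# Hedged sketch of the DPO recipe listed above, written with plain TRL
# rather than the Unsloth wrapper the card mentions (an assumption).
# The hyperparameters mirror the card: 1 epoch, lr 1e-07, max length 1024.

dpo_hyperparams = {
    "num_train_epochs": 1,
    "learning_rate": 1e-7,
    "max_length": 1024,
}

def build_trainer(base_model="Qwen/Qwen3-4B-Instruct-2507",
                  dataset_name="u-10bei/dpo-dataset-qwen-cot"):
    # Imports are deferred so the hyperparameter dict above is usable
    # even without trl/transformers installed.
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    model = AutoModelForCausalLM.from_pretrained(base_model)
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    # The split name "train" is a placeholder assumption.
    train_dataset = load_dataset(dataset_name, split="train")

    config = DPOConfig(output_dir="dpo-qwen-cot", **dpo_hyperparams)
    return DPOTrainer(model=model, args=config,
                      train_dataset=train_dataset,
                      processing_class=tokenizer)

if __name__ == "__main__":
    build_trainer().train()
```

After training, merging the adapter weights back into the base model (as was done here) yields the standalone 16-bit checkpoint that can be loaded without PEFT adapters.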
## Usage
Because the weights are merged, the model can be loaded directly with the transformers library for inference, with no adapter files required. It is released under the MIT License; use must also comply with the original base model's license terms.
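A minimal inference sketch with transformers is shown below. It follows the standard chat-template pattern for Qwen3 instruct models; the prompt text and generation settings are illustrative assumptions, not part of the card.

```python
# Minimal inference sketch for the merged model via transformers.
# The "think step by step" phrasing is an illustrative CoT-style prompt.

def build_messages(question):
    # Chat-format message list; nudges the model toward step-by-step reasoning.
    return [{"role": "user",
             "content": f"{question} Think step by step before answering."}]

def generate(question, model_id="sokosokobe/dpo-qwen-cot-merged"):
    # Imports deferred so build_messages works without torch installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto")

    inputs = tokenizer.apply_chat_template(
        build_messages(question),
        add_generation_prompt=True,
        return_tensors="pt").to(model.device)
    output = model.generate(inputs, max_new_tokens=512)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:],
                            skip_special_tokens=True)

if __name__ == "__main__":
    print(generate("If a train travels 60 km in 45 minutes, "
                   "what is its average speed in km/h?"))
```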