stemask2985/dpo-qwen-cot-merged
The stemask2985/dpo-qwen-cot-merged model is a 4-billion-parameter Qwen3-based language model, fine-tuned with Direct Preference Optimization (DPO) via Unsloth. It is optimized to improve Chain-of-Thought (CoT) reasoning and the quality of structured responses, and is intended for applications that require logical coherence and adherence to preferred output formats.
Model Overview
This model, stemask2985/dpo-qwen-cot-merged, is a 4-billion-parameter variant of the Qwen3-4B-Instruct-2507 base model. It was fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library, targeting enhanced reasoning and structured output generation.
Key Capabilities & Features
- Improved Reasoning (Chain-of-Thought): Optimized to produce more coherent and logical reasoning steps in its responses.
- Enhanced Structured Output: Fine-tuned to align responses with preferred formats, improving the quality of structured data generation.
- DPO Fine-tuning: Utilizes Direct Preference Optimization for better alignment with human preferences.
- Full-Merged Weights: Distributed as a 16-bit merged model, so no adapter loading is needed at deployment (see the loading sketch after this list).
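
Because the weights are already merged, the model should load with the standard transformers API alone, with no PEFT step. The following is a minimal sketch, assuming the usual Qwen chat template; the reasoning prompt and generation settings are illustrative, not taken from the model card.

```python
# Minimal inference sketch: merged 16-bit weights, so no adapter loading.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stemask2985/dpo-qwen-cot-merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # weights ship as a 16-bit merge
    device_map="auto",
)

# Hypothetical reasoning prompt to exercise the CoT behavior
messages = [
    {"role": "user", "content": "A train covers 60 km in 45 minutes. "
                                "What is its average speed in km/h? Think step by step."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```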
Training Details
The model was trained for 1 epoch with a learning rate of 1e-07, a DPO beta of 0.1, and a maximum sequence length of 1024 tokens. Training used the u-10bei/dpo-dataset-qwen-cot preference dataset, which focuses on Chain-of-Thought examples.
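
A minimal sketch of how such a run might look with Unsloth and TRL is below. Only the hyperparameters above (1 epoch, lr 1e-07, beta 0.1, max sequence length 1024, the dataset) come from this card; the LoRA configuration, batch size, 4-bit base loading, and merge call are assumptions, not the author's exact setup.

```python
# Illustrative DPO run with Unsloth + TRL. Only the hyperparameters named
# in the card are reproduced here; all other settings are assumptions.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=1024,   # matches the training setup above
    load_in_4bit=True,     # assumption: QLoRA-style training before the 16-bit merge
)
model = FastLanguageModel.get_peft_model(  # illustrative LoRA config
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        num_train_epochs=1,              # 1 epoch
        learning_rate=1e-7,              # learning rate from the card
        beta=0.1,                        # DPO beta from the card
        per_device_train_batch_size=2,   # assumption
        output_dir="outputs",
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Merge the LoRA adapters into 16-bit weights, matching the published model
model.save_pretrained_merged("dpo-qwen-cot-merged", tokenizer,
                             save_method="merged_16bit")
```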
Ideal Use Cases
- Applications requiring robust reasoning abilities.
- Scenarios where structured and high-quality responses are critical.
- Tasks benefiting from preference-aligned language generation.