pokke11/dpo-qwen-cot-merged is a fine-tuned Qwen3-4B-Instruct-2507 model, optimized with Direct Preference Optimization (DPO) via Unsloth. This 4-billion-parameter model targets improved Chain-of-Thought (CoT) reasoning and higher-quality structured responses, and is intended for tasks that require aligned outputs following preferred response patterns.
Overview
pokke11/dpo-qwen-cot-merged is a specialized language model derived from Qwen/Qwen3-4B-Instruct-2507. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, aiming to align its outputs with preferred responses. This model integrates the full-merged 16-bit weights, eliminating the need for adapter loading.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, making it suitable for complex problem-solving tasks.
- Structured Response Quality: Fine-tuned to produce higher quality and more structured outputs, based on a preference dataset.
- Direct Use: As a fully merged model, it can be used directly with the `transformers` library without additional configuration.
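Because the weights are fully merged, loading the model follows the standard `transformers` pattern, with no PEFT/LoRA adapter step. Below is a minimal inference sketch; the prompt and generation parameters are illustrative assumptions, not values published with the model:

```python
# Minimal inference sketch for pokke11/dpo-qwen-cot-merged.
# Assumes `transformers` and `torch` are installed. The sampling
# settings below are illustrative, not published defaults.

MODEL_ID = "pokke11/dpo-qwen-cot-merged"

def main():
    # Imports are kept inside main() so the sketch can be read and
    # checked without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="auto",   # merged 16-bit weights; no adapters to attach
        device_map="auto",
    )

    messages = [
        {"role": "user", "content": "Solve step by step: what is 17 * 23?"}
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    output = model.generate(input_ids, max_new_tokens=512)
    # Decode only the newly generated tokens.
    print(tokenizer.decode(output[0][input_ids.shape[-1]:],
                           skip_special_tokens=True))

if __name__ == "__main__":
    main()
```

Since the model is tuned for CoT, prompts that explicitly ask for step-by-step reasoning (as above) tend to play to its strengths.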
Training Details
The model was trained for 1 epoch with a learning rate of 5e-07, a DPO beta of 0.1, and a maximum sequence length of 1024, on the u-10bei/dpo-dataset-qwen-cot preference dataset. The model is released under the MIT license; the original base model's license terms also apply.
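For reference, the reported hyperparameters can be gathered into a config dict of the kind you would pass to a DPO trainer. This is a sketch: only the values stated above come from the model card, while the field names are assumptions mirroring common trainer arguments (e.g. trl's `DPOConfig`):

```python
# DPO training hyperparameters reported for this model, collected as a
# plain dict. Field names are assumptions (modeled on trl's DPOConfig);
# only the values come from the model card.
dpo_config = {
    "base_model": "Qwen/Qwen3-4B-Instruct-2507",   # starting checkpoint
    "dataset": "u-10bei/dpo-dataset-qwen-cot",     # preference dataset
    "num_train_epochs": 1,
    "learning_rate": 5e-7,
    "beta": 0.1,           # DPO KL-penalty strength vs. the reference model
    "max_length": 1024,    # maximum sequence length
}
```

The small learning rate and beta are typical for DPO, which nudges the policy toward preferred responses while keeping it close to the reference model.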
When to Use This Model
This model is particularly well-suited for applications where:
- Improved reasoning abilities, especially Chain-of-Thought, are critical.
- High-quality, aligned, and structured responses are preferred.
- Direct integration into existing `transformers` workflows is desired without managing LoRA adapters.