Model Overview
KSIMNB/dpo-qwen-cot-merged is a 4-billion-parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It was fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library to align its outputs with preferred responses. The repository contains fully merged 16-bit weights, so no separate adapter loading is required.
Key Capabilities
- Enhanced Reasoning (Chain-of-Thought): Optimized through DPO to improve the model's ability to generate logical and step-by-step reasoning processes.
- Improved Structured Responses: Produces higher-quality, better-aligned structured outputs, guided by the preference data used during training.
- Direct Use: As a fully merged model, it can be loaded directly with the transformers library; no additional configuration for LoRA adapters is needed.
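Because the 16-bit weights are already merged, a standard `transformers` load is enough. The sketch below assumes the usual Auto-class API and the model's built-in chat template; the prompt and generation settings are illustrative, not values from this card.

```python
# Minimal usage sketch: assumes the standard transformers Auto-class API and
# the chat template shipped with the model. Generation settings are
# illustrative choices, not values from the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "KSIMNB/dpo-qwen-cot-merged"

def build_chat(prompt: str) -> list[dict]:
    # Wrap a user prompt in the message format consumed by apply_chat_template.
    return [{"role": "user", "content": prompt}]

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # Fully merged weights: no PeftModel / LoRA-adapter attach step is needed.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    input_ids = tokenizer.apply_chat_template(
        build_chat(prompt), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated continuation, not the prompt tokens.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```

Step-by-step reasoning tends to be long, so leave `max_new_tokens` generous enough for the full chain of thought.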
Training Details
The model was trained for a single epoch of DPO with a learning rate of 1e-7 and a DPO beta of 0.1, using a maximum sequence length of 1024 tokens. The preference dataset is u-10bei/dpo-dataset-qwen-cot.
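For reference, the DPO objective behind these hyperparameters scores each preference pair by the policy-vs-reference log-probability margin of the chosen response over the rejected one, scaled by beta. The pure-Python sketch below shows the per-pair loss; the function name and scalar interface are illustrative (real trainers such as Unsloth operate on batched token log-probs), but beta = 0.1 matches this training run.

```python
import math

def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # the beta value used for this model's DPO run
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is the summed log-probability of a full response under the
    trained policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log sigmoid(x), written in a numerically stable form for either sign.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy matches the reference model exactly, the margin is zero and the loss equals log 2; widening the chosen-over-rejected margin drives the loss toward zero. The low learning rate of 1e-7 keeps each gradient step on the policy small, with the reference model held frozen throughout.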
When to Use This Model
This model suits use cases where reasoning quality and alignment to a preferred response style are critical. It is a good fit for applications requiring coherent, structured, and logically sound text generation, especially tasks that benefit from Chain-of-Thought reasoning.