Name: OguraHiroyuki/dpo-qwen-cot-merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: OguraHiroyuki

Model Overview

This model, OguraHiroyuki/dpo-qwen-cot-merged, is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, incorporating a specific preference dataset to guide its learning.

Key Capabilities

Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, enabling more structured and logical problem-solving.
Improved Response Quality: Focuses on generating higher-quality, aligned, and structured outputs based on preferred examples.
Full-Merged Weights: Provided as full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment.

Training Details

The model underwent DPO training for 1 epoch with a learning rate of 1e-07 and a beta value of 0.1. It was trained with a maximum sequence length of 1024, utilizing a LoRA configuration (r=8, alpha=16) that was subsequently merged into the base model. The training data used is sourced from [u-10bei/dpo-dataset-qwen-cot].

Usage

This model can be directly integrated and used with the transformers library for inference, supporting a context length of 32768 tokens. Users should adhere to the MIT License of the training data and the original base model's license terms.

Overview

Model Overview

Key Capabilities

Training Details

Usage

Full Model Card (README)