Name: thetmon/dpo-qwen-cot-merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: thetmon

Model Overview

The thetmon/dpo-qwen-cot-merged model is a 4 billion parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged into the base model, eliminating the need for adapter loading.

Key Capabilities

Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, enabling more structured and logical problem-solving.
Improved Response Quality: Aligned through DPO to produce preferred outputs, focusing on higher quality and more coherent structured responses.
Direct Usage: As a fully merged model, it can be used directly with the transformers library without additional configuration steps.

Training Details

The model underwent 1 epoch of DPO training with a learning rate of 5e-07 and a beta value of 0.1. The maximum sequence length used during training was 2048 tokens. The training utilized a preference dataset specifically designed to improve reasoning and structured output quality.

Usage Considerations

This model is suitable for tasks where robust reasoning and high-quality, structured text generation are critical. Users should adhere to the MIT License terms for the training data and the original base model's license terms for compliance.

Overview

Model Overview

Key Capabilities

Training Details

Usage Considerations

Full Model Card (README)