ryoto0175/dpo-qwen-cot-merged-v01 is a 4-billion-parameter Qwen3-based causal language model, fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library. The model is optimized for stronger reasoning, particularly Chain-of-Thought (CoT), and for generating structured responses. It is designed for tasks requiring enhanced logical progression and coherent, preference-aligned output.
Model Overview
This model, ryoto0175/dpo-qwen-cot-merged-v01, is a 4-billion-parameter language model built on the Qwen/Qwen3-4B-Instruct-2507 base architecture. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library, and the LoRA weights were merged into the base model at 16-bit precision, so no separate adapter loading is required.
Key Capabilities & Optimization
The primary objective of this model's training was to enhance its ability to generate preferred outputs, with a specific focus on:
- Improved Reasoning: Optimized for Chain-of-Thought (CoT) reasoning, enabling more logical and step-by-step problem-solving.
- Structured Response Quality: Aligned to produce higher quality, structured responses based on the provided preference dataset.
Training Details
- Methodology: Direct Preference Optimization (DPO).
- Dataset: u-10bei/dpo-dataset-qwen-cot for preference alignment.
- Configuration: Trained for 1 epoch with a learning rate of 1e-07, a beta value of 0.1, and a maximum sequence length of 1024 tokens.
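To make the beta = 0.1 setting above concrete, here is a minimal sketch of the per-pair DPO objective in plain Python. The log-probability values are hypothetical, purely for illustration; in practice the Unsloth/TRL trainer computes these from the policy and a frozen reference model.

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the trained policy (pi_*) and the frozen
    reference model (ref_*). beta=0.1 matches this card's config.
    """
    # Implicit reward margin: how much more the policy favors the
    # chosen over the rejected response, relative to the reference.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # Negative log-sigmoid of the scaled margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy identical to the reference: margin is 0, loss is log(2).
loss_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: loss drops.
loss_later = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

A smaller beta, such as the 0.1 used here, flattens the sigmoid and keeps the policy closer to the reference model, which is a common choice when the goal is a gentle preference-alignment pass rather than an aggressive one.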
Usage & Licensing
As a fully merged model, it can be used directly with the transformers library without loading any adapters. The model is released under the MIT License, consistent with its training data, and users must also adhere to the original base model's license terms.
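A minimal loading sketch with transformers follows. The model ID comes from this card; the dtype, device placement, and generation parameters are illustrative assumptions, not prescribed settings.

```python
# Sketch: load the merged model and generate from a chat-formatted prompt.
# Qwen3 instruct models expect the tokenizer's chat template to be applied.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "ryoto0175/dpo-qwen-cot-merged-v01"

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # illustrative; use float16/float32 as needed
        device_map="auto",
    )
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

# Example call (downloads the full 16-bit weights on first use):
# print(generate("Solve step by step: what is 17 * 24?"))
```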