Name: STRV/dpo-qwen-cot-merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: STRV

Model Overview

STRV/dpo-qwen-cot-merged is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged for direct use without adapter loading.

Key Optimizations

This model's primary objective is to enhance its reasoning capabilities, specifically focusing on Chain-of-Thought (CoT) processes, and to improve the quality of structured responses. This optimization was achieved through DPO training on a preference dataset (u-10bei/dpo-dataset-qwen-cot) over one epoch.

Technical Details

Base Model: Qwen/Qwen3-4B-Instruct-2507
Fine-tuning Method: DPO
Max Sequence Length: 1024 (during training)
License: MIT License (derived from the dataset terms), with compliance to the original base model's license terms.

Usage

As a merged model, it can be directly loaded and used with the transformers library for inference, supporting a 32K context length.

Overview

Model Overview

Key Optimizations

Technical Details

Usage

Full Model Card (README)