rokugatsu/dpo-qwen-cot-merged

Text Generation · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: Feb 4, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

The rokugatsu/dpo-qwen-cot-merged model is a 4-billion-parameter Qwen3-based causal language model, fine-tuned by rokugatsu with Direct Preference Optimization (DPO) via Unsloth. It is optimized to strengthen Chain-of-Thought (CoT) reasoning and the quality of structured responses, making it suited to applications that require logical coherence and well-formed outputs.


Model Overview

The model is built on the Qwen/Qwen3-4B-Instruct-2507 base and fine-tuned by rokugatsu using Direct Preference Optimization (DPO), leveraging the Unsloth library to align its responses with preferred outputs. This repository provides fully merged 16-bit (BF16) weights, so no separate adapter loading is required.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, leading to more logical and coherent outputs.
  • Improved Structured Responses: Focuses on enhancing the quality of structured responses based on a preference dataset.
  • Direct Preference Optimization (DPO): Utilizes DPO for alignment, aiming for better human preference adherence.
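To make the DPO alignment step above concrete, the following is an illustrative, standalone implementation of the standard per-example DPO loss. It is not code from this repository; the β value and log-probabilities in the comments are hypothetical:

```python
import math


def dpo_loss(beta: float,
             logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each margin is the policy log-prob minus the reference (frozen base) log-prob
    of the same completion. Lower loss = policy prefers the chosen response more
    strongly than the reference does.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# Hypothetical sequence log-probabilities: the policy favors the chosen response.
loss = dpo_loss(beta=0.1,
                logp_chosen=-1.0, logp_rejected=-2.0,
                ref_logp_chosen=-1.5, ref_logp_rejected=-1.5)
```

Increasing the policy's preference for the chosen response over the rejected one drives this loss toward zero, which is the mechanism DPO uses to align the model with the preference dataset.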

Training Details

  • Base Model: Qwen/Qwen3-4B-Instruct-2507
  • Methodology: Direct Preference Optimization (DPO)
  • Training Data: u-10bei/dpo-dataset-qwen-cot
  • Configuration: Trained for 1 epoch with a learning rate of 1e-07 and a max sequence length of 1024.
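The reported hyperparameters can be collected into a single configuration dict, as one might pass to a DPO training loop. The field names below follow TRL's DPOConfig conventions, which is an assumption; the exact training script is not published on the card:

```python
# Hyperparameters stated on the model card, expressed as a plain dict.
# Field names mirror TRL's DPOConfig (assumed; not confirmed by the card).
dpo_hparams = {
    "num_train_epochs": 1,       # trained for 1 epoch
    "learning_rate": 1e-7,       # reported learning rate
    "max_length": 1024,          # max sequence length during training
}
```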

Usage Considerations

Because the weights are fully merged, this model can be loaded directly with the transformers library; no adapter files are needed. Users must comply with the MIT license of the training dataset and the license terms of the original base model.
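Direct use with transformers can be sketched as below. This is a minimal example assuming the standard AutoTokenizer/AutoModelForCausalLM APIs and the tokenizer's built-in chat template; it is not an official snippet from the repository:

```python
MODEL_ID = "rokugatsu/dpo-qwen-cot-merged"


def generate_reply(messages, max_new_tokens=512):
    """Load the merged model and generate a chat completion.

    Imports are local so the sketch is cheap to define; model download and
    loading happen on the first call.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    # The tokenizer's chat template formats the conversation for Qwen3.
    prompt = tok.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tok.decode(out[0][inputs["input_ids"].shape[1]:],
                      skip_special_tokens=True)
```

A typical call would be `generate_reply([{"role": "user", "content": "Explain step by step why the sky is blue."}])`, where the step-by-step phrasing plays to the model's CoT tuning.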