## Model Overview
This model, reiwa7/dpo-qwen-cot-merged-s250, is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, and its LoRA adapters (r=8, alpha=16) have been fully merged into the base model, so it can be used directly without loading adapters separately.
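To illustrate what "fully merged" means, here is a minimal sketch of folding a LoRA adapter into a base weight matrix. The dimensions and random values are illustrative; only the r=8 and alpha=16 hyperparameters come from this model card.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16            # r=8, alpha=16 as in this model's adapters

W = rng.standard_normal((d_out, d_in)).astype(np.float32)        # frozen base weight
A = rng.standard_normal((r, d_in)).astype(np.float32)            # LoRA down-projection
B = (rng.standard_normal((d_out, r)) * 0.01).astype(np.float32)  # LoRA up-projection

scale = alpha / r                                # = 2.0
W_merged = W + scale * (B @ A)                   # adapter folded into the base weight

# The merged matrix reproduces the adapted forward pass exactly:
x = rng.standard_normal(d_in).astype(np.float32)
assert np.allclose(W_merged @ x, W @ x + scale * (B @ (A @ x)), atol=1e-4)
```

Because the low-rank update is absorbed into `W_merged`, inference needs no PEFT runtime and incurs no extra matmul per layer.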
## Key Capabilities & Optimization
The primary objective of this DPO fine-tuning was to align the model's responses with preferred outputs, specifically focusing on:
- Enhanced Reasoning: Significant improvement in Chain-of-Thought (CoT) capabilities, allowing for more logical and step-by-step problem-solving.
- Structured Response Quality: Optimization for generating higher quality and more structured outputs based on preference datasets.
## Training Details
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Methodology: Direct Preference Optimization (DPO)
- Training Data: the u-10bei/dpo-dataset-qwen-cot dataset
- Configuration: trained for 1 epoch with a learning rate of 5e-05, a DPO beta of 0.067, and a maximum sequence length of 1024 tokens
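For readers unfamiliar with how the beta value above enters training, here is a minimal sketch of the per-pair DPO loss. The log-probability values are placeholders; only beta=0.067 comes from this model card.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.067) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))).

    beta scales how strongly the policy is pushed away from the frozen
    reference model toward the preferred (chosen) response.
    """
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid(logits)

# When the policy matches the reference, the margin is 0 and the loss is ln(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # → 0.6931...
```

Lowering beta (0.067 here, vs. the common 0.1 default) loosens the KL-style tether to the reference model, allowing larger preference-driven updates.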
## Usage Considerations
As a fully merged model, it can be loaded and used directly with the transformers library. The model is released under the MIT License, inherited from its training data; users must also comply with the original base model's license terms.
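A minimal usage sketch with transformers follows. The prompt is illustrative; the generation parameters are ordinary defaults, not values from this model card.

```python
MODEL_ID = "reiwa7/dpo-qwen-cot-merged-s250"

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Run one chat turn against the merged checkpoint.

    No PEFT or adapter loading is needed: the DPO LoRA weights are already
    merged into the base model. Imports are kept inside the function so the
    sketch can be read without transformers installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )

if __name__ == "__main__":
    print(generate("Solve step by step: what is 17 * 24?"))
```

Since the checkpoint keeps the base model's chat template, `apply_chat_template` formats the conversation exactly as during instruction tuning.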