Model Overview
The sfutenma/dpo-qwen3_4b-cot-merged model is a 4-billion-parameter language model built on the Qwen/Qwen3-4B-Instruct-2507 base model. It has undergone Direct Preference Optimization (DPO) with the Unsloth library, targeting improvements in reasoning and structured response generation. The repository contains the fully merged 16-bit weights, so no adapter loading is required.
Key Capabilities
- Enhanced Reasoning (Chain-of-Thought): Optimized to produce more logical and step-by-step reasoning in its outputs.
- Improved Structured Responses: Fine-tuned to generate higher quality, well-organized, and coherent text.
- Direct Preference Optimization (DPO): Trained with a preference-based alignment method so that responses better match the desired output characteristics.
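To make the DPO objective concrete, the helper below is a plain-Python sketch of the per-example loss; it is an illustration, not code from this model's actual training run. The `beta=0.1` default matches the value reported in the training details.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is a summed log-probability of a full response under either
    the policy being trained or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)); small when the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy and reference assign identical log-probabilities, the loss is log 2; it falls as the policy shifts probability mass toward the preferred ("chosen") response relative to the reference.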
Training Details
The model was trained for 1 epoch with a learning rate of 1e-07, a DPO beta of 0.1, and a maximum sequence length of 1024. Training used LoRA adapters (r=8, alpha=16), which were then merged into the base model. The DPO preference data came from the u-10bei/dpo-dataset-qwen-cot dataset.
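The hyperparameters above can be collected into a training configuration. The sketch below uses TRL's `DPOConfig` (which Unsloth's DPO workflow builds on); the `output_dir` name and the choice of TRL rather than the exact Unsloth calls are assumptions for illustration.

```python
# Config fragment mirroring the reported hyperparameters; not the original
# training script. output_dir is a placeholder.
from trl import DPOConfig

config = DPOConfig(
    num_train_epochs=1,      # reported: 1 epoch
    learning_rate=1e-7,      # reported: 1e-07
    beta=0.1,                # reported: DPO beta 0.1
    max_length=1024,         # reported: max sequence length 1024
    output_dir="dpo-qwen3_4b-cot",  # assumed name
)
```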
Usage Considerations
Because the weights are fully merged, the model can be loaded directly with the transformers library; no additional adapter loading is needed. Users should adhere to the MIT license of the training data and to the license terms of the original base model.
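A minimal loading sketch with transformers is shown below. The repo id comes from this card; the prompt and generation settings are illustrative, and loading is wrapped in a function so the heavy download happens only when it is called.

```python
# Minimal usage sketch: load the merged 16-bit weights directly,
# with no PEFT adapter step. Generation settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "sfutenma/dpo-qwen3_4b-cot-merged"

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Download the model (on first call) and generate a chat response."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

if __name__ == "__main__":
    print(generate("Explain step by step why the sum of two odd numbers is even."))
```

Using the chat template is important here, since the base model is an instruct-tuned checkpoint and expects its chat formatting.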