Name: nyannto/dpo-qwen-cot-merged13 API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: nyannto

Model Overview

nyannto/dpo-qwen-cot-merged13 is a 4 billion parameter language model, fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. It utilizes Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs, focusing on enhancing reasoning and structured response quality.

Key Capabilities

Improved Reasoning (Chain-of-Thought): Optimized to generate more logical and step-by-step reasoning processes.
Enhanced Structured Responses: Aligned to produce higher quality, well-organized outputs based on preference data.
DPO Fine-tuning: Benefits from DPO for better alignment with human preferences.
Full-merged 16-bit weights: Ready for direct use with transformers without requiring adapter loading.

Training Details

The model was trained for 1 epoch with a learning rate of 2e-05 and a maximum sequence length of 768, using the u-10bei/dpo-dataset-qwen-cot dataset. The LoRA configuration (r=8, alpha=16) was merged into the base model.

Usage Considerations

This model is suitable for applications where coherent reasoning and structured, aligned outputs are critical. Users should adhere to the MIT License of the training data and the original base model's license terms.

Overview

Model Overview

Key Capabilities

Training Details

Usage Considerations

Full Model Card (README)