The yusaaihara/llm_dpo model is a 4-billion-parameter language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). It is optimized to strengthen Chain-of-Thought reasoning and improve the quality of structured responses, and is intended for applications that require aligned, high-quality text generation.
Model Overview
yusaaihara/llm_dpo is a 4-billion-parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library, and its 16-bit weights were fully merged into the base model, so no adapter loading is required.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning processes.
- Structured Response Quality: Focuses on generating higher quality, more structured outputs.
- Preference Alignment: Aligned with preferred outputs through DPO training on the u-10bei/dpo-dataset-qwen-cot preference dataset.
Training Details
The model underwent 1 epoch of DPO training with a learning rate of 1e-7, a beta value of 0.1, and a maximum sequence length of 1,024 tokens. The training data was sourced from u-10bei/dpo-dataset-qwen-cot.
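The hyperparameters above map naturally onto TRL's `DPOConfig`/`DPOTrainer`. The original run used Unsloth and its exact training script is not published here, so the following is only an illustrative sketch of how the same settings could be reproduced with TRL:

```python
# Illustrative reconstruction of the DPO run described above using TRL.
# The actual training used Unsloth; argument names below follow TRL's API.

# Hyperparameters as reported on this card.
DPO_HPARAMS = {
    "num_train_epochs": 1,     # 1 epoch of DPO training
    "learning_rate": 1e-7,     # learning rate
    "beta": 0.1,               # DPO KL-penalty strength
    "max_length": 1024,        # maximum sequence length in tokens
}

BASE_MODEL = "Qwen/Qwen3-4B-Instruct-2507"
DATASET = "u-10bei/dpo-dataset-qwen-cot"


def main() -> None:
    # Imports deferred so the recipe can be read without trl installed;
    # requires `pip install trl transformers datasets` plus a GPU in practice.
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    dataset = load_dataset(DATASET, split="train")

    config = DPOConfig(output_dir="llm_dpo", **DPO_HPARAMS)
    trainer = DPOTrainer(
        model=model,
        args=config,
        train_dataset=dataset,
        processing_class=tokenizer,
    )
    trainer.train()


if __name__ == "__main__":
    main()
```

Note that at a learning rate of 1e-7 the single epoch makes only a gentle preference-alignment pass over the base model's weights rather than a full retraining.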
Usage Considerations
This model is ready for direct use with the transformers library. It is released under the MIT license, following the dataset's terms, and users must also comply with the license of the original base model.
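Because the DPO-tuned weights are merged into the checkpoint, the model loads like any standalone transformers model, with no PEFT/adapter step. A minimal sketch (the prompt and generation settings are illustrative, not recommendations):

```python
MODEL_ID = "yusaaihara/llm_dpo"


def build_inputs(tokenizer, question: str):
    """Format a single-turn chat prompt via the tokenizer's chat template."""
    messages = [{"role": "user", "content": question}]
    return tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )


if __name__ == "__main__":
    # Requires `pip install transformers torch`; downloads the model weights.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # No PeftModel / adapter loading needed: the 16-bit DPO weights are merged.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="auto",   # keep the merged 16-bit weights as stored
        device_map="auto",    # place layers on available GPU(s)/CPU
    )

    input_ids = build_inputs(tokenizer, "Explain step by step: what is 17 * 24?")
    output = model.generate(input_ids.to(model.device), max_new_tokens=256)
    # Decode only the newly generated tokens, skipping the prompt.
    print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Using `apply_chat_template` rather than a hand-written prompt keeps the input consistent with the Qwen instruction format the base model expects.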