toenobu/utokyo-llm-advance-main-dpo
The toenobu/utokyo-llm-advance-main-dpo model is a fine-tuned version of Qwen/Qwen3-4B-Instruct-2507, developed by toenobu. This 4-billion-parameter model uses Direct Preference Optimization (DPO) to strengthen reasoning, particularly Chain-of-Thought (CoT), and to improve structured response quality. Trained on preference data, it is suited to applications that need coherent, well-aligned outputs with clear logical flow and structured generation.
Model Overview
This model, toenobu/utokyo-llm-advance-main-dpo, is a specialized fine-tune of the Qwen/Qwen3-4B-Instruct-2507 base model. Developed by toenobu, it was trained with Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs. The weights are distributed fully merged in 16-bit precision, so no separate adapter loading is required.
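Because the weights are fully merged, the checkpoint loads like any standard causal LM. A minimal loading sketch, assuming the standard transformers API (bfloat16 is an assumption here; the card only states 16-bit):

```python
# Minimal loading sketch: no PEFT/adapter step is needed because the
# LoRA weights are already merged. bfloat16 is an assumed 16-bit dtype.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "toenobu/utokyo-llm-advance-main-dpo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```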
Key Capabilities & Training
The primary objective of this DPO fine-tuning was to enhance the model's reasoning abilities, particularly Chain-of-Thought (CoT), and to improve overall structured response quality. Training ran for 1 epoch with a learning rate of 2e-07, a DPO beta of 0.5, and a maximum sequence length of 1536 tokens. The LoRA adapter (r=8, alpha=16) was then merged into the base model.
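The original Unsloth training script is not published here; the sketch below reconstructs an equivalent setup with TRL's DPOTrainer and PEFT, using only the hyperparameters stated above. Everything else (output directory, batch settings, LoRA target modules) is an assumption:

```python
# Hedged reconstruction with TRL + PEFT, not the original Unsloth script.
# Only epochs, learning rate, beta, max length, LoRA r/alpha, and the
# dataset come from the card; all other settings are assumed.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "Qwen/Qwen3-4B-Instruct-2507"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# DPO preference dataset named in the card.
train_dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

# LoRA configuration from the card: r=8, alpha=16.
peft_config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")

args = DPOConfig(
    output_dir="utokyo-llm-advance-main-dpo",  # assumed name
    num_train_epochs=1,    # 1 epoch
    learning_rate=2e-7,    # learning rate from the card
    beta=0.5,              # DPO beta from the card
    max_length=1536,       # maximum sequence length from the card
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,  # reference model derived by disabling the adapter
)
trainer.train()

# Merge the trained LoRA adapter into the base weights, matching the
# fully merged checkpoint described above.
merged_model = trainer.model.merge_and_unload()
merged_model.save_pretrained("utokyo-llm-advance-main-dpo")
```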
Use Cases & Licensing
This model is well-suited to applications where improved logical reasoning and structured, aligned outputs are critical, and it can be used directly with the transformers library, as shown below. Training used the u-10bei/dpo-dataset-qwen-cot preference dataset. The model is released under the MIT License; users must also comply with the license terms of the original base model.
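A short end-to-end usage sketch with the transformers chat template; the prompt, dtype, and generation settings are illustrative assumptions:

```python
# Self-contained generation sketch using the model's chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "toenobu/utokyo-llm-advance-main-dpo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user",
     "content": "Explain step by step why the sum of two odd numbers is even."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:],
                       skip_special_tokens=True))
```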