toenobu/utokyo-llm-advance-main-dpo
The toenobu/utokyo-llm-advance-main-dpo model is a fine-tuned version of Qwen/Qwen3-4B-Instruct-2507, developed by toenobu. This 4-billion-parameter model uses Direct Preference Optimization (DPO) to strengthen reasoning, particularly Chain-of-Thought (CoT), and to improve structured response quality. Trained on preference data, it is suited to applications that need coherent, well-aligned outputs with clear logical flow and structured generation.
Model Overview
This model, toenobu/utokyo-llm-advance-main-dpo, is a specialized fine-tune of the Qwen/Qwen3-4B-Instruct-2507 base model. Developed by toenobu, it was trained with Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs. The weights are distributed fully merged in 16-bit precision, so no separate adapter loading is required.
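Because the weights are fully merged, the checkpoint loads like any standard causal LM. A minimal loading sketch, assuming the standard transformers API (bfloat16 is an assumption here; the card only states 16-bit):

```python
# Minimal loading sketch: no PEFT/adapter step is needed because the
# LoRA weights are already merged. bfloat16 is an assumed 16-bit dtype.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "toenobu/utokyo-llm-advance-main-dpo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```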
Key Capabilities & Training
The primary objective of this DPO fine-tuning was to enhance the model's reasoning abilities, particularly Chain-of-Thought (CoT), and to improve overall structured response quality. Training ran for 1 epoch with a learning rate of 2e-07, a DPO beta of 0.5, and a maximum sequence length of 1536 tokens. The LoRA adapter (r=8, alpha=16) was then merged into the base model.
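The original Unsloth training script is not published here; the sketch below reconstructs an equivalent setup with TRL's DPOTrainer and PEFT, using only the hyperparameters stated above. Everything else (output directory, batch settings, LoRA target modules) is an assumption:

```python
# Hedged reconstruction with TRL + PEFT, not the original Unsloth script.
# Only epochs, learning rate, beta, max length, LoRA r/alpha, and the
# dataset come from the card; all other settings are assumed.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "Qwen/Qwen3-4B-Instruct-2507"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# DPO preference dataset named in the card.
train_dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

# LoRA configuration from the card: r=8, alpha=16.
peft_config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")

args = DPOConfig(
    output_dir="utokyo-llm-advance-main-dpo",  # assumed name
    num_train_epochs=1,    # 1 epoch
    learning_rate=2e-7,    # learning rate from the card
    beta=0.5,              # DPO beta from the card
    max_length=1536,       # maximum sequence length from the card
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,  # reference model derived by disabling the adapter
)
trainer.train()

# Merge the trained LoRA adapter into the base weights, matching the
# fully merged checkpoint described above.
merged_model = trainer.model.merge_and_unload()
merged_model.save_pretrained("utokyo-llm-advance-main-dpo")
```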
Use Cases & Licensing
This model is well-suited to applications where improved logical reasoning and structured, aligned outputs are critical, and it can be used directly with the transformers library, as shown below. Training used the u-10bei/dpo-dataset-qwen-cot preference dataset. The model is released under the MIT License; users must also comply with the license terms of the original base model.
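A short end-to-end usage sketch with the transformers chat template; the prompt, dtype, and generation settings are illustrative assumptions:

```python
# Self-contained generation sketch using the model's chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "toenobu/utokyo-llm-advance-main-dpo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user",
     "content": "Explain step by step why the sum of two odd numbers is even."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:],
                       skip_special_tokens=True))
```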