rkumagai/dpo-qwen-cot-merged
Task: Text generation
Model size: 4B parameters
Quantization: BF16
Context length: 32k
Concurrency cost: 1
Published: Feb 7, 2026
License: apache-2.0
Architecture: Transformer (open weights)

The rkumagai/dpo-qwen-cot-merged model is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). The DPO stage is aimed at strengthening Chain-of-Thought (CoT) reasoning and structured response quality: the model is trained to favor preferred response patterns, making it suitable for tasks that benefit from coherent logical flow and well-structured answers.
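For context on what the DPO fine-tuning stage optimizes, the sketch below implements the standard DPO loss for a single preference pair (chosen vs. rejected response) in plain Python. This is the published DPO objective, not code from this model's training run; the function name, toy log-probabilities, and beta value are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen (preferred) and
    rejected responses under the policy being trained and under the
    frozen reference model (here, the base instruct model). beta
    controls how far the policy may drift from the reference.
    """
    # Implicit reward margin: how much more the policy favors the
    # chosen response over the reference, minus the same quantity
    # for the rejected response.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Negative log-sigmoid of the margin: near zero once the policy
    # cleanly prefers the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy example: the policy has shifted probability mass toward the
# chosen response relative to the reference model.
loss = dpo_loss(-12.0, -20.0, -14.0, -18.0, beta=0.1)
```

Minimizing this loss over a dataset of preferred/rejected CoT responses is what nudges the merged model toward the "aligned and coherent" outputs described above.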
