ykrh/dpo-qwen-cot-merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 23, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

The ykrh/dpo-qwen-cot-merged model is a 4 billion parameter Qwen3-based instruction-tuned causal language model developed by ykrh. It has been fine-tuned using Direct Preference Optimization (DPO) to enhance reasoning capabilities, particularly Chain-of-Thought (CoT), and improve structured response quality. This model is optimized for tasks requiring coherent reasoning and well-structured outputs, leveraging its 32768 token context length.

Loading preview...

Model Overview

The ykrh/dpo-qwen-cot-merged model is a 4 billion parameter language model built upon the Qwen/Qwen3-4B-Instruct-2507 base architecture. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, resulting in a merged 16-bit weight model that requires no adapter loading.

Key Capabilities & Training

This model is specifically optimized to improve reasoning (Chain-of-Thought) and the quality of structured responses. The DPO fine-tuning process aligned the model's outputs with preferred examples from the u-10bei/dpo-dataset-qwen-cot dataset. Training involved 1 epoch with a learning rate of 5e-07 and a maximum sequence length of 1024, utilizing LoRA configuration (r=8, alpha=16) which has been merged into the base weights.

Usage & Licensing

As a fully merged model, it can be directly used with the transformers library for inference. Users should be aware that while the training data is under an MIT License, compliance with the original base model's license terms is also required.