OguraHiroyuki/dpo-qwen-cot-mergedv4

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 24, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

OguraHiroyuki/dpo-qwen-cot-mergedv4 is a fine-tuned Qwen3-4B-Instruct-2507 model, optimized using Direct Preference Optimization (DPO) via Unsloth. This 4 billion parameter model focuses on improving reasoning through Chain-of-Thought (CoT) and enhancing structured response quality. It is designed for applications requiring aligned and coherent text generation, particularly in conversational AI and instruction following.

Loading preview...

Model Overview

OguraHiroyuki/dpo-qwen-cot-mergedv4 is a 4 billion parameter language model, fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. It leverages Direct Preference Optimization (DPO), implemented with the Unsloth library, to align its outputs with preferred responses.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning abilities.
  • Structured Response Quality: Focuses on generating higher quality, more structured outputs based on preference datasets.
  • Instruction Following: Designed for better adherence to instructions, making it suitable for conversational and task-oriented AI.

Training Details

The model was trained for 1 epoch with a learning rate of 1e-06 and a beta value of 0.1, using a maximum sequence length of 1024. The training utilized the u-10bei/dpo-dataset-qwen-cot dataset. The LoRA configuration (r=8, alpha=16) was merged into the base model, providing full 16-bit weights without requiring adapter loading.

Usage

This merged model can be directly used with the transformers library, simplifying deployment for inference tasks. It is licensed under the MIT License, with users also required to comply with the original base model's license terms.