hallomee/dpo-qwen-cot-merged
Text generation · Concurrency cost: 1 · Model size: 4B · Quant: BF16 · Context length: 32k · Published: Feb 20, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

hallomee/dpo-qwen-cot-merged is a 4-billion-parameter Qwen3-based causal language model fine-tuned with Direct Preference Optimization (DPO) via Unsloth. It is optimized to strengthen Chain-of-Thought (CoT) reasoning and to improve the quality of structured responses, making it suited to applications that require logical coherence and well-formatted output.


Model Overview

hallomee/dpo-qwen-cot-merged is a 4-billion-parameter language model built on the Qwen3 architecture. It was fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO), with the Unsloth library used for efficient training. The DPO-tuned weights are merged into the base model at 16-bit precision, so no separate adapter loading is required.
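Because the checkpoint ships with the DPO weights already merged in 16-bit, it can be loaded directly with the standard Hugging Face `transformers` API, with no PEFT or adapter-merge step. A minimal loading sketch (the repo id comes from this card; the dtype and `device_map` settings are illustrative assumptions, not prescribed by the card):

```python
MODEL_ID = "hallomee/dpo-qwen-cot-merged"

def load_model(model_id: str = MODEL_ID):
    """Load the merged 16-bit checkpoint directly -- no adapter loading needed."""
    # Imports are kept local so the sketch can be read without the heavy
    # torch/transformers dependencies installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # card lists BF16 weights
        device_map="auto",           # place layers across available devices
    )
    return model, tokenizer
```

Since the merge is already baked into the repository, the same call works anywhere a plain Qwen3 checkpoint would, including downstream serving stacks that only accept merged weights.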

Key Capabilities

  • Enhanced Reasoning (Chain-of-Thought): Optimized specifically to improve the model's ability to generate logical, step-by-step reasoning processes.
  • Improved Structured Responses: Focuses on delivering higher quality and more coherent structured outputs.
  • DPO Fine-tuning: Benefits from preference-based learning to align responses with desired output characteristics.
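To draw out the CoT behaviour described above, requests are typically sent in chat format with an explicit step-by-step instruction. A small helper sketch (the system prompt wording is an illustrative assumption, not part of the model card):

```python
def build_cot_messages(question: str) -> list[dict]:
    """Assemble a chat-format request that elicits step-by-step reasoning."""
    return [
        # Hypothetical system prompt; tune the wording for your use case.
        {"role": "system",
         "content": "Reason step by step, then state the final answer."},
        {"role": "user", "content": question},
    ]
```

The resulting message list can be passed to `tokenizer.apply_chat_template(...)` to produce the model's expected prompt string before generation.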

Good For

  • Applications requiring robust reasoning abilities.
  • Tasks where structured and high-quality output formatting is crucial.
  • Scenarios benefiting from models fine-tuned with Direct Preference Optimization for better alignment.