The hallomee/dpo-qwen-cot-merged model is a 4-billion-parameter Qwen3-based causal language model, fine-tuned with Direct Preference Optimization (DPO) via Unsloth. It is optimized to strengthen Chain-of-Thought (CoT) reasoning and improve the quality of structured responses, making it suited to applications that require logical coherence and refined output formatting.
Model Overview
The hallomee/dpo-qwen-cot-merged model is a 4-billion-parameter language model built on the Qwen3 architecture. It was fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO), with the Unsloth library providing efficient training. The DPO weights are merged into the base model and stored in 16-bit precision, so no separate adapter loading is required.
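
Because the weights are already merged, the model can be loaded like any standard Hugging Face causal LM, with no PEFT adapter attachment step. The sketch below is a minimal, illustrative example; the dtype and device settings are assumptions, not part of the model card.

```python
# Minimal loading sketch (assumes transformers + a CUDA or CPU setup;
# dtype/device choices are illustrative, not documented by the model card).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hallomee/dpo-qwen-cot-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # merged 16-bit weights; use float16 if bf16 is unsupported
    device_map="auto",           # requires `accelerate`; omit to load on a single device
)
```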
Key Capabilities
- Enhanced Reasoning (Chain-of-Thought): Optimized to generate logical, step-by-step reasoning (see the generation sketch after this list).
- Improved Structured Responses: Focuses on delivering higher quality and more coherent structured outputs.
- DPO Fine-tuning: Benefits from preference-based learning to align responses with desired output characteristics.
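
Continuing from the loading sketch above, the snippet below shows one way to elicit step-by-step reasoning through the chat template. It assumes the tokenizer ships a Qwen-style chat template; the prompt wording and sampling parameters are illustrative.

```python
# Hedged generation sketch: reuses `tokenizer` and `model` from the loading example.
messages = [
    {"role": "user",
     "content": "A train travels 120 km in 1.5 hours. What is its average speed? "
                "Think step by step."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
)

# Decode only the newly generated tokens (the step-by-step answer).
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))
```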
Good For
- Applications requiring robust reasoning abilities.
- Tasks where structured and high-quality output formatting is crucial.
- Scenarios benefiting from models fine-tuned with Direct Preference Optimization for better alignment.