sallm/dpo_qm3_3_step20_qwen-cot-merged

Text Generation · Concurrency Cost: 1 · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: Feb 15, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

The sallm/dpo_qm3_3_step20_qwen-cot-merged model is a 4 billion parameter language model based on Qwen/Qwen3-4B-Instruct-2507, fine-tuned using Direct Preference Optimization (DPO) with the Unsloth library. It is optimized to strengthen reasoning capabilities, particularly Chain-of-Thought (CoT), and to improve the quality of structured responses. The model ships with fully merged 16-bit weights, so no adapter loading is required, and it is suited to applications that demand logical coherence and well-formed structured output.


Overview

sallm/dpo_qm3_3_step20_qwen-cot-merged is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, specifically targeting an improvement in reasoning abilities, particularly Chain-of-Thought (CoT), and the generation of higher-quality structured responses. This model is distributed with its full 16-bit weights merged, simplifying deployment as no separate adapter loading is required.
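Because the weights ship fully merged, the model can be loaded directly with the transformers library, with no PEFT or LoRA adapter step. A minimal loading sketch, assuming transformers and torch are installed; the prompt and generation settings are illustrative choices, not values from the model card:

```python
MODEL_ID = "sallm/dpo_qm3_3_step20_qwen-cot-merged"


def build_messages(question: str) -> list[dict]:
    """Build a chat-format message list for the tokenizer's chat template."""
    return [{"role": "user", "content": question}]


def main() -> None:
    # Heavy imports kept inside main() so the helper above stays importable
    # without pulling in torch/transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="bfloat16",  # weights are published in BF16; no adapter loading needed
        device_map="auto",
    )
    inputs = tokenizer.apply_chat_template(
        build_messages("If a train travels 60 km in 45 minutes, what is its speed in km/h?"),
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=512)
    # Decode only the newly generated tokens, skipping the prompt.
    print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))


if __name__ == "__main__":
    main()
```

At 4B parameters in BF16, the model needs roughly 8 GB of accelerator memory for inference, so `device_map="auto"` will fall back to CPU on smaller GPUs.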

Key Capabilities

  • Enhanced Reasoning: Optimized for better logical progression and Chain-of-Thought capabilities.
  • Improved Structured Responses: Designed to produce more coherent and well-formed structured outputs.
  • Simplified Deployment: Provided as a fully merged model, ready for direct use with transformers without LoRA adapter management.

Good for

  • Applications requiring robust reasoning and problem-solving.
  • Tasks where structured and logically sound outputs are critical.
  • Developers seeking a Qwen3-4B variant with enhanced CoT and response quality through DPO.
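When consuming Chain-of-Thought output downstream, it often helps to separate the reasoning trace from the final answer. A small helper sketch; the "Final answer:" marker is an assumption about your own prompting convention, not something the model card specifies:

```python
def extract_final_answer(text: str, marker: str = "Final answer:") -> str:
    """Return the text after the last occurrence of `marker`.

    Falls back to the whole stripped text when the marker is absent,
    which covers free-form responses.
    """
    idx = text.rfind(marker)
    if idx == -1:
        return text.strip()
    return text[idx + len(marker):].strip()


# Example: a CoT-style response with reasoning followed by a marked answer.
response = (
    "The train covers 60 km in 45 minutes, which is 0.75 hours.\n"
    "Speed = 60 / 0.75 = 80 km/h.\n"
    "Final answer: 80 km/h"
)
print(extract_final_answer(response))  # → 80 km/h
```

Using `rfind` rather than `find` makes the parse robust to the marker phrase appearing inside the reasoning itself.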