mohtani777/Qwen3_4B_SFT_DPOv1_DPOv3_agent_v0

Text Generation · Model Size: 4B · Quant: BF16 · Context Length: 32k · Published: Mar 1, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

mohtani777/Qwen3_4B_SFT_DPOv1_DPOv3_agent_v0 is a 4 billion parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). The model is optimized to strengthen reasoning capabilities, particularly Chain-of-Thought, and to improve the quality of structured responses. It is designed for tasks requiring outputs aligned with preference data, offering improved performance in complex reasoning scenarios.


Overview

This model, mohtani777/Qwen3_4B_SFT_DPOv1_DPOv3_agent_v0, is a 4 billion parameter language model derived from the Qwen3-4B-Instruct-2507 base model. It has undergone fine-tuning using Direct Preference Optimization (DPO) via the Unsloth library, resulting in a merged 16-bit weight model that requires no adapter loading.

Key Capabilities & Optimization

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning, enabling more structured and logical problem-solving.
  • Improved Response Quality: Fine-tuned to align responses with preferred outputs, leading to higher quality and more relevant generated text.
  • DPO Training: Trained with DPO using a specific configuration (5 epochs, learning rate 5e-7, beta 0.05, max sequence length 1024) to achieve its specialized performance; a training sketch follows this list.
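
The snippet below is a minimal, hypothetical sketch of how a DPO run with this configuration might look using Unsloth and TRL. Only the hyperparameters in parentheses above come from the model card; the dataset path, LoRA rank, and 4-bit loading are illustrative assumptions, not the author's actual setup.

```python
# Hedged sketch of the described DPO setup (Unsloth + TRL).
# Only epochs, learning rate, beta, and max length are from the card;
# everything else is an assumption for illustration.
from unsloth import FastLanguageModel
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset

# Load the base model named in the card.
model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=1024,
    load_in_4bit=True,  # assumption: common Unsloth memory-saving choice
)
model = FastLanguageModel.get_peft_model(model, r=16)  # assumed LoRA rank

# Assumed placeholder: any preference dataset with
# "prompt"/"chosen"/"rejected" columns.
dataset = load_dataset("path/to/preference_dataset", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        num_train_epochs=5,   # from the card
        learning_rate=5e-7,   # from the card
        beta=0.05,            # DPO temperature, from the card
        max_length=1024,      # from the card
        output_dir="outputs",
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Merge the LoRA adapters into 16-bit weights, matching the card's
# statement that the published model needs no adapter loading.
model.save_pretrained_merged("merged_model", tokenizer, save_method="merged_16bit")
```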

Usage & Licensing

This model can be used directly with the transformers library for inference; because the weights are merged to 16-bit, no adapter loading step is required (see the sketch below). It is licensed under the MIT License, consistent with its training data source, and users must also adhere to the original base model's license terms.
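
A minimal inference sketch with transformers is shown below; the prompt and generation settings are illustrative choices, not values from the model card.

```python
# Hedged inference sketch: load the merged 16-bit model directly,
# no PEFT/adapter loading needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mohtani777/Qwen3_4B_SFT_DPOv1_DPOv3_agent_v0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # BF16, matching the listed quantization
    device_map="auto",
)

# Illustrative prompt exercising step-by-step reasoning.
messages = [{"role": "user", "content": "Explain step by step: what is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```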