mohtani777/Qwen3_4B_SFT_DPOv3_agent_v0_LR5E7

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 28, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

mohtani777/Qwen3_4B_SFT_DPOv3_agent_v0_LR5E7 is a 4 billion parameter language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) with the Unsloth library. This model is specifically optimized to improve reasoning capabilities through Chain-of-Thought and enhance structured response quality. It features a 32768 token context length and is designed for tasks requiring aligned, high-quality outputs based on preferred data.

Loading preview...

Model Overview

This model, mohtani777/Qwen3_4B_SFT_DPOv3_agent_v0_LR5E7, is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, specifically targeting improved response alignment and quality.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning processes.
  • Structured Responses: Focuses on generating higher quality and more structured outputs.
  • DPO Fine-tuning: Utilizes DPO with a beta of 0.05 over 5 epochs and a learning rate of 5e-07 to align with preferred outputs.
  • Merged Weights: Provides full-merged 16-bit weights, eliminating the need for adapter loading.

Training Details

The model was trained with a maximum sequence length of 1024 and incorporated LoRA configuration (r=8, alpha=16) which has been merged into the base model. The training data used for DPO was sourced from [u-10bei/dpo-dataset-qwen-cot].

Usage Considerations

This model is ready for direct use with the transformers library. Users should be aware that the model's license is MIT (as per the dataset terms), and compliance with the original base model's license terms is also required.