mohtani777/Qwen3_4B_SFT_DPOv3_agent_v0_LR1E7

Text Generation · Model Size: 4B · Quant: BF16 · Context Length: 32k · Published: Feb 28, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

mohtani777/Qwen3_4B_SFT_DPOv3_agent_v0_LR1E7 is a 4-billion-parameter Qwen3-based instruction-tuned language model, fine-tuned by mohtani777 using Direct Preference Optimization (DPO). The model targets stronger Chain-of-Thought reasoning and higher-quality structured responses, and is intended for tasks that require refined conversational outputs and logical reasoning.


Overview

This model, mohtani777/Qwen3_4B_SFT_DPOv3_agent_v0_LR1E7, is a 4-billion-parameter language model built on the Qwen/Qwen3-4B-Instruct-2507 base. It was fine-tuned by mohtani777 using Direct Preference Optimization (DPO) via the Unsloth library. Fine-tuning ran for 5 epochs with a learning rate of 1e-07 and a beta of 0.05, targeting improved alignment of responses with preferred outputs.
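The training setup above can be sketched as a hyperparameter block. The field names below mirror the style of TRL's `DPOConfig` but are illustrative only; just the values (5 epochs, LR 1e-07, beta 0.05, and the max sequence length of 1024 from the Technical Details section) come from this card.

```python
# Illustrative hyperparameter sketch, not the author's actual training script.
# Field names loosely follow trl.DPOConfig; values are taken from this card.
dpo_hyperparams = {
    "num_train_epochs": 5,
    "learning_rate": 1e-7,   # the "LR1E7" suffix in the model name
    "beta": 0.05,            # DPO KL-regularization strength
    "max_length": 1024,      # max sequence length used during fine-tuning
}

# A learning rate this small means DPO only gently nudges the SFT policy
# toward the preferred responses, limiting drift from the base model.
assert dpo_hyperparams["learning_rate"] < 1e-5
```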

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning abilities.
  • Structured Responses: Focuses on generating higher quality, structured outputs.
  • DPO Fine-tuning: Leverages Direct Preference Optimization for better alignment with human preferences.
  • Full-Merged Weights: Provides full-merged 16-bit weights, eliminating the need for adapter loading.
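Because the weights are full-merged 16-bit, the model should load directly with Hugging Face `transformers` and no PEFT adapter step. A minimal sketch, assuming standard `transformers`/`torch` usage (`device_map="auto"` is an assumption, not stated on this card; imports are deferred inside the function so the snippet can be defined without a GPU):

```python
# Hedged loading sketch for the full-merged 16-bit checkpoint.
repo_id = "mohtani777/Qwen3_4B_SFT_DPOv3_agent_v0_LR1E7"

def load_model(repo: str = repo_id):
    """Load tokenizer and merged-weight model; no adapter loading needed."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(
        repo,
        torch_dtype=torch.bfloat16,  # BF16, matching the published weights
        device_map="auto",           # assumption: let accelerate place layers
    )
    return tokenizer, model
```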

Good For

  • Applications requiring models with refined reasoning skills.
  • Use cases where structured and high-quality conversational responses are critical.
  • Developers looking for a Qwen3-based model with DPO-enhanced performance in agentic or instructional contexts.

Technical Details

  • Base Model: Qwen/Qwen3-4B-Instruct-2507
  • Optimization Method: DPO
  • Max Sequence Length (fine-tuning): 1024 (the base model itself supports a 32k context)
  • License: MIT (derived from the dataset terms), subject to the license terms of the original base model.
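For reference, DPO optimizes a logistic loss on the margin between policy-vs-reference log-probability ratios for chosen and rejected responses; the beta of 0.05 used here scales that margin. A toy sketch with made-up log-probabilities (not from this model):

```python
import math

def dpo_loss(beta, logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)), where margin is the
    chosen-vs-rejected difference in policy-vs-reference log-ratios."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Toy numbers: the policy prefers the chosen response more than the
# reference model does, so the margin is positive and the loss is below
# the zero-margin value of ln(2) ~= 0.693.
loss = dpo_loss(beta=0.05, logp_w=-10.0, logp_l=-14.0,
                ref_logp_w=-11.0, ref_logp_l=-13.0)
```

With a small beta such as 0.05, even a sizable margin produces only a mild gradient, which matches the gentle preference alignment this card describes.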