mohtani777/Qwen3_4B_SFT_DPOv1_agent_v0

Text generation · Concurrency cost: 1 · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Feb 27, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

mohtani777/Qwen3_4B_SFT_DPOv1_agent_v0 is a 4 billion parameter language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). It is optimized to strengthen Chain-of-Thought reasoning and improve the quality of structured responses, making it suited to applications that require coherent outputs aligned with preference data.


Overview

This model, mohtani777/Qwen3_4B_SFT_DPOv1_agent_v0, is a 4 billion parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library. The repository provides fully merged 16-bit weights, so no adapter loading is required.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning processes.
  • Structured Response Quality: Focuses on generating higher quality and more structured outputs.
  • Preference Alignment: Aligned with preferred outputs through DPO training on a specific preference dataset.

Training Details

The model was trained for 5 epochs with a learning rate of 1e-06, a DPO beta of 0.05, and a maximum sequence length of 1024 tokens. Training used LoRA (r=8, alpha=16), and the adapters were merged into the base model at the end of fine-tuning. The training data is sourced from u-10bei/dpo-dataset-qwen-cot.
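To make the role of the beta value concrete, the sketch below computes the standard DPO objective (as introduced by Rafailov et al. and implemented in common trainers) with beta=0.05. The log-probability values are hypothetical; this is an illustration of the loss formula, not the actual training code.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.05) -> float:
    """Standard DPO loss: -log sigmoid(beta * (policy log-ratio margin))."""
    # How much the policy has moved relative to the frozen reference model
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # beta scales how strongly the margin between chosen/rejected is rewarded
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy has not moved from the reference, the loss is log(2)
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # ≈ 0.6931
# Raising the chosen completion's log-prob lowers the loss
print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```

A small beta such as 0.05 keeps the implicit KL penalty loose, letting the policy drift further from the reference model in exchange for stronger preference fitting.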

Usage

Because the weights are fully merged, the model can be loaded directly with the transformers library for inference, with no PEFT or adapter step. Users must comply with the MIT License of the training data and the license terms of the original base model.
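A minimal inference sketch using the standard transformers API is shown below. The generation parameters (`max_new_tokens`, dtype, device placement) are illustrative defaults, not values prescribed by the model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mohtani777/Qwen3_4B_SFT_DPOv1_agent_v0"

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Load the merged model and return a chat completion for `prompt`."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    # The tokenizer's built-in chat template handles the Qwen message format
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens, keep only the newly generated completion
    return tokenizer.decode(output[0][input_ids.shape[-1]:],
                            skip_special_tokens=True)

# Example call (downloads ~8 GB of BF16 weights on first use):
# print(generate("Explain step by step why 17 is prime."))
```

Since the card highlights Chain-of-Thought reasoning, prompts that explicitly ask for step-by-step answers are a natural fit.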