Overview
This model, mohtani777/Qwen3_4B_SFTV5_DPOv3_agent_v0_LR1E6, is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It was fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library, and the resulting weights were fully merged into a single 16-bit model, so no adapter loading is required.
Key Optimizations
The primary objective of the DPO training was to align the model's responses with preferred outputs, focusing on two areas:
- Enhanced Reasoning: Improved Chain-of-Thought capabilities.
- Structured Response Quality: Better generation of structured outputs based on a preference dataset.
Training Details
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Methodology: Direct Preference Optimization (DPO)
- Epochs: 5
- Learning Rate: 1e-06
- Max Sequence Length: 1024
- Training Data: u-10bei/dpo-dataset-qwen-cot
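The DPO objective used in this training can be sketched in plain Python for a single preference pair (illustrative only; the actual training was done with Unsloth using the hyperparameters above, and the beta value below is an assumed default, not taken from this run):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    Inputs are the summed log-probabilities of each response under the
    trained policy and under the frozen reference (base) model.
    beta controls how far the policy may drift from the reference.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)): small when the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already favors the chosen response -> low loss
low = dpo_loss(-10.0, -30.0, ref_chosen_logp=-20.0, ref_rejected_logp=-20.0)
# Policy favors the rejected response -> high loss
high = dpo_loss(-30.0, -10.0, ref_chosen_logp=-20.0, ref_rejected_logp=-20.0)
print(low < high)  # True
```

Minimizing this loss pushes the policy to assign relatively more probability to the chosen (preferred) responses than the reference model does, which is how the preference dataset steers reasoning and output structure.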
Usage Considerations
Because the weights are fully merged, the model can be loaded directly with the transformers library; no PEFT or adapter setup is needed. Note that the model is distributed under the MIT License, per the dataset terms, and compliance with the original base model's license is also required.
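A minimal loading sketch with transformers (the model id is from above; the prompt and generation parameters are illustrative and not prescribed by this card):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mohtani777/Qwen3_4B_SFTV5_DPOv3_agent_v0_LR1E6"

# Merged 16-bit weights: loads like any standalone causal LM
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="bfloat16",
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain step by step why 17 is prime."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Keep prompts within the model's training context: the DPO run used a maximum sequence length of 1024 tokens.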
Ideal Use Cases
This model is particularly well-suited for applications where:
- High-quality, aligned responses are crucial.
- Complex reasoning and Chain-of-Thought capabilities are needed.
- Structured output generation is a priority.