takami2022/qwen3-4b-dpo-v2
The takami2022/qwen3-4b-dpo-v2 is a 4 billion parameter language model, based on the Qwen3 architecture, that has been further fine-tuned using Direct Preference Optimization (DPO). This model is a refinement of its v1 predecessor, specifically re-trained with a reduced DPO beta value of 0.05 to enhance its alignment. It is designed for tasks benefiting from improved preference alignment, building upon the Qwen3 base.
Loading preview...
Model Overview
The takami2022/qwen3-4b-dpo-v2 is a 4 billion parameter language model, representing a refined version of the takami2022/qwen3-4b-dpo-v1 base model. This iteration has undergone additional training using Direct Preference Optimization (DPO) to further enhance its performance and alignment.
Key Training Details
- Base Model:
takami2022/qwen3-4b-dpo-v1 - Optimization Method: Direct Preference Optimization (DPO)
- Epochs: 1
- Learning Rate: 1e-07
- DPO Beta Value: 0.05 (adjusted from 0.1 in the previous version)
- Maximum Sequence Length: 1024
- LoRA Configuration: r=16, alpha=32 (merged into the base model)
What's New in v2?
The primary difference in this version is the adjustment of the DPO beta parameter from 0.1 to 0.05. This change in the beta value typically influences the strength of the preference optimization, aiming for a potentially more nuanced or robust alignment based on the preference data.
Good For
- Applications requiring a 4B parameter model with enhanced alignment through DPO.
- Tasks where fine-grained control over preference optimization is beneficial.
- Further experimentation with DPO-tuned Qwen3-based models.