tomofusa/exp033-dpo-wd005-merged
tomofusa/exp033-dpo-wd005-merged is a 4-billion-parameter language model developed by tomofusa. It merges the results of an SFT stage and a DPO stage into a single checkpoint and ships with full 16-bit weights, so no adapter loading is needed. The DPO stage used a learning rate of 5e-07 and a beta of 0.1, targeting preference alignment on top of the SFT base.
Model Overview
tomofusa/exp033-dpo-wd005-merged is a 4-billion-parameter language model developed by tomofusa. It is a merged model that combines a Supervised Fine-Tuning (SFT) phase with a subsequent Direct Preference Optimization (DPO) phase. It is distributed with full 16-bit weights, so it can be loaded directly without a separate adapter, which simplifies deployment.
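Because the weights are already merged, loading should look like loading any ordinary checkpoint. The following is a minimal sketch, assuming the repository is compatible with the Hugging Face transformers library and exposes a causal language model head; neither is stated in this card. The helper names load_model and generate are hypothetical.

```python
MODEL_ID = "tomofusa/exp033-dpo-wd005-merged"


def load_model(model_id: str = MODEL_ID):
    """Load the merged checkpoint directly; no PEFT/adapter step is involved."""
    # Imported lazily so the heavy dependency is only needed when loading.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # torch_dtype="auto" keeps whichever 16-bit dtype the checkpoint stores.
    # Assumption: the repo works with AutoModelForCausalLM.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    return tokenizer, model


def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Hypothetical convenience wrapper: load the model and complete a prompt."""
    tokenizer, model = load_model()
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling generate("Hello") would download the full 4-billion-parameter checkpoint on first use, so it is best tried on a machine with enough memory for 16-bit weights.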
Training Details
The model's training pipeline involved two main stages:
- SFT Phase: Initialized from tomofusa/exp015-blend-h-lora.
- DPO Phase: Further optimized on the u-10bei/dpo-dataset-qwen-cot dataset for one epoch. Key DPO configuration parameters include a learning rate of 5e-07, a beta value of 0.1, and the ipo loss type. LoRA was used during DPO with r=64 and alpha=128, and the maximum sequence length was 1024.
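To make the beta and loss-type settings concrete, here is a small sketch of the IPO objective for a single preference pair. It assumes the ipo loss type refers to the standard IPO formulation, in which the squared distance between the policy-versus-reference log-ratio margin and 1/(2*beta) is minimized. The log-probability values are made up for illustration, and the function name ipo_loss is hypothetical rather than taken from the author's training code.

```python
BETA = 0.1  # beta value reported for the DPO phase


def ipo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = BETA) -> float:
    """IPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more the policy prefers the chosen answer than the reference does.
    margin = (policy_chosen_logp - policy_rejected_logp) - (
        ref_chosen_logp - ref_rejected_logp
    )
    # IPO regresses the margin toward a fixed target instead of pushing it to infinity.
    target = 1.0 / (2.0 * beta)
    return (margin - target) ** 2


# Illustrative numbers only: the margin is 2.0 and the target is 5.0 for beta=0.1.
example = ipo_loss(-10.0, -12.0, -11.0, -11.0)
```

With beta set to 0.1 the target margin is 5, so a pair whose margin is already 5 contributes zero loss, while smaller or larger margins are both penalized.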
Key Characteristics
- Merged Architecture: Benefits from both SFT for foundational instruction following and DPO for preference alignment.
- Full 16-bit Weights: Ready-to-use without adapter loading.
- DPO Alignment: Specifically tuned for improved response quality and alignment with human preferences through Direct Preference Optimization.
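The "no adapter loading" property follows from folding the LoRA update into the base weights ahead of time. The sketch below shows that arithmetic on a toy matrix, assuming the conventional LoRA scaling of alpha / r (2.0 for the reported r=64 and alpha=128). It illustrates the general technique only; the author's actual merge procedure is not described in this card, and the tiny matrices and function names are hypothetical.

```python
LORA_R = 64
LORA_ALPHA = 128
SCALE = LORA_ALPHA / LORA_R  # conventional LoRA scaling, alpha / r


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_b, lora_a, scale=SCALE):
    """Return weight + scale * (lora_b @ lora_a), i.e. the merged weight matrix."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example with a rank-1 update (the real adapters use rank 64).
base = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]   # shape (2, 1)
a = [[0.5, 0.0]]     # shape (1, 2)
merged = merge_lora(base, b, a)
```

After merging, inference uses only the merged matrix, which is why the published checkpoint needs no adapter at load time.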