Sakura-SOLAR-Instruct-DPO-v2 Overview
Sakura-SOLAR-Instruct-DPO-v2 is a 10.7-billion-parameter language model developed by Kyujin Han (kyujinpy) as part of the LLM research consortium of Media Group Saramgwasup and Marker. It is an instruction-tuned variant that has been further aligned with Direct Preference Optimization (DPO).
Key Capabilities & Training
- DPO Fine-tuning: The model was fine-tuned with DPO on the argilla/distilabel-math-preference-dpo dataset, which suggests an emphasis on mathematical reasoning as well as preference alignment (see the sketch after this list).
- Benchmark Performance: On the Open LLM Leaderboard, Sakura-SOLAR-Instruct-DPO-v2 achieves an average score of 74.14. Notable scores include:
  - AI2 Reasoning Challenge (ARC): 70.90
  - HellaSwag: 88.41
  - MMLU: 66.48
  - TruthfulQA: 71.86
  - Winogrande: 83.43
  - GSM8k: 63.76
- Model Lineage: This version iterates on kyujinpy/Sakura-SOLAR-Instruct, showing slight improvements on some metrics such as MMLU and GSM8k.
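The card does not publish the training script, but the ingredients above (a DPO objective over argilla/distilabel-math-preference-dpo) map directly onto Hugging Face's TRL library. The sketch below shows how such a run could look; the base checkpoint, column mapping, and hyperparameters are illustrative assumptions, not the author's recipe.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Assumed starting point, per the lineage note above.
base_id = "kyujinpy/Sakura-SOLAR-Instruct"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# The preference dataset named in the model card. DPOTrainer expects
# "prompt", "chosen", and "rejected" columns; the source column names
# used in this mapping are assumptions and may need adjusting.
dataset = load_dataset("argilla/distilabel-math-preference-dpo", split="train")
dataset = dataset.rename_columns(
    {
        "instruction": "prompt",
        "chosen_response": "chosen",
        "rejected_response": "rejected",
    }
)

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="sakura-solar-dpo-v2", beta=0.1),
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` on older trl versions
)
trainer.train()
```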
Usage
This model is suitable for tasks requiring instruction following and general language understanding, with its DPO fine-tuning aimed at improved response quality and alignment. Its scores across the benchmarks above indicate versatility for a range of NLP applications.
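A minimal inference sketch with transformers follows. The model ID matches this card, while the prompt template is assumed from the SOLAR-Instruct lineage and should be verified against the upstream model card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kyujinpy/Sakura-SOLAR-Instruct-DPO-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Assumed SOLAR-Instruct-style "### User: / ### Assistant:" template.
prompt = "### User:\nSolve 12 * 17 and explain your steps briefly.\n\n### Assistant:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Strip the prompt tokens and print only the model's reply.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```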