kyujinpy/Sakura-SOLAR-Instruct-DPO-v2

Parameters: 10.7B
Precision: FP8
Context length: 4096
Released: Dec 24, 2023
License: cc-by-nc-sa-4.0
Source: Hugging Face

Sakura-SOLAR-Instruct-DPO-v2 Overview

Sakura-SOLAR-Instruct-DPO-v2 is a 10.7-billion-parameter language model developed by Kyujin Han (kyujinpy) within an LLM research consortium formed with Media Group Saramgwasup and Marker. It is an instruction-tuned variant of the Sakura-SOLAR line, further aligned using Direct Preference Optimization (DPO).

Key Capabilities & Training

  • DPO Fine-tuning: The model was fine-tuned with DPO on the argilla/distilabel-math-preference-dpo dataset, which suggests an emphasis on mathematical reasoning and preference alignment; a hypothetical sketch of this stage appears after this list.
  • Benchmark Performance: On the Open LLM Leaderboard, Sakura-SOLAR-Instruct-DPO-v2 achieves an average score of 74.14. Notable scores include:
    • AI2 Reasoning Challenge (ARC): 70.90
    • HellaSwag: 88.41
    • MMLU: 66.48
    • TruthfulQA: 71.86
    • Winogrande: 83.43
    • GSM8K: 63.76
  • Model Lineage: This version iterates on kyujinpy/Sakura-SOLAR-Instruct, showing slight gains on some metrics such as MMLU and GSM8K.
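
The exact training script is not published; the following is a minimal, hypothetical sketch of what the DPO stage could look like using trl's DPOTrainer with the cited dataset. The column mapping, hyperparameters (e.g., beta), and output directory are assumptions, not the author's settings.

```python
# Hypothetical reconstruction of the DPO stage using trl; not the author's script.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "kyujinpy/Sakura-SOLAR-Instruct"  # predecessor model (see Model Lineage)

model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# The dataset provides (instruction, chosen_response, rejected_response) columns;
# DPOTrainer expects (prompt, chosen, rejected). Column names are an assumption.
dataset = load_dataset("argilla/distilabel-math-preference-dpo", split="train")
dataset = dataset.rename_columns({
    "instruction": "prompt",
    "chosen_response": "chosen",
    "rejected_response": "rejected",
})

# beta controls how far the policy may drift from the reference model;
# 0.1 is a common default, not a documented choice for this model.
args = DPOConfig(output_dir="sakura-solar-dpo-v2", beta=0.1, per_device_train_batch_size=1)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer=` in older trl releases
)
trainer.train()
```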

Usage

This model is suited to tasks requiring instruction following and general language understanding, and its DPO fine-tuning targets improved response quality and preference alignment; the math-focused preference data may also help on quantitative prompts. Its scores across the benchmarks above indicate versatility for a range of NLP applications. A minimal loading example follows.
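
The card does not pin down a prompt template; below is a minimal inference sketch using the standard transformers API. The "### User:" / "### Assistant:" format is an assumption carried over from the upstream SOLAR-Instruct family, and fp16 is an assumed local default (the hosted endpoint above serves FP8).

```python
# Minimal inference sketch; the prompt template is assumed, not confirmed by the card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "kyujinpy/Sakura-SOLAR-Instruct-DPO-v2"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.float16,  # fp16 as a safe local default
    device_map="auto",
)

prompt = "### User:\nWhat is 17 * 24?\n\n### Assistant:\n"  # assumed SOLAR-style template
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```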