kyujinpy/Sakura-SOLRCA-Math-Instruct-DPO-v2

Hugging Face
Text Generation · Concurrency cost: 1 · Model size: 10.7B · Quant: FP8 · Context length: 4k · Published: Dec 26, 2023 · License: cc-by-nc-sa-4.0 · Architecture: Transformer · Open weights

The Sakura-SOLRCA-Math-Instruct-DPO-v2 is a 10.7 billion parameter instruction-tuned causal language model developed by Kyujin Han (kyujinpy). This model is fine-tuned using the DPO method on mathematical and reasoning datasets, including Intel/orca_dpo_pairs and argilla/distilabel-math-preference-dpo. It is optimized for mathematical reasoning and general instruction following, achieving a 63.91% score on GSM8K and a 74.17% average across various benchmarks.


Model Overview

The kyujinpy/Sakura-SOLRCA-Math-Instruct-DPO-v2 is a 10.7 billion parameter instruction-tuned language model developed by Kyujin Han (kyujinpy). It was fine-tuned with Direct Preference Optimization (DPO) on a combination of preference datasets, specifically Intel/orca_dpo_pairs and argilla/distilabel-math-preference-dpo. A merged version of these datasets, kyujinpy/orca_math_dpo, was used for training.
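
DPO optimizes the policy directly on preference pairs, with no separate reward model. A minimal sketch of the standard per-pair DPO loss (illustrative only, not kyujinpy's training code; the argument names are assumptions):

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is the summed token log-probability of a full response
    under the trained policy or the frozen reference model.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Before the policy moves away from the reference, both margins are zero,
# so the loss starts at -log(0.5).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))  # → 0.6931471805599453
```

Minimizing this loss pushes the policy to assign relatively more probability to the chosen response than to the rejected one, scaled by `beta`.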

Key Capabilities & Performance

This model is primarily designed for mathematical reasoning and general instruction following. Its performance is highlighted by its benchmark results on the Open LLM Leaderboard:

  • Average Score: 74.17%
  • GSM8K (5-shot): 63.91% (indicating strong mathematical problem-solving ability)
  • MMLU (5-shot): 66.13%
  • HellaSwag (10-shot): 88.52%
  • ARC (25-shot): 71.25%

The v2 iteration improves slightly on its predecessor, Sakura-SOLRCA-Math-Instruct-DPO-v1, most notably in overall average and GSM8K scores.

When to Use This Model

This model is particularly well-suited for applications requiring:

  • Mathematical problem-solving and reasoning.
  • General instruction-following tasks where accuracy and logical coherence are important.
  • Research and development in DPO-based fine-tuning for specialized tasks.
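
For local experimentation, the checkpoint loads through the standard Hugging Face `transformers` API. A minimal sketch (the model id comes from this card; the dtype, device, and helper names are assumptions):

```python
MODEL_ID = "kyujinpy/Sakura-SOLRCA-Math-Instruct-DPO-v2"

def load_model(device: str = "cuda"):
    """Download and load the model and tokenizer (roughly 21 GB of weights at fp16)."""
    # Lazy import so the module can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="auto",  # keep the checkpoint's native precision
        device_map=device,
    )
    return model, tokenizer

def solve(prompt: str, model, tokenizer, max_new_tokens: int = 256) -> str:
    """Generate a completion and return only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```

For math prompts, stating the problem plainly (e.g. "Solve step by step: ...") plays to the model's GSM8K-style training.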

Popular Sampler Settings

Featherless surfaces the three parameter combinations most commonly used with this model. Each configuration sets the following sampling parameters: temperature, top_p, top_k, frequency_penalty, presence_penalty, repetition_penalty, and min_p.
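
As a sketch of how these parameters fit together, here is an illustrative configuration dictionary. The values are placeholders chosen for demonstration, NOT the actual Featherless user configs:

```python
# Illustrative sampling configuration; all values are placeholder assumptions.
sampler_config = {
    "temperature": 0.7,         # softens/sharpens the output distribution
    "top_p": 0.9,               # nucleus sampling: smallest token set with cum. prob >= 0.9
    "top_k": 40,                # restrict sampling to the 40 most likely tokens
    "frequency_penalty": 0.0,   # penalize tokens by how often they already appeared
    "presence_penalty": 0.0,    # penalize tokens that appeared at all
    "repetition_penalty": 1.1,  # multiplicative penalty on repeated tokens
    "min_p": 0.05,              # drop tokens below 5% of the top token's probability
}
```

With Hugging Face `transformers`, temperature, top_p, top_k, and repetition_penalty can be passed directly as `model.generate(...)` keyword arguments; frequency_penalty and presence_penalty are OpenAI-style API parameters supported by hosted endpoints rather than by `generate` itself.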