M-DIE-M-10.7B: Korean-Optimized Instruction Model
M-DIE-M-10.7B is a 10.7-billion-parameter instruction-tuned language model developed by Ados on top of the upstage/SOLAR-10.7B-Instruct-v1.0 base. Its primary differentiator is its strong Korean focus: 73% of the training data is Korean, alongside 24% English and 3% other languages.
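For orientation, here is a minimal loading sketch using Hugging Face transformers. The repo id `Ados/M-DIE-M-10.7B` and the precision/device settings are assumptions based on the model name and size, not details confirmed by the card; verify the actual id on the Hub.

```python
# Minimal sketch: loading the model with Hugging Face transformers.
# The repo id below is an assumption inferred from the model name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Ados/M-DIE-M-10.7B"  # assumed Hub repo id; verify before use

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # 10.7B params: half precision to fit on a single large GPU
    device_map="auto",          # let accelerate place layers across available devices
)
```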
Key Capabilities & Training
- Korean Language Proficiency: Optimized for Korean through a custom-curated dataset, making it highly effective for Korean-centric applications.
- Diverse Instruction Following: Trained on a varied dataset including:
  - Single-turn QA (Alpaca style): 29%
  - Multi-turn QA (Vicuna style): 21%
  - Instructed QA: 26%
  - Summarization: 12%
  - Translation: 12%
- Data Quality: The training data was carefully curated: roughly the highest-quality 30% of rows were manually selected, duplicates were removed, and the remainder was refined to fix issues such as malformed code blocks, broken list formatting, and repetition.
- Prompt Template: Uses a sectioned prompt format with `### System:` and `### User:` headers, and identifies itself as "OLLM (오름) by Ados (주식회사아도스)"; see the sketch after this list.
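A minimal sketch of that template follows. Only the `### System:` and `### User:` markers are documented here; the `### Assistant:` continuation marker, the newline layout, and the exact wording of the self-identification line are assumptions (the header style follows the SOLAR family's convention).

```python
# Sketch of the documented "### section" prompt template.
# Assumed: the "### Assistant:" marker, spacing, and persona wording.
SYSTEM_PERSONA = (
    "You are OLLM (오름) by Ados (주식회사아도스), a helpful assistant."
)  # assumed phrasing of the model's self-identification

def build_prompt(user_message: str, system: str = SYSTEM_PERSONA) -> str:
    """Assemble a single-turn prompt in the ### section format."""
    return f"### System:\n{system}\n\n### User:\n{user_message}\n\n### Assistant:\n"

prompt = build_prompt("대한민국의 수도는 어디인가요?")  # "What is the capital of South Korea?"
print(prompt)
```

One plausible decoding loop is to tokenize this string, call `model.generate` on it, and truncate the output at the next `###` marker; the card does not specify stop tokens, so treat that as an assumption as well.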
Licensing
The model is released under the CC-BY-NC-4.0 license, inherited from its base model and required by the inclusion of non-commercial datasets such as Alpaca in the fine-tuning mix.
Good For
- Applications requiring strong performance in Korean language understanding and generation.
- Building AI assistants or chatbots for Korean-speaking users.
- Tasks involving Korean QA, summarization, and multi-turn conversations.