Overview
beomi/gemma-mling-7b is an 8.5-billion-parameter Gemma-based model developed by Junbum Lee and Taekyoon Choi. It was continually pretrained on a multilingual corpus focused on Korean, English, Chinese, and Japanese, with additional coverage of 500 other languages. The training mixture totals 486.7 GB and includes sampled Llama2ko datasets (Korean and English), SkyPile-150B (Chinese), cc100-ja (Japanese), and Glot500 (broad multilingual coverage).
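As context for loading the checkpoint, here is a minimal sketch using Hugging Face transformers; the torch_dtype and device_map settings are optional conveniences, not requirements stated by the model card:

```python
# Minimal sketch: load beomi/gemma-mling-7b with Hugging Face transformers
# (assumes a transformers release with Gemma support, i.e. >= 4.38).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/gemma-mling-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves memory vs. float32 on supported GPUs
    device_map="auto",           # requires accelerate; places weights automatically
)
```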
Key Capabilities
- Multilingual Text Generation: Generates text in Korean, English, Chinese, and Japanese, as well as the other languages represented in its 500-language training corpus (see the sketch after this list).
- Causal Language Modeling: A base (pretrained, non-instruction-tuned) causal language model, suitable as a starting point for downstream NLP tasks via prompting or fine-tuning.
- Research and Development: Serves as a foundation for NLP research and development, allowing experimentation with new techniques and algorithms.
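As a rough illustration of the multilingual generation capability above, the following sketch runs greedy continuations in the four focus languages. The prompts are illustrative only, and because this is a base model the outputs are plain text continuations rather than instruction-following answers:

```python
# Minimal sketch: greedy text continuation in the model's four focus
# languages. Prompts are illustrative, not from the model card.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="beomi/gemma-mling-7b",
    device_map="auto",
)

prompts = [
    "대한민국의 수도는",         # Korean: "The capital of South Korea is"
    "The capital of France is",  # English
    "中国的首都是",              # Chinese: "The capital of China is"
    "日本の首都は",              # Japanese: "The capital of Japan is"
]

for prompt in prompts:
    out = generator(prompt, max_new_tokens=20, do_sample=False)
    print(out[0]["generated_text"])
```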
Evaluation Highlights
The model was evaluated with EleutherAI/lm-evaluation-harness and the jp-stable branch of Stability-AI/lm-evaluation-harness. Notable scores include 61.76 on MMLU (5-shot) and 42.75 on KMMLU (exact match), alongside strong results on the XCOPA and Japanese (JP Eval Harness) task suites, reflecting its multilingual proficiency.
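A hedged sketch of how the MMLU score might be reproduced via the harness's Python API follows; the simple_evaluate entry point and the mmlu task name are assumptions tied to lm-eval v0.4+, and the Japanese tasks additionally require the Stability-AI jp-stable branch:

```python
# Hedged sketch: score MMLU (5-shot) with EleutherAI's lm-evaluation-harness.
# Assumes lm-eval >= 0.4 (pip install lm-eval); task names and API details
# vary between harness versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=beomi/gemma-mling-7b,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"]["mmlu"])
```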
Intended Usage
This model is suitable for a wide range of applications, including content creation (text generation, code, marketing copy), research (NLP research, knowledge exploration), and educational tools (language learning, grammar correction). Its multilingual focus makes it particularly valuable for applications targeting diverse linguistic audiences.