yanolja/YanoljaNEXT-EEVE-10.8B

Text Generation | Concurrency Cost: 1 | Model Size: 15B | Quant: FP8 | Ctx Length: 8k | Tool Calling: Supported | Published: Feb 7, 2024 | License: apache-2.0 | Architecture: Transformer | 0.1K | Open Weights | Cold

YanoljaNEXT-EEVE-10.8B is a 10.8 billion parameter language model developed by Yanolja, based on the SOLAR-10.7B-v1.0 architecture. This model is specifically optimized for Korean language tasks through extensive vocabulary expansion and fine-tuning on Korean web-crawled datasets. It leverages a seven-stage training process with parameter freezing to efficiently adapt foundational English models to Korean, enhancing cross-linguistic applicability. While excelling in Korean language understanding, it requires further instruction-based fine-tuning for specific applications.


YanoljaNEXT-EEVE-10.8B: Korean Vocabulary-Extended Language Model

YanoljaNEXT-EEVE-10.8B is a 10.8 billion parameter model built upon upstage/SOLAR-10.7B-v1.0 and designed to improve Korean language capabilities. Developed by Yanolja, the model has its vocabulary extended with Korean tokens and is then fine-tuned on diverse Korean web-crawled datasets.

Key Technical Approach

The model is trained with a seven-stage process that uses parameter freezing to adapt a foundational English model to Korean. The process pre-trains the embeddings of the new Korean tokens and partially fine-tunes the lm_head embeddings of existing tokens, while the base model's original parameters are preserved. This extends the vocabulary to cover Korean and is intended to carry the base model's knowledge and reasoning over from English to Korean. More details are available in the technical report: Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models.
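The freezing idea can be illustrated with a toy example. The following is a minimal sketch, not the authors' training code: it uses plain Python lists in place of a real embedding matrix, and the names (`sgd_step`, `trainable_rows`) and all numbers are hypothetical. It shows a single gradient step in which only the rows for newly added tokens are updated, while the rows of the original vocabulary are left unchanged.

```python
# Toy illustration of parameter freezing during vocabulary expansion.
# All names and values are hypothetical; this is not the EEVE training code.

def sgd_step(embeddings, grads, trainable_rows, lr=0.1):
    """Return a new embedding table where only `trainable_rows` are updated."""
    updated = []
    for row_idx, (row, grad) in enumerate(zip(embeddings, grads)):
        if row_idx in trainable_rows:
            # Trainable row: apply a plain SGD update.
            updated.append([w - lr * g for w, g in zip(row, grad)])
        else:
            # Frozen row: copy unchanged, ignoring its gradient.
            updated.append(list(row))
    return updated

# Rows 0-1 stand for the original vocabulary, rows 2-3 for new Korean tokens.
old_vocab_size = 2
embeddings = [[1.0, 1.0], [2.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
grads = [[0.5, 0.5], [0.5, 0.5], [1.0, -1.0], [-2.0, 2.0]]
new_rows = set(range(old_vocab_size, len(embeddings)))

stepped = sgd_step(embeddings, grads, trainable_rows=new_rows)
```

In a real framework the same effect is usually achieved by disabling gradients for frozen tensors or masking the gradient of selected rows; which parameters are trainable would change from stage to stage.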

Training Details

  • Vocabulary Expansion: 8,960 Korean tokens were added to the vocabulary. They were selected through iterative tokenizer training, frequency analysis on a Korean web corpus, and manual curation, so that the added tokens are both frequent and relevant.
  • Biased Training Data: Training data was intentionally biased to include more texts with new tokens, facilitating effective learning of the expanded vocabulary.
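The two points above can be sketched in a few lines. This is only an illustration under stated assumptions: the token counts, the texts, the `boost` factor, and the helper names `select_new_tokens` and `sampling_weights` are all hypothetical, the manual curation step is omitted, and substring matching stands in for real tokenization.

```python
import random
from collections import Counter

def select_new_tokens(candidate_counts, existing_vocab, top_k):
    """Pick the top_k most frequent candidate tokens not already in the vocabulary."""
    fresh = Counter(
        {tok: cnt for tok, cnt in candidate_counts.items() if tok not in existing_vocab}
    )
    return [tok for tok, _ in fresh.most_common(top_k)]

def sampling_weights(texts, new_tokens, boost=3.0):
    """Weight a text by `boost` if it contains any new token, otherwise by 1.0."""
    return [boost if any(tok in text for tok in new_tokens) else 1.0 for text in texts]

# Hypothetical corpus statistics: candidate token -> frequency.
candidate_counts = {"안녕": 50, "하세요": 40, "the": 90, "여행": 30, "숙소": 5}
existing_vocab = {"the"}
new_tokens = select_new_tokens(candidate_counts, existing_vocab, top_k=3)

# Texts containing a selected token are sampled more often than the rest.
texts = ["안녕 하세요", "the hotel", "여행 계획", "숙소 예약"]
weights = sampling_weights(texts, new_tokens)

rng = random.Random(0)  # fixed seed so the draw is reproducible
batch = rng.choices(texts, weights=weights, k=8)
```

Weighted sampling is just one way to skew a corpus; filtering or duplicating documents would have a similar effect, and the source does not say which mechanism was used.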

Usage and Limitations

While the model demonstrates strong performance in Korean language tasks, it has not been fine-tuned with instruction-based training. Users should consider further fine-tuning for specific instruction-following applications.
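Because the model is not instruction-tuned, it is most naturally used for plain text continuation rather than chat-style prompting. The sketch below assumes that the `transformers` and `torch` packages are installed and that the repository id from this page's title resolves on the Hugging Face Hub; the helper names `build_prompt` and `complete` are hypothetical, and hardware or quantization settings are left at library defaults.

```python
# Minimal sketch of plain-text completion with a base (non-instruct) model.
# Assumption: the repository id below, taken from this page's title, is
# available on the Hugging Face Hub.

MODEL_ID = "yanolja/YanoljaNEXT-EEVE-10.8B"

def build_prompt(text):
    """Base models continue text, so pass the raw passage without a chat template."""
    return text.strip()

def complete(prompt, max_new_tokens=64):
    """Load the model and return the prompt followed by its continuation."""
    # Imported lazily so the helpers above work without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(build_prompt(prompt), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `complete("한국의 수도는")` would download the weights and generate a continuation; for instruction-following behavior, the model would first need the additional fine-tuning described above.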