Name: yanolja/YanoljaNEXT-EEVE-10.8B API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: yanolja

YanoljaNEXT-EEVE-10.8B: Korean Vocabulary-Extended Language Model

YanoljaNEXT-EEVE-10.8B is a 10.8-billion-parameter model built on upstage/SOLAR-10.7B-v1.0 and designed to strengthen Korean language capabilities. Developed by Yanolja, it extends the base model's vocabulary with Korean tokens and is fine-tuned on diverse Korean web-crawled datasets.

Key Technical Approach

The model uses a seven-stage training process with parameter freezing to adapt a foundational English model to Korean. Training pre-trains the embeddings of the new Korean tokens and partially fine-tunes the lm_head embeddings of existing tokens, while the base model's original parameters are preserved. This extends the vocabulary to cover Korean and carries the base model's English knowledge and reasoning over to Korean. More details are available in the authors' technical report: Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models.
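
The freezing idea described above can be pictured with a toy update rule. This is a minimal sketch and not the authors' training code: plain Python lists stand in for an embedding matrix, and the names masked_sgd_step, old_vocab_size, and the toy values are hypothetical. A real run would use a deep learning framework and follow the stages in the technical report.

```python
def masked_sgd_step(embeddings, grads, trainable_ids, lr=0.1):
    """Apply one SGD step only to the rows in trainable_ids.

    Rows for all other token ids are copied unchanged, which is what
    "freezing" means in this toy setting.
    """
    trainable = set(trainable_ids)
    updated = []
    for token_id, (row, grad_row) in enumerate(zip(embeddings, grads)):
        if token_id in trainable:
            updated.append([w - lr * g for w, g in zip(row, grad_row)])
        else:
            updated.append(list(row))  # frozen: ignore the gradient
    return updated


# Hypothetical toy setup: ids 0-1 play the role of original tokens,
# ids 2-3 play the role of newly added Korean tokens.
old_vocab_size = 2
embeddings = [[1.0, 1.0], [2.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
grads = [[0.5, 0.5] for _ in embeddings]

new_token_ids = range(old_vocab_size, len(embeddings))
stepped = masked_sgd_step(embeddings, grads, new_token_ids, lr=0.1)
```

Only the rows for the new token ids move. The rows for the original tokens come back unchanged even though their gradients are non-zero.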

Training Details

Vocabulary Expansion: 8,960 Korean tokens were selected by frequency from a Korean web corpus, through rounds of tokenizer training, manual curation, and frequency analysis, so that the added vocabulary is relevant to real Korean text.
Biased Training Data: The training data was deliberately skewed toward texts containing the new tokens, so that the expanded vocabulary is seen often enough to be learned.
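
The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure from the technical report: the threshold min_count, the boost factor, the toy documents, and the helper names select_frequent_tokens and sampling_weights are all hypothetical, and the listing does not say how the data was actually biased.

```python
import random
from collections import Counter


def select_frequent_tokens(candidate_counts, existing_vocab, min_count):
    """Keep candidate tokens that are frequent enough and not already in the vocabulary."""
    return sorted(
        tok
        for tok, count in candidate_counts.items()
        if count >= min_count and tok not in existing_vocab
    )


def sampling_weights(tokenized_docs, new_tokens, boost=3.0):
    """Give documents that contain any new token a higher sampling weight."""
    new = set(new_tokens)
    return [boost if new.intersection(doc) else 1.0 for doc in tokenized_docs]


# Hypothetical toy corpus, already split into candidate tokens.
docs = [
    ["여행", "숙소", "the"],
    ["여행", "예약"],
    ["the", "the"],
    ["숙소", "the"],
]
existing_vocab = {"the"}

counts = Counter(tok for doc in docs for tok in doc)
new_tokens = select_frequent_tokens(counts, existing_vocab, min_count=2)
weights = sampling_weights(docs, new_tokens, boost=3.0)

# Draw a training sample that over-represents documents with new tokens.
rng = random.Random(0)
sample = rng.choices(docs, weights=weights, k=8)
```

In this toy corpus, "예약" appears only once and is dropped by the threshold, while the document made up entirely of existing tokens keeps the baseline weight of 1.0.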

Usage and Limitations

The model performs strongly on Korean language tasks, but it has not been instruction-tuned. Users who need instruction-following behavior should plan on further fine-tuning.
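
Because the model is a base model, it continues text rather than following instructions. Short of fine-tuning, one common workaround is to frame a task as a few-shot completion prompt. The helper below is a hypothetical sketch: the function name build_few_shot_prompt and the "질문"/"답변" (question/answer) labels are assumptions, not a template from the model card.

```python
def build_few_shot_prompt(examples, query, input_label="질문", output_label="답변"):
    """Format worked examples plus a final open slot for the model to complete."""
    blocks = [f"{input_label}: {q}\n{output_label}: {a}" for q, a in examples]
    # The last block ends right after the output label, so a base model
    # is nudged to continue with an answer in the same pattern.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)


examples = [
    ("대한민국의 수도는 어디인가요?", "서울입니다."),
    ("일본의 수도는 어디인가요?", "도쿄입니다."),
]
prompt = build_few_shot_prompt(examples, "프랑스의 수도는 어디인가요?")
```

The resulting string would be sent as a plain completion prompt. Output quality still depends on the examples chosen, and dedicated fine-tuning remains the more dependable route for instruction-following applications.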
