uzlm/alloma-8B-Instruct: Uzbek-Optimized Llama Model
This model is an 8-billion parameter instruction-tuned variant of the Llama series, specifically optimized for the Uzbek language. Developed by Examy.me and Teamwork.uz, it addresses the need for efficient Uzbek language processing in large language models.
Key Differentiators & Capabilities
- Efficient Uzbek Tokenization: Features a customized tokenizer that averages 1.7 tokens per Uzbek word, significantly outperforming the original Llama tokenizer (~3.5 tokens per word). This yields approximately 2x faster inference and a longer effective context length for Uzbek text.
- Vocabulary Adaptation: Utilizes an in-place vocabulary replacement strategy on the meta-llama/Llama-3.1 base model. Less relevant non-ASCII tokens were replaced with custom Uzbek tokens, preserving the original architecture and total vocabulary size.
- Bilingual Pretraining: Continually pretrained on a 3.6-billion-token dataset (67% English, 33% Uzbek), followed by Supervised Fine-Tuning (SFT).
- Strong Uzbek Performance: Consistently outperforms base Llama models and other Uzbek-optimized models (such as Behbudiy Mistral 7B Uz and Behbudiy Llama 8B Uz) on translation benchmarks (BLEU and COMET for Uz→En and En→Uz) and Uzbek sentiment analysis. For instance, alloma-8B-Instruct achieves 31.16 BLEU on Uz→En versus 24.23 for Llama-3.1 8B Instruct.
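The tokenizer-fertility figures above (tokens per word) can be measured with a short script. A minimal sketch follows; the helper accepts any callable mapping text to a token list (in practice, a Hugging Face tokenizer's `tokenize` method), and the toy bigram tokenizer below is purely illustrative, not the model's actual tokenizer:

```python
# Sketch: computing tokenizer fertility (average tokens per whitespace word).
# Pass any callable that maps text -> list of tokens; real usage would pass
# a loaded tokenizer's tokenize method (not shown here).

def fertility(tokenize, text: str) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)

# Toy character-bigram "tokenizer" standing in for a real subword tokenizer:
def toy_tokenize(text: str):
    stripped = text.replace(" ", "")
    return [stripped[i:i + 2] for i in range(0, len(stripped), 2)]

print(fertility(toy_tokenize, "salom dunyo"))  # 5 bigrams / 2 words = 2.5
```

Running the same helper with the alloma tokenizer versus the stock Llama tokenizer over an Uzbek corpus is how the 1.7-vs-3.5 comparison would be reproduced.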
Considerations
While excelling at Uzbek-specific tasks, the model shows a slight decline on English MMLU and Uzbek News Classification relative to its base Llama counterpart, attributed to catastrophic forgetting during Uzbek optimization. For optimal tokenizer fertility, users should preprocess Uzbek input by replacing apostrophes with the placeholder "APST".
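The recommended preprocessing step can be sketched as below. Note that the card does not enumerate which apostrophe variants the tokenizer expects, so the set of characters normalized here (straight quote, right single quote, and the modifier letters common in Uzbek Latin orthography) is an assumption:

```python
# Sketch: replacing apostrophe-like characters with the "APST" placeholder
# before tokenization, per the model card's recommendation. The exact set of
# characters to normalize is an assumption, not specified by the card.

APOSTROPHES = ["'", "\u2019", "\u02bb", "\u02bc"]  # ' ' ʻ ʼ

def preprocess_uzbek(text: str) -> str:
    for ch in APOSTROPHES:
        text = text.replace(ch, "APST")
    return text

print(preprocess_uzbek("o\u02bbzbek tili"))  # -> "oAPSTzbek tili"
```

The same replacement would presumably need to be inverted ("APST" back to an apostrophe) when post-processing model output for display.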