uzlm/alloma-8B-Instruct: Uzbek-Optimized Llama Model
This model is an 8-billion parameter instruction-tuned variant of the Llama series, specifically optimized for the Uzbek language. Developed by Examy.me and Teamwork.uz, it addresses the need for efficient Uzbek language processing in large language models.
Key Differentiators & Capabilities
- Efficient Uzbek Tokenization: Features a customized tokenizer that averages 1.7 tokens per Uzbek word, significantly outperforming original Llama models (~3.5 tokens). This results in approximately 2x faster inference and a longer effective context length for Uzbek text.
- Vocabulary Adaptation: Uses an in-place vocabulary replacement strategy on the meta-llama/Llama-3.1 base model: less relevant non-ASCII tokens were replaced with custom Uzbek tokens, preserving the original architecture and total vocabulary size.
- Bilingual Pretraining: Continually pretrained on a 3.6-billion-token dataset comprising 67% English and 33% Uzbek data, followed by Supervised Fine-Tuning (SFT).
- Strong Uzbek Performance: Consistently outperforms base Llama models and other Uzbek-optimized models (such as Behbudiy Mistral 7B Uz and Behbudiy Llama 8B Uz) on translation benchmarks (BLEU and COMET for Uz→En and En→Uz) and Uzbek sentiment analysis. For instance, alloma-8B-Instruct achieves 31.16 BLEU Uz→En versus 24.23 for Llama-3.1 8B Instruct.
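The ~2x inference and context-length gains follow directly from the fertility figures above: fewer tokens per word means proportionally fewer decode steps and proportionally more words per fixed context window. A minimal sketch of that arithmetic (function names are illustrative, not part of the model's tooling):

```python
def fertility(num_tokens: int, num_words: int) -> float:
    """Average number of tokens the tokenizer emits per word."""
    return num_tokens / num_words

def relative_decode_speedup(base_fert: float, custom_fert: float) -> float:
    """Fewer tokens per word -> proportionally fewer decode steps,
    and proportionally more words fit in a fixed context window."""
    return base_fert / custom_fert

# Figures from the model card: ~3.5 tokens/word for the original
# Llama tokenizer vs. ~1.7 for the customized Uzbek tokenizer.
print(f"{relative_decode_speedup(3.5, 1.7):.2f}x")  # roughly 2x
```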
Considerations
While excelling in Uzbek-specific tasks, the model shows a slight decline in English MMLU and Uzbek News Classification compared to its base Llama counterpart, attributed to catastrophic forgetting during Uzbek optimization. Users should preprocess Uzbek input by replacing apostrophes with "APST" for optimal tokenizer fertility.
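The apostrophe preprocessing step above can be sketched as a simple string substitution. Note that the exact set of apostrophe characters is an assumption: the card only says "apostrophes", so this sketch covers the straight apostrophe plus common typographic variants used in Uzbek (as in oʻ and gʻ):

```python
# Apostrophe-like characters commonly found in Uzbek text.
# NOTE: this set is an assumption -- the model card only says
# "apostrophes"; extend it as needed for your data.
APOSTROPHES = ("'", "\u2019", "\u02bb", "\u02bc")

def preprocess_uzbek(text: str) -> str:
    """Replace apostrophe variants with the "APST" placeholder
    before tokenization, as the model card recommends."""
    for ch in APOSTROPHES:
        text = text.replace(ch, "APST")
    return text

print(preprocess_uzbek("o'zbek tili"))  # -> oAPSTzbek tili
```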