uzlm/alloma-8B-Base

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · License: llama3.2 · Architecture: Transformer

The uzlm/alloma-8B-Base is an 8 billion parameter base model, continually pretrained by Examy.me and Teamwork.uz, specifically optimized for the Uzbek language. It features a custom tokenizer that significantly improves efficiency for Uzbek text, enabling faster inference and longer effective context compared to standard Llama models. This model is designed as a foundational component for applications requiring strong performance in Uzbek language processing.


alloma-8B-Base: Uzbek-Optimized Llama Foundation Model

The alloma-8B-Base is an 8 billion parameter foundational model developed by Examy.me and Teamwork.uz. It is a continually pretrained version of a Llama 8B model, specifically engineered to enhance performance and efficiency for the Uzbek language.

Key Characteristics

  • Uzbek Optimization: Features a custom tokenizer that encodes Uzbek words in significantly fewer tokens (averaging 1.7 tokens per word) than the original Llama tokenizer (approximately 3.5 tokens per word). This yields roughly a 2x improvement in inference speed and a longer effective context length for Uzbek text.
  • Training Data: Continually pretrained on 3.6 billion tokens, with a distribution of 67% English and 33% Uzbek data.
  • Context Length: Supports a context length of 4096 tokens during its continual pretraining phase.
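The tokenizer figures above can be sanity-checked with some back-of-envelope arithmetic. This sketch uses the averages quoted in the card (1.7 vs. approximately 3.5 tokens per word, 4096-token training context); actual numbers will vary with the text being tokenized:

```python
# Back-of-envelope comparison of the custom alloma tokenizer vs. the
# original Llama tokenizer on Uzbek text, using the card's averages.

LLAMA_TOKENS_PER_WORD = 3.5   # original Llama tokenizer, Uzbek text
ALLOMA_TOKENS_PER_WORD = 1.7  # alloma custom tokenizer, Uzbek text
CONTEXT_TOKENS = 4096         # context length used in continual pretraining

def words_in_context(tokens_per_word: float, context: int = CONTEXT_TOKENS) -> int:
    """Approximate number of Uzbek words that fit in the context window."""
    return int(context / tokens_per_word)

# Fewer tokens per word means fewer decoding steps for the same text,
# so the speedup is roughly the ratio of the two averages (~2x).
speedup = LLAMA_TOKENS_PER_WORD / ALLOMA_TOKENS_PER_WORD
print(f"Llama:  ~{words_in_context(LLAMA_TOKENS_PER_WORD)} words per context")
print(f"alloma: ~{words_in_context(ALLOMA_TOKENS_PER_WORD)} words per context")
print(f"Approximate generation speedup for Uzbek: {speedup:.1f}x")
```

On these averages, the same 4096-token window holds roughly twice as many Uzbek words with the custom tokenizer, which is where the card's "2x" speed and "longer effective context" claims come from.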

Use Cases

This base model is intended for developers and researchers building applications that require robust language understanding and generation in Uzbek. It serves as a strong foundation for further fine-tuning into instruction-following models for specific Uzbek NLP tasks, offering better token efficiency and inference speed for Uzbek than standard Llama models.