Overview
bxod/Llama-3.2-1B-Instruct-uz is an experimental 1-billion-parameter instruction-tuned model built on the Llama-3.2 architecture. It underwent continual pretraining on 1.2 billion tokens (80% English, 20% Uzbek), followed by Supervised Fine-Tuning (SFT). A key innovation is its customized tokenizer, which encodes Uzbek text at approximately 1.7 tokens per word, a significant improvement over the ~3.5 tokens per word of the original Llama tokenizer. This optimization yields roughly 2x faster inference and a longer effective context length on Uzbek text.
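The tokens-per-word ratio above (often called tokenizer "fertility") is straightforward to measure yourself. The sketch below is a minimal, generic version: `tokenize` stands in for any real tokenizer's split step, e.g. the `tokenize` method of `AutoTokenizer.from_pretrained("bxod/Llama-3.2-1B-Instruct-uz")`; the sample sentences are illustrative only.

```python
# Sketch: measuring tokenizer fertility (tokens per whitespace-delimited word)
# over a sample corpus. Lower is better: fewer tokens per word means faster
# inference and more text fitting in the context window.
def tokens_per_word(texts, tokenize):
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# With a naive whitespace "tokenizer" fertility is 1.0 by construction; a real
# subword tokenizer on Uzbek text should land near the ~1.7 figure quoted above.
sample = ["Bugun havo juda yaxshi", "Men kitob o'qiyapman"]
print(tokens_per_word(sample, str.split))  # → 1.0
```

Running the same function with the custom tokenizer versus the stock Llama tokenizer on an Uzbek corpus is how the 1.7 vs. ~3.5 comparison can be reproduced.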
Key Capabilities and Performance
This model demonstrates strong performance in Uzbek-specific tasks, consistently outperforming its base Llama-3.2 counterparts in:
- Translation Benchmarks: Achieves higher BLEU and COMET scores for both Uzbek-to-English and English-to-Uzbek translation on FLORES+ datasets.
- Uzbek Sentiment Analysis: Shows improved accuracy in classifying sentiment in Uzbek text.
While optimized for Uzbek, the model shows a slight decline in MMLU (English) and News Classification scores relative to its base model, indicating some catastrophic forgetting of its original English instruction-following ability. It is designed to run efficiently, requiring only about 2 GB of VRAM with quantization, making it suitable for small GPUs, edge devices, and mobile applications.
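The ~2 GB figure is plausible from a back-of-the-envelope weight-memory estimate; the short sketch below computes it. Note this covers weights only (real usage adds activations, KV cache, and framework overhead), and the 4-bit/8-bit settings are illustrative assumptions, not a statement of which quantization scheme the authors used.

```python
# Rough lower bound on GPU memory for quantized model weights:
# params * bits / 8 bytes, expressed in GiB.
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    return n_params * bits_per_param / 8 / 1024**3

print(round(weight_memory_gb(1e9, 4), 2))  # 4-bit: ~0.47 GiB of weights
print(round(weight_memory_gb(1e9, 8), 2))  # 8-bit: ~0.93 GiB of weights
```

Either setting leaves ample headroom under 2 GB for the KV cache and activations, consistent with the edge-device claim.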
Usage and Recommendations
Users are advised to preprocess Uzbek input by replacing apostrophes with the sequence "APST", so that the text benefits from the optimized tokenizer's low fertility. This model is an experimental checkpoint, offering a foundation for further development within the Uzbek open-source community.
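The recommended preprocessing step can be sketched as below. Which apostrophe-like code points to cover is an assumption on our part (Uzbek text commonly mixes ʻ U+02BB, ʼ U+02BC, ’ U+2019, and the ASCII apostrophe); the card only says "apostrophes".

```python
# Sketch: replace apostrophe-like characters with "APST" before tokenization,
# as the model card recommends. The exact character set below is an assumption.
APOSTROPHES = ("\u02bb", "\u02bc", "\u2019", "'")

def preprocess_uzbek(text: str) -> str:
    for ch in APOSTROPHES:
        text = text.replace(ch, "APST")
    return text

print(preprocess_uzbek("o'zbek tili"))  # → oAPSTzbek tili
```

Presumably the same substitution was applied to the training data, so skipping it at inference time would push input text off the tokenizer's expected distribution.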