Model Overview
tartuNLP/Llama-3.1-EstLLM-8B-0525 is an 8-billion-parameter base text completion model developed by TartuNLP and TalTechNLP and funded by the Estonian Ministry of Education and Research. It was produced by continued pre-training of meta-llama/Llama-3.1-8B on approximately 35 billion additional tokens. The training data comprises the Estonian National Corpus (8.6B tokens), Python-Edu (3.3B tokens), FineMath4-Plus (9.5B tokens), General Instruction-Augmented Corpora (7.4B tokens), and Cosmopedia v2 (6.9B tokens).
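As a quick orientation, the snippet below loads the model for plain text completion with Hugging Face transformers. The bf16 dtype and device_map settings are assumptions about typical single-GPU inference, not requirements stated in this card.

```python
# Minimal text-completion example with Hugging Face transformers.
# dtype and device placement are assumptions; adjust for your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tartuNLP/Llama-3.1-EstLLM-8B-0525"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Base model: supply a plain text prefix to be continued, not a chat prompt.
prompt = "Eesti keel kuulub soome-ugri keelte hulka ja"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```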
Key Capabilities
- Enhanced Estonian Language Proficiency: Significantly improved performance on various Estonian benchmarks, including belebele-et, exam-et, grammar-et, inflection-et, trivia-et, winogrande-et, xcopa-et, and GlobalPIQA-et, often surpassing its Llama 3.1 base and other comparable models.
- Multilingual Support: While primarily focused on Estonian, the model retains strong English language capabilities, as evidenced by its performance on belebele-en, MMLU-Redux, and winogrande.
- Translation Performance: Achieves competitive BLEU scores for Estonian-to-English and English-to-Estonian translation (see the few-shot prompting sketch after this list).
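Because this is a base completion model, translation is typically elicited with few-shot prompting rather than an instruction. The sketch below reuses the model and tokenizer from the loading example above; the prompt template and example sentence pairs are illustrative, not taken from the card's evaluation setup.

```python
# Few-shot Estonian-to-English translation prompt for a base completion model.
# Example pairs are illustrative; reuses `tokenizer` and `model` from above.
few_shot = (
    "Estonian: Tere hommikust!\nEnglish: Good morning!\n\n"
    "Estonian: Ilm on täna ilus.\nEnglish: The weather is nice today.\n\n"
    "Estonian: Ma õpin eesti keelt.\nEnglish:"
)
inputs = tokenizer(few_shot, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
# Decode only the newly generated tokens, i.e. the model's translation.
print(tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
))
```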
Good For
- Fine-tuning for Estonian NLP tasks: This model is explicitly designed as a base model for further fine-tuning on downstream tasks that require strong Estonian language understanding and generation (a minimal fine-tuning sketch follows this list).
- Research and Development: Ideal for researchers exploring continued pre-training techniques and developing specialized LLMs for less-resourced languages like Estonian.
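As a starting point for the fine-tuning use case above, here is a minimal LoRA sketch using peft and the transformers Trainer. The dataset name my_estonian_corpus is hypothetical, and the hyperparameters and target modules are common defaults rather than values recommended by the authors.

```python
# Minimal LoRA fine-tuning sketch with peft + transformers Trainer.
# Dataset name, hyperparameters, and target modules are assumptions.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "tartuNLP/Llama-3.1-EstLLM-8B-0525"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach low-rank adapters to the attention projections (a common Llama choice).
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# "my_estonian_corpus" is a hypothetical dataset with a "text" column.
dataset = load_dataset("my_estonian_corpus", split="train")

def tokenize(batch):
    # Cap sequences at the 4096-token length used in continued pre-training.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="estllm-lora",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           num_train_epochs=1, bf16=True, logging_steps=10),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```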
Limitations
- Base Model: It is a base text completion model and not instruction-tuned, meaning it is not suitable for direct chat or instruction-following without further fine-tuning.
- Context Size: The continued pre-training used a sequence length of 4096 tokens, so the effective context window may be shorter than that of models trained with longer sequences (see the truncation sketch below).
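Given the 4096-token training window, one conservative approach is to truncate inputs so that the prompt plus the generation budget stays within it. A minimal sketch, reusing the model and tokenizer from the loading example above; the 256-token output budget is an arbitrary choice.

```python
# Keep prompt + generation within the 4096-token training window.
MAX_CONTEXT = 4096        # sequence length used in continued pre-training
OUTPUT_BUDGET = 256       # arbitrary reservation for generated tokens

long_document = "Pikk eestikeelne tekst ..."  # hypothetical long input
inputs = tokenizer(long_document, return_tensors="pt",
                   truncation=True,
                   max_length=MAX_CONTEXT - OUTPUT_BUDGET).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=OUTPUT_BUDGET)
```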