tartuNLP/Llama-3.1-EstLLM-8B-0525
tartuNLP/Llama-3.1-EstLLM-8B-0525 is an 8-billion-parameter causal language model developed by TartuNLP and TalTechNLP through continued pre-training of meta-llama/Llama-3.1-8B on approximately 35 billion additional tokens, with a data mix emphasizing Estonian text, Python code, and mathematics. It is a base model, optimized for Estonian language capabilities and intended for further fine-tuning on downstream tasks rather than direct instruction following. On Estonian benchmarks and translation metrics it outperforms its base model and several other 8B-class models.
Model Overview
The tartuNLP/Llama-3.1-EstLLM-8B-0525 is an 8-billion-parameter base text completion model developed by TartuNLP and TalTechNLP, funded by the Estonian Ministry of Education and Research. It was created by continued pre-training of the original meta-llama/Llama-3.1-8B model on an additional 35 billion tokens of training data. This dataset includes the Estonian National Corpus (8.6B tokens), Python-Edu (3.3B tokens), FineMath4-Plus (9.5B tokens), General Instruction-Augmented Corpora (7.4B tokens), and Cosmopedia v2 (6.9B tokens).
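For orientation, here is a minimal loading-and-completion sketch using the Hugging Face transformers library. The model ID comes from this card; the dtype, device placement, Estonian prompt, and generation settings are illustrative assumptions, not recommended values.

```python
# Minimal sketch: load the base model and continue a text prefix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tartuNLP/Llama-3.1-EstLLM-8B-0525"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumes a GPU with bf16 support
    device_map="auto",
)

# Base model: give it a prefix to continue, not a chat instruction.
prompt = "Eesti keel on"  # "The Estonian language is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```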
Key Capabilities
- Enhanced Estonian Language Proficiency: Significantly improved performance on Estonian benchmarks, including belebele-et, exam-et, grammar-et, inflection-et, trivia-et, winogrande-et, xcopa-et, and GlobalPIQA-et, often surpassing its Llama 3.1 base and other comparable models.
- Multilingual Support: While primarily focused on Estonian, it retains strong English capabilities, as evidenced by its performance on belebele-en, MMLU-Redux, and winogrande.
- Translation Performance: Achieves competitive BLEU scores for Estonian-to-English and English-to-Estonian translation tasks (a few-shot prompting sketch follows this list).
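Since this is a base model, translation is typically elicited with few-shot prompting rather than an instruction. The sketch below reuses the `model` and `tokenizer` from the loading example above; the prompt template and example sentence pairs are illustrative assumptions, not the evaluation setup behind the reported BLEU scores.

```python
# Few-shot English-to-Estonian translation with a base completion model.
few_shot_prompt = (
    "English: Good morning.\nEstonian: Tere hommikust.\n\n"
    "English: Thank you very much.\nEstonian: Suur tänu.\n\n"
    "English: Where is the library?\nEstonian:"
)
inputs = tokenizer(few_shot_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
# Decode only the newly generated tokens, then keep the first line,
# since a base model may keep generating further example pairs.
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion.split("\n")[0].strip())
```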
Good For
- Fine-tuning for Estonian NLP tasks: This model is explicitly designed as a base model for further fine-tuning on specific downstream tasks requiring strong Estonian language understanding and generation (see the parameter-efficient fine-tuning sketch after this list).
- Research and Development: Ideal for researchers exploring continued pre-training techniques and developing specialized LLMs for less-resourced languages like Estonian.
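As a hedged sketch of that fine-tuning path, the snippet below attaches LoRA adapters with the peft library. The rank, alpha, dropout, and target modules are common Llama defaults assumed here for illustration, not values published for this model.

```python
# Sketch: prepare the model for parameter-efficient fine-tuning (LoRA).
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tartuNLP/Llama-3.1-EstLLM-8B-0525", torch_dtype=torch.bfloat16
)
lora_config = LoraConfig(
    r=16,                       # assumed adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights are trainable
# From here, train with your preferred trainer (e.g. transformers.Trainer
# or trl's SFTTrainer) on an Estonian downstream dataset.
```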
Limitations
- Base Model: It is a base text completion model and not instruction-tuned, meaning it is not suitable for direct chat or instruction-following without further fine-tuning.
- Context Size: The continued training was performed with a sequence length of 4096 tokens, so the effective context may be limited to roughly that length even though the underlying Llama 3.1 architecture supports longer positions (a conservative truncation sketch follows this list).
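One conservative way to work within that limit is to truncate inputs at tokenization time, as sketched below. This reuses the `tokenizer` and `model` from the loading example; the input file is hypothetical, and 4096 is a cautious default taken from this card's training setup, not a hard architectural limit.

```python
# Sketch: cap inputs at the 4096-token continued-training length.
MAX_LEN = 4096  # sequence length used during continued pre-training
long_document = open("document.txt").read()  # hypothetical input file
inputs = tokenizer(
    long_document,
    return_tensors="pt",
    truncation=True,      # drop tokens beyond max_length
    max_length=MAX_LEN,
).to(model.device)
```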