Llammas: Estonian-Optimized Llama-2 Instruction Model
Llammas is a 7-billion-parameter instruction-tuned model based on Llama-2, developed by tartuNLP with a primary focus on the Estonian language. It is the first open-source instruction-following large language model designed specifically for Estonian.
Key Capabilities & Training
- Bilingual Pre-training: The model underwent continued pre-training on 5 billion tokens from the CulturaX dataset, with 75% of documents in Estonian and 25% in English.
- Instruction-Tuning: It was instruction-tuned on a diverse mix of datasets: Alpaca-cleaned, Alpaca-est (an Estonian instruction dataset generated with gpt-3.5-turbo-0613), OASST1 top-1 English conversations, CoT, and FLAN-V2, alongside WMT18 English-Estonian translation data.
- Cross-Lingual Knowledge Transfer: Research indicates that combining cross-lingual instruction-tuning with additional monolingual pre-training significantly improves performance on Estonian tasks.
- Commonsense Reasoning & Multi-turn Conversations: The model demonstrates improved capabilities in commonsense reasoning and multi-turn conversations in Estonian, transferred from high-quality English instructions.
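Since the instruction-tuning mix is built around Alpaca-style data, prompts to the model are typically wrapped in an instruction template before generation. The sketch below shows such a prompt builder; the exact template string is an assumption here (Alpaca's standard wording), not confirmed by this card, so check the official model documentation before relying on it.

```python
# Minimal sketch of an Alpaca-style prompt builder for instruction-tuned
# models such as Llammas. The template wording is an ASSUMPTION (standard
# Alpaca format); verify against the model's own documentation.

ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_prompt(instruction: str) -> str:
    """Wrap a raw instruction in the (assumed) Alpaca-style template."""
    return ALPACA_TEMPLATE.format(instruction=instruction.strip())

# Example: an Estonian-language instruction, as Llammas is tuned for Estonian.
prompt = build_prompt("Tõlgi eesti keelde: Hello, how are you?")
print(prompt.endswith("### Response:\n"))  # True
```

The resulting string would then be passed to the model's tokenizer and `generate` call; only the text after `### Response:` is the model's answer.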
Use Cases & Resources
Llammas is well suited to applications requiring robust language understanding and generation in Estonian, particularly tasks that benefit from instruction-following capabilities. tartuNLP has published a paper detailing its creation and evaluation, and the associated Alpaca-est dataset is publicly available.