NorMistral-7b-scratch: A Norwegian-First LLM
NorMistral-7b-scratch is a 7-billion-parameter large language model in the NORA.LLM family, developed by the Language Technology Group at the University of Oslo in collaboration with HPLT, the National Library of Norway, and the University of Turku. It is distinctive in that it was pretrained entirely from scratch, rather than adapted from an existing checkpoint, on 260 billion subword tokens consisting predominantly of open Norwegian texts (Bokmål and Nynorsk), augmented with 20% code data from Starcoder.
Key Capabilities
- Native Norwegian Language Proficiency: Specifically trained on a comprehensive Norwegian corpus, ensuring deep understanding and generation capabilities for both Bokmål and Nynorsk.
- Mistral Architecture: Built on the efficient Mistral/Llama-style decoder-only transformer architecture.
- Research-Oriented: Primarily intended for research purposes, providing a strong foundation for further fine-tuning and development in Norwegian NLP.
- Multilingual Translation: Demonstrates strong performance in machine translation tasks involving English, Bokmål, and Nynorsk, particularly translation between Bokmål and Nynorsk.
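As a base (non-instruct) model, NorMistral-7b-scratch is typically used for plain-text completion. A minimal generation sketch with Hugging Face `transformers` might look like the following; the repository ID `norallm/normistral-7b-scratch`, the example prompt, and the decoding settings are assumptions for illustration, not taken from this document:

```python
# Minimal sketch of free-text completion with a base (non-instruct) model.
# The repository ID below is an assumption; verify it on the Hugging Face Hub.
MODEL_ID = "norallm/normistral-7b-scratch"

# Open-ended Nynorsk prompt for the model to continue (illustrative).
PROMPT = "Noreg er eit land i"


def generate(prompt: str, max_new_tokens: int = 50) -> str:
    """Load the model and complete `prompt`. Imports are kept inside the
    function so the sketch can be read without transformers installed."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Greedy decoding; a base model simply continues the text.
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Because this is a completion model rather than a chat model, prompts should be phrased as text to continue, not as instructions.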
Good for
- Developing Norwegian NLP Applications: Ideal for researchers and developers building applications that require robust performance in the Norwegian language.
- Machine Translation: Excels in translation tasks between English and Norwegian (Bokmål/Nynorsk), and especially between Bokmål and Nynorsk.
- Grammatical Error Correction: Shows promising results in Norwegian grammatical error correction, particularly in few-shot settings.
- Sentiment Analysis and Reading Comprehension: Provides a solid baseline for tasks like sentiment analysis and extractive question answering in Norwegian.
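Since the base model performs these tasks best with in-context examples, a task like Bokmål-to-Nynorsk translation is usually framed as a few-shot prompt. A sketch of one possible prompt builder follows; the template wording and the demonstration sentence pairs are illustrative assumptions, not taken from the model card:

```python
# Build a few-shot Bokmål -> Nynorsk translation prompt for a base LM.
# The template and example pairs below are illustrative assumptions.

FEW_SHOT_PAIRS = [
    ("Jeg liker å lese bøker.", "Eg likar å lese bøker."),
    ("Hun kommer ikke i dag.", "Ho kjem ikkje i dag."),
]


def build_translation_prompt(source: str) -> str:
    """Format demonstration pairs followed by the sentence to translate;
    the model is expected to continue after the final 'Nynorsk:' label."""
    lines = []
    for bokmaal, nynorsk in FEW_SHOT_PAIRS:
        lines.append(f"Bokmål: {bokmaal}")
        lines.append(f"Nynorsk: {nynorsk}")
    lines.append(f"Bokmål: {source}")
    lines.append("Nynorsk:")
    return "\n".join(lines)
```

The resulting string would then be passed to the model for completion, typically with greedy decoding and generation stopped at the next newline so only the translated sentence is returned.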