NorMistral-11b-long: Extended Context for Scandinavian Languages
NorMistral-11b-long is an 11.4 billion parameter causal language model developed by the Language Technology Group at the University of Oslo (LTG) within the NORA.LLM initiative. It is a length-extended version of NorMistral-11b-warm, with the context window increased to 32,768 tokens.
Key Capabilities & Features
- Extended Context: Achieves a 32,768-token context length through continual training on an additional 50 billion subword tokens.
- Multilingual Focus: Training data includes a mix of Scandinavian (Norwegian Bokmål, Nynorsk, Danish, Swedish, Icelandic, Faroese), Sámi, English, and programming code.
- Optimized Tokenizer: Uses a new tokenizer trained specifically for the target languages, with improved subword-to-word split ratios that make inference substantially faster than with the base Mistral-Nemo-Base-2407 model.
- Architecture: Based on the Mistral architecture, featuring pre-normalization with RMSNorm, SwiGLU activations, rotary positional embeddings, and grouped-query attention.
- Research-Oriented: Primarily intended for research purposes, particularly in the domain of low-resource and Scandinavian language processing.
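The rotary positional embeddings mentioned above are what make length extension like this possible: token positions are encoded as rotations of dimension pairs, so attention scores depend only on relative offsets. A minimal pure-Python sketch of the idea (an illustration of the standard RoPE scheme, not the model's actual implementation):

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply rotary positional embeddings to one attention-head vector.

    Each dimension pair (i, i + half) is rotated by the angle
    pos * base**(-i / half). Because rotations compose, the dot product
    of two rotated vectors depends only on the distance between their
    positions, not on the absolute positions themselves.
    """
    half = len(vec) // 2
    out = [0.0] * len(vec)
    for i in range(half):
        theta = pos * base ** (-i / half)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s          # 2-D rotation of the pair
        out[i + half] = x1 * s + x2 * c
    return out
```

The relative-position property is what continual training at longer lengths exploits: queries and keys separated by the same offset interact identically wherever they sit in the 32,768-token window.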
Good For
- Research in Scandinavian NLP: Ideal for academic and research applications focusing on Norwegian, Sámi, and other Nordic languages.
- Long-Context Tasks: Suitable for tasks that require processing long textual inputs, thanks to its 32,768-token context window.
- Continual Training Studies: A practical example of continual training for language extension, following the methodology outlined in the paper "Small Languages, Big Models: A Study of Continual Training on Languages of Norway."
- Efficient Inference: Benefits from a custom tokenizer that enhances inference speed for its target languages.
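The tokenizer-efficiency claim above is usually quantified as a subword-to-word split ratio: the average number of subword tokens produced per whitespace word. A small sketch of the metric (the chunking tokenizer here is a toy stand-in; a real measurement would plug in the model's own tokenizer, e.g. via `transformers.AutoTokenizer`):

```python
def split_ratio(texts, tokenize):
    """Average number of subword tokens per whitespace-separated word.

    Lower is better: a ratio near 1.0 means the tokenizer rarely splits
    words, which shortens sequences and speeds up inference.
    """
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# Toy stand-in that splits every word into 4-character chunks,
# mimicking a tokenizer poorly fitted to the language.
def chunk_tokenize(text, size=4):
    return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]
```

Comparing the ratio of the new tokenizer against the original Mistral-Nemo one on Norwegian text is how the speedup would show up in practice: fewer tokens per word means fewer forward passes per sentence.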