norallm/normistral-11b-warm

Warm
Public
12B
FP8
32768
1
Sep 26, 2024
License: apache-2.0
Hugging Face

NorMistral-11b-warm is an 11-billion-parameter Norwegian language model developed by the Language Technology Group (LTG) at the University of Oslo as part of the NORA.LLM family. Initialized from Mistral-Nemo-Base-2407, it was continually pretrained on 250 billion subword tokens spanning Scandinavian languages, Sámi, English, and code. The model is optimized for Norwegian and the other Scandinavian languages, uses a new tokenizer for faster inference, and was trained with a hybrid masked-causal objective, making it suitable both as a causal generative model and as a bidirectional encoder.

Overview

NorMistral-11b-warm: A Specialized Norwegian Language Model

NorMistral-11b-warm is an 11-billion-parameter language model from the NORA.LLM family, developed by the Language Technology Group (LTG) at the University of Oslo. It is built on the Mistral architecture, initialized from Mistral-Nemo-Base-2407, and continually pretrained on 250 billion subword tokens. The training corpus mixes Norwegian (Bokmål and Nynorsk), Sámi, other Scandinavian languages, high-quality English, and programming code, with low-resource languages strategically upsampled.
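
As a starting point, the sketch below loads the model with the Hugging Face transformers library and generates a short Norwegian continuation. The dtype, device placement, prompt, and generation settings are illustrative assumptions, not settings prescribed by the model card.

```python
# Minimal sketch: load NorMistral-11b-warm as a causal LM and greedily
# generate a short Norwegian continuation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "norallm/normistral-11b-warm"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 fits the available hardware
    device_map="auto",
)

prompt = "Universitetet i Oslo er"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because this is a base model rather than an instruction-tuned one, prompts work best as plain text for the model to continue, not as instructions or chat turns.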

Key Capabilities & Features

  • Optimized for Scandinavian Languages: Features a new tokenizer trained specifically on the target languages, which yields substantially faster inference than the base model (see the tokenizer comparison sketch after this list).
  • Hybrid Training Objective: Utilizes a unique hybrid masked-causal training approach (90% causal LM, 10% masked next-token prediction), allowing it to function as both a causal generative model and a bidirectional encoder.
  • Multilingual Data Integration: Trained on a carefully balanced corpus including Norwegian, Sámi, Danish, Swedish, Icelandic, Faroese, English, and code, enabling robust performance across these languages.
  • Mistral Architecture: Incorporates architectural features such as pre-normalization with RMSNorm, the SwiGLU activation function, rotary positional embeddings, and grouped-query attention.
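
To make the tokenizer point concrete, a rough comparison like the one below counts how many subword tokens each tokenizer produces for the same Norwegian sentence. The sample sentence is illustrative, and downloading the Mistral-Nemo tokenizer may require accepting its terms on Hugging Face.

```python
# Rough check of tokenizer fertility: fewer tokens per sentence means fewer
# decoding steps, which is where the inference speedup comes from.
from transformers import AutoTokenizer

sentence = "Språkmodellen er trent på norsk, samisk og andre skandinaviske språk."

for model_id in ("norallm/normistral-11b-warm", "mistralai/Mistral-Nemo-Base-2407"):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    n_tokens = len(tokenizer(sentence)["input_ids"])
    print(f"{model_id}: {n_tokens} tokens")
```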

When to Use This Model

  • Norwegian Language Tasks: Ideal for applications that require strong Norwegian proficiency, including translation, text generation, and language understanding (an illustrative translation prompt follows this list).
  • Research in Continual Pretraining: Particularly relevant for researchers studying the effects of continual pretraining on smaller languages and cross-lingual transfer.
  • Bidirectional Encoding: Thanks to its hybrid training objective, it can be fine-tuned for tasks typically handled by BERT-like encoders, such as masked language modeling.
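
Because the model is not instruction-tuned, translation is best framed as plain text completion. The prompt template in the sketch below is an assumed illustration, not a format prescribed by the model card.

```python
# Illustrative zero-shot English-to-Bokmål translation via completion-style
# prompting with a base (non-instruction-tuned) model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "norallm/normistral-11b-warm"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Give the English source and let the model continue with the Bokmål line.
prompt = "Engelsk: The library opens at nine o'clock.\nBokmål:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True).strip())
```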

This model is primarily intended for research purposes and is released under the Apache 2.0 license. It is not instruction-tuned and may generate inappropriate content.