norallm/normistral-11b-warm

Warm
Public
12B
FP8
32768
1
Sep 26, 2024
License: apache-2.0
Hugging Face

NorMistral-11b-warm is an 11-billion-parameter Norwegian language model developed by the Language Technology Group (LTG) at the University of Oslo as part of the NORA.LLM family. Initialized from Mistral-Nemo-Base-2407, it was continually pretrained on 250 billion subword tokens spanning Scandinavian languages, Sámi, English, and code. The model is optimized for Norwegian and the other Scandinavian languages, uses a new tokenizer for faster inference, and was trained with a hybrid masked-causal objective, making it suitable both as a causal generative model and as a bidirectional encoder.

Overview

NorMistral-11b-warm: A Specialized Norwegian Language Model

NorMistral-11b-warm is an 11-billion-parameter language model from the NORA.LLM family, developed by the Language Technology Group (LTG) at the University of Oslo. It is built on the Mistral architecture, initialized from Mistral-Nemo-Base-2407, and continually pretrained on 250 billion subword tokens. The training corpus mixes Norwegian (Bokmål and Nynorsk), Sámi, other Scandinavian languages, high-quality English, and programming code, with low-resource languages strategically upsampled.
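
As a starting point, the sketch below loads the model with the Hugging Face transformers library and generates a short Norwegian continuation. The dtype, device placement, prompt, and generation settings are illustrative assumptions, not settings prescribed by the model card.

```python
# Minimal sketch: load NorMistral-11b-warm as a causal LM and greedily
# generate a short Norwegian continuation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "norallm/normistral-11b-warm"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 fits the available hardware
    device_map="auto",
)

prompt = "Universitetet i Oslo er"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because this is a base model rather than an instruction-tuned one, prompts work best as plain text for the model to continue, not as instructions or chat turns.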

Key Capabilities & Features

  • Optimized for Scandinavian Languages: Features a new tokenizer trained specifically on the target languages, which yields substantially faster inference than the base model (see the tokenizer comparison sketch after this list).
  • Hybrid Training Objective: Utilizes a unique hybrid masked-causal training approach (90% causal LM, 10% masked next-token prediction), allowing it to function as both a causal generative model and a bidirectional encoder.
  • Multilingual Data Integration: Trained on a carefully balanced corpus including Norwegian, Sámi, Danish, Swedish, Icelandic, Faroese, English, and code, enabling robust performance across these languages.
  • Mistral Architecture: Incorporates architectural features such as pre-normalization with RMSNorm, the SwiGLU activation function, rotary positional embeddings, and grouped-query attention.
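
To make the tokenizer point concrete, a rough comparison like the one below counts how many subword tokens each tokenizer produces for the same Norwegian sentence. The sample sentence is illustrative, and downloading the Mistral-Nemo tokenizer may require accepting its terms on Hugging Face.

```python
# Rough check of tokenizer fertility: fewer tokens per sentence means fewer
# decoding steps, which is where the inference speedup comes from.
from transformers import AutoTokenizer

sentence = "Språkmodellen er trent på norsk, samisk og andre skandinaviske språk."

for model_id in ("norallm/normistral-11b-warm", "mistralai/Mistral-Nemo-Base-2407"):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    n_tokens = len(tokenizer(sentence)["input_ids"])
    print(f"{model_id}: {n_tokens} tokens")
```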

When to Use This Model

  • Norwegian Language Tasks: Ideal for applications that require strong Norwegian proficiency, including translation, text generation, and language understanding (an illustrative translation prompt follows this list).
  • Research in Continual Pretraining: Particularly relevant for researchers studying the effects of continual pretraining on smaller languages and cross-lingual transfer.
  • Bidirectional Encoding: Thanks to its hybrid training objective, it can be fine-tuned for tasks typically handled by BERT-like encoders, such as masked language modeling.
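
Because the model is not instruction-tuned, translation is best framed as plain text completion. The prompt template in the sketch below is an assumed illustration, not a format prescribed by the model card.

```python
# Illustrative zero-shot English-to-Bokmål translation via completion-style
# prompting with a base (non-instruction-tuned) model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "norallm/normistral-11b-warm"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Give the English source and let the model continue with the Bokmål line.
prompt = "Engelsk: The library opens at nine o'clock.\nBokmål:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True).strip())
```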

This model is primarily intended for research purposes and is released under the Apache 2.0 license. It is not instruction-tuned and may generate inappropriate content.