# NbAiLab/nb-notram-llama-3.1-8b-instruct Overview
This model is part of the "NB-Llama-3.x" series and the "NoTraM" line of work, developed by the National Library of Norway (NB-AiLab). It is an 8-billion-parameter instruction-tuned model based on Meta's Llama-3.1-8B-Instruct, with a 32,768-token context length. The primary goal of the fine-tuning is to significantly improve instruction-following in Norwegian Bokmål and Norwegian Nynorsk while preserving strong performance in English.
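Since the model inherits Llama 3.1's chat format, prompts follow the standard special-token layout shown below. In practice you would let `tokenizer.apply_chat_template` from Hugging Face `transformers` build this for you; the helper and example messages here are a minimal sketch, assuming the fine-tune keeps the stock Llama 3.1 template:

```python
def build_llama31_prompt(messages):
    """Assemble a Llama 3.1-style chat prompt from a list of
    {"role": ..., "content": ...} dicts (illustrative sketch)."""
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    # Leave the assistant header open so generation continues from here.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "Du er en hjelpsom assistent."},
    {"role": "user", "content": "Kva er hovudstaden i Noreg?"},
]
prompt = build_llama31_prompt(messages)
```

With the real model, the equivalent one-liner is `tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)`.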
## Key Capabilities & Features
- Multilingual Proficiency: Strong in Norwegian Bokmål, Norwegian Nynorsk, and English.
- Concise Response Style: The model is tuned to provide shorter, more direct answers, differing from more verbose conversational assistants.
- Public Data Training: Trained exclusively on publicly available datasets, including CulturaX, HPLT monolingual, the Norwegian Colossal Corpus, and Wikipedia, without using legal deposit material.
- Advanced Data Curation: Employs a data selection methodology inspired by FineWeb, utilizing custom "Corpus Quality Classifiers" to prioritize educational value and linguistic quality in Norwegian content.
- Instruction-Following Focus: Primarily uses Supervised Fine-Tuning (SFT), followed by a light Direct Preference Optimization (DPO) step to stabilize instruction-following.
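As background for the SFT + DPO bullet above, the standard DPO objective can be sketched for a single preference pair. This is the generic formulation (Rafailov et al.), not NB-AiLab's exact training code; the argument names are illustrative log-probabilities of the chosen and rejected responses under the policy and the frozen reference model:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Generic DPO loss for one preference pair (sketch).

    Encourages the policy to raise the chosen response's log-probability
    relative to the rejected one, measured against the reference model.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)): small when the policy already prefers "chosen".
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy equals the reference the margin is zero and the loss is `log 2`; as the policy learns to prefer the chosen response, the loss falls toward zero.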
## Good For
- Developing dialogue systems and assistant-style applications in Norwegian (Bokmål/Nynorsk) and English.
- Summarization and question-answering tasks in Norwegian Bokmål or Nynorsk.
- Research into adapting instruction-tuned models to smaller languages using only public data, with the aim of reducing "knowledge pocketing".