orai-nlp/Gemma-Kimu-2b-base

Text generation · Model size: 2.6B · Quantization: BF16 · Context length: 8k · Published: Nov 6, 2025 · License: Gemma · Architecture: Transformer

Gemma-Kimu-2b-base is a 2.6 billion parameter continually pre-trained large language model developed by orai-nlp, built upon Google's Gemma-2-2b foundational architecture. This base model is specifically adapted for the Basque language, enhancing its syntactic, lexical, and morphological competence while preserving English performance through a combination of Basque monolingual data and English replay. It serves as a foundational model for subsequent instruction-tuned versions, demonstrating significant improvements in Basque language understanding and generation fluency.


Gemma-Kimu-2b-base: Basque Language Adaptation

Gemma-Kimu-2b-base is a 2.6 billion parameter continually pre-trained large language model (LLM) developed by orai-nlp. It is built upon Google's Gemma-2-2b foundational model and focuses exclusively on language adaptation for Basque, without instruction-following alignment. This model serves as a robust base for future instruction-tuned versions, such as Gemma-Kimu-2b-it.

Key Capabilities and Training

  • Basque Language Enhancement: The model undergoes continual pre-training using a combination of Basque monolingual data and English replay. This process significantly improves its syntactic, lexical, and morphological competence in Basque.
  • Performance Preservation: The English replay strategy ensures that the model maintains its proficiency in English, facilitating cross-lingual transfer.
  • Improved Basque Fluency: Evaluations indicate that Gemma-Kimu-2b-base shows substantial improvements in Basque language understanding, coherence, and text generation fluency compared to the original Gemma-2-2b.
  • Training Data: Continually pre-trained on the ZelaiHandi dataset (521 million Basque words) together with a subset of the FineWeb dataset (300 million English tokens).
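Because this is a base (not instruction-tuned) checkpoint, it is used for plain text completion. A minimal loading sketch with the Hugging Face `transformers` library is shown below; the prompt and generation settings are illustrative, not from this card:

```python
# Minimal text-completion sketch for the base checkpoint.
# Prompt and generation settings are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "orai-nlp/Gemma-Kimu-2b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the BF16 weights listed above
    device_map="auto",
)

# Base models continue text rather than follow instructions,
# so prompt with the beginning of a passage (here, in Basque).
prompt = "Euskal Herriko hiriburuak hauek dira:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that because the model has no chat template or alignment, question-style prompts may simply be continued rather than answered.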

Good For

  • Developing Basque-centric LLMs: Ideal as a foundational model for creating instruction-tuned or task-specific models tailored for the Basque language.
  • Research in Low-Resource Language Adaptation: Useful for researchers exploring methods of adapting large language models to languages with fewer digital resources.
  • Applications Requiring Strong Basque Linguistic Understanding: Suitable for tasks demanding high accuracy in Basque text generation, comprehension, and morphological analysis.