Name: orai-nlp/Gemma-Kimu-2b-base API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: orai-nlp

Gemma-Kimu-2b-base: Basque Language Adaptation

Gemma-Kimu-2b-base is a 2.6-billion-parameter large language model (LLM) developed by orai-nlp through continual pre-training of Google's Gemma-2-2b base model. The training targets language adaptation for Basque only; the model has no instruction-following alignment, so it behaves as a text-completion model rather than a chat assistant. It is intended as the base for instruction-tuned versions such as Gemma-Kimu-2b-it.

Key Capabilities and Training

Basque Language Enhancement: The model was continually pre-trained on Basque monolingual data combined with English replay, that is, English text mixed into the training stream. This improves its syntactic, lexical, and morphological competence in Basque.
Performance Preservation: The English replay strategy helps the model retain its English proficiency, which in turn supports cross-lingual transfer.
Improved Basque Fluency: Evaluations indicate that Gemma-Kimu-2b-base shows substantial improvements in Basque language understanding, coherence, and text generation fluency compared to the original Gemma-2-2b.
Training Data: Continual pre-training uses the ZelaiHandi dataset (521 million Basque words) and a subset of the FineWeb dataset (300 million English tokens).
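
The replay idea described above can be illustrated with a small sketch. This is not the authors' training pipeline: the function name `mix_with_replay`, the `replay_fraction` value, and the document-level mixing scheme are all assumptions made for illustration, and the model card does not state the actual mixing ratio. The sketch interleaves English documents into a Basque stream so that, in expectation, a chosen fraction of the stream is English.

```python
import itertools
import random
from typing import List, Sequence, Tuple


def mix_with_replay(
    basque_docs: Sequence[str],
    english_docs: Sequence[str],
    replay_fraction: float = 0.2,  # hypothetical value, not taken from the model card
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Return (language, document) pairs with English replay mixed into Basque data.

    Every Basque document appears exactly once, in its original order. Before
    each one, English documents are inserted while a biased coin keeps coming
    up "replay", so each emitted item is English with probability
    replay_fraction.
    """
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")

    rng = random.Random(seed)
    # Cycle through the English subset so it can be reused if it runs out.
    english_cycle = itertools.cycle(english_docs) if english_docs else None

    mixed: List[Tuple[str, str]] = []
    for doc in basque_docs:
        while english_cycle is not None and rng.random() < replay_fraction:
            mixed.append(("en", next(english_cycle)))
        mixed.append(("eu", doc))
    return mixed


if __name__ == "__main__":
    stream = mix_with_replay(
        ["Kaixo, mundua.", "Euskara hizkuntza bat da."],
        ["Hello, world."],
        replay_fraction=0.3,
    )
    for language, text in stream:
        print(language, text)
```

A real pipeline would more likely mix at the token or batch level and stream from disk; the document-level version here only shows how a replay ratio controls the share of English in the training stream.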

Good For

Developing Basque-centric LLMs: Ideal as a foundational model for creating instruction-tuned or task-specific models tailored for the Basque language.
Research in Low-Resource Language Adaptation: Useful for researchers exploring methods of adapting large language models to languages with fewer digital resources.
Applications Requiring Strong Basque Linguistic Understanding: Suitable for tasks demanding high accuracy in Basque text generation, comprehension, and morphological analysis.
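
Because the model has no instruction-following alignment, it is best prompted with text to continue rather than with a question or command. The sketch below builds, without sending, a completion request for such a prompt. It assumes that the hosting provider exposes an OpenAI-compatible completions endpoint at the URL shown and accepts a bearer token; the URL, the payload fields, and the helper name `build_completion_request` are assumptions to check against the provider's documentation.

```python
import json
import urllib.request

# Assumption: an OpenAI-compatible completions endpoint at this address.
API_URL = "https://api.featherless.ai/v1/completions"
MODEL_ID = "orai-nlp/Gemma-Kimu-2b-base"


def build_completion_request(
    prompt: str,
    api_key: str,
    max_tokens: int = 64,
    temperature: float = 0.7,
) -> urllib.request.Request:
    """Build (but do not send) a POST request asking the model to continue `prompt`."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,  # plain text to continue, not a chat message
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # A Basque sentence opening for the base model to continue.
    request = build_completion_request(
        "Donostia Gipuzkoako hiriburua da, eta",
        api_key="YOUR_API_KEY",  # placeholder, not a real key
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would send the request. The response format depends on the provider and is not covered here.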
