irlab-udc/Llama-3.1-8B-Instruct-Galician
The irlab-udc/Llama-3.1-8B-Instruct-Galician model, also known as Cabuxa 2.0, is an 8-billion-parameter instruction-tuned causal language model developed by the UDC Information Retrieval Lab (IRLab). It is a continued-pretraining adaptation of Meta's Llama-3.1-8B-Instruct, tailored to the Galician language using the CorpusNós dataset. The model is optimized for natural language processing tasks in Galician, aiming to improve AI accessibility for underrepresented languages.
Key Capabilities
- Galician Language Adaptation: The model is specifically adapted, via continued pretraining, for natural language processing in Galician, addressing the underrepresentation of minority languages in LLMs.
- Instruction Following: Inherits instruction-following capabilities from its Llama-3.1-8B-Instruct base, adapted for Galician-specific prompts.
- Performance: In evaluations, this model has been shown to outperform both the base Llama-3.1 model and a comparable Galician model, in both quantitative and qualitative terms, on Galician NLP tasks.
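The model can be loaded like any Llama 3.1 checkpoint with the Hugging Face `transformers` library. The sketch below is a minimal, unofficial example: the `pipeline` chat-format call assumes a recent `transformers` release, and the Galician prompt is illustrative.

```python
MODEL_ID = "irlab-udc/Llama-3.1-8B-Instruct-Galician"


def build_chat(user_prompt: str) -> list[dict]:
    """Wrap a user prompt in the Llama 3.1 chat message format."""
    return [{"role": "user", "content": user_prompt}]


def generate(prompt: str, max_new_tokens: int = 128) -> str:
    """Generate a Galician response; requires a GPU and ~16 GB of VRAM in bf16."""
    # Imports are local so this file loads without transformers/torch installed.
    import torch
    from transformers import pipeline

    generator = pipeline(
        "text-generation",
        model=MODEL_ID,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    out = generator(build_chat(prompt), max_new_tokens=max_new_tokens)
    # With chat input, recent transformers returns the full message list;
    # the last message is the assistant's reply.
    return out[0]["generated_text"][-1]["content"]
```

Example call: `generate("Cal é a capital de Galicia?")`.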
Training Details
The model was trained with a learning rate of 1e-4 and a batch size of 32 for one epoch. Training used 4 NVIDIA A100 SXM4 80 GB GPUs for 60 hours, with an estimated carbon emission of 10.37 kg CO₂ eq.
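For rough accounting, the reported figures imply 240 GPU-hours of compute. A small arithmetic sketch (all inputs are the numbers stated above; the per-GPU-hour rate is derived, not reported):

```python
# Training-footprint arithmetic from the reported figures.
NUM_GPUS = 4          # NVIDIA A100 SXM4 80 GB
HOURS = 60            # wall-clock training time
EMISSIONS_KG = 10.37  # kg CO2 eq, as reported

gpu_hours = NUM_GPUS * HOURS                # total GPU-hours of compute
kg_per_gpu_hour = EMISSIONS_KG / gpu_hours  # derived emission rate

print(gpu_hours)                    # 240
print(round(kg_per_gpu_hour, 4))    # ~0.0432 kg CO2 eq per GPU-hour
```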
Use Cases
This model is ideal for applications requiring robust language understanding and generation in Galician, such as:
- Conversational AI systems in Galician.
- Text generation and summarization for Galician content.
- Research and development in NLP for underrepresented languages.