Llama-3.1-Carballo: Multilingual LLM for Galician and Ibero-Romance Languages
Llama-3.1-Carballo is an 8-billion-parameter causal language model from the Carballo family, developed by proxectonos. It is built on Meta's Llama-3.1-8B and continually pretrained on a diverse multilingual corpus of nearly 20 billion tokens with a strong emphasis on Galician text. This specialization aims to improve its performance and linguistic understanding for Galician while preserving proficiency in Spanish and English and extending coverage to Portuguese and Catalan.
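A minimal loading sketch with the Transformers library is shown below. The repository id `proxectonos/Llama-3.1-Carballo` is an assumption inferred from the developer and model names above; adjust it if the actual hub name differs.

```python
# Minimal loading sketch (the repo id "proxectonos/Llama-3.1-Carballo"
# is assumed, not confirmed by this card).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "proxectonos/Llama-3.1-Carballo"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 8B parameters: roughly 16 GB in bf16
    device_map="auto",
)
```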
Key Capabilities
- Multilingual Text Generation: Capable of generating text in Galician, Portuguese, Spanish, Catalan, and English.
- Causal Language Modeling: As a base model without instruction tuning, it addresses tasks such as translation, question answering, sentiment analysis, and named entity recognition through few-shot prompting (see the sketch after this list).
- Continual Pretraining: Benefits from targeted pretraining on a specialized corpus to improve performance in less-resourced languages.
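The sketch below shows one way to elicit translation from the base model with a few-shot prompt, reusing the `model` and `tokenizer` loaded above. The example sentences and prompt wording are illustrative, not taken from this card.

```python
# Few-shot Galician-to-Spanish translation prompt
# (illustrative sentences, not from the model card).
prompt = (
    "Galego: Bos días.\n"
    "Español: Buenos días.\n"
    "Galego: Grazas por todo.\n"
    "Español:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,                      # greedy decoding for a deterministic demo
    pad_token_id=tokenizer.eos_token_id,  # Llama tokenizers define no pad token
)
# Decode only the newly generated tokens.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```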
Training Details
The model was trained with HuggingFace Transformers and PyTorch, leveraging DeepSpeed for efficiency. Training ran for one epoch on NVIDIA A100 GPUs at the Galicia Supercomputing Center (CESGA). The training corpus breaks down as follows:
- Galician: 5 billion tokens (primarily from CorpusNós)
- Portuguese: 3 billion tokens
- Spanish: 3.5 billion tokens
- English: 3.4 billion tokens
- Catalan: 3.6 billion tokens (from CATalog)
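The original training scripts are not reproduced here; the following is a minimal sketch of how a Transformers Trainer run can be paired with DeepSpeed. The hyperparameters and ZeRO stage are illustrative assumptions; the card only states that Transformers, PyTorch, and DeepSpeed were used for one epoch.

```python
# Sketch of a continual-pretraining launch with Trainer + DeepSpeed.
# All hyperparameters below are assumptions, not values from the card.
from transformers import Trainer, TrainingArguments

ds_config = {
    "zero_optimization": {"stage": 2},  # assumed ZeRO stage
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",  # filled in from TrainingArguments
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="carballo-cpt",      # hypothetical output path
    num_train_epochs=1,             # one epoch, as stated in the card
    per_device_train_batch_size=4,  # assumed
    gradient_accumulation_steps=8,  # assumed
    bf16=True,
    deepspeed=ds_config,
)

# `model` should be loaded without device_map for DeepSpeed training;
# `train_dataset` is a pre-tokenized Dataset of input_ids (not shown).
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```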
Intended Uses
Llama-3.1-Carballo is ready to use for causal language modeling out of the box and serves as a base for fine-tuning on specific downstream tasks. It is particularly well suited for applications requiring strong performance in Galician and other Ibero-Romance languages.
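One common fine-tuning route is parameter-efficient adaptation with LoRA via the PEFT library; note that PEFT is not mentioned in this card and is only one possible approach. The target modules below follow common Llama attention-projection conventions, and the rank and alpha values are assumptions.

```python
# Sketch of LoRA fine-tuning with PEFT (one possible route;
# not the method described in this card).
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                 # assumed rank
    lora_alpha=32,                        # assumed scaling factor
    target_modules=["q_proj", "v_proj"],  # Llama attention projections
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # small fraction of the 8B weights
# peft_model can then be passed to a Trainer, as in the training sketch above.
```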