Llama-Carvalho-PT-GL: Multilingual Model for Iberian Romance Languages
Llama-Carvalho-PT-GL is an 8-billion-parameter causal language model, continually pretrained by Nos-PT from Meta's Llama-3.1-8B. It is designed to excel in Galician and Portuguese while maintaining proficiency in Spanish and English, and belongs to the Carvalho family of LLMs, which focuses on these languages.
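As a Llama-style checkpoint, the model can be loaded with the standard Transformers APIs. The snippet below is a minimal sketch: the hub ID `Nos-PT/Llama-Carvalho-PT-GL` and the Galician prompt are illustrative assumptions, so check the model page for the published identifier.

```python
# Minimal text-generation sketch using Hugging Face Transformers.
# NOTE: the hub ID below is an assumption based on the model name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Nos-PT/Llama-Carvalho-PT-GL"  # hypothetical hub ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 8B weights fit in roughly 16 GB at bf16
    device_map="auto",
)

prompt = "A lingua galega"  # illustrative Galician prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```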
Key Capabilities
- Multilingual Proficiency: Strong performance in Galician, Portuguese, Spanish, and English, with particular emphasis on Galician and Portuguese.
- Continual Pretraining: Enhanced language understanding through additional training on a diverse corpus of 540M plain-text tokens and 72M instruction tokens.
- Text Generation: Ready to use for causal language modeling and text generation tasks.
- Fine-tuning Ready: Can be further fine-tuned for specific downstream applications (see the sketch after this list).
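Since the model is advertised as fine-tuning ready, a common low-cost route is parameter-efficient fine-tuning. The sketch below uses the `peft` library with LoRA; this is not the authors' recipe, and the rank, target modules, and hub ID are illustrative assumptions.

```python
# LoRA fine-tuning sketch with peft (illustrative, not the authors' recipe).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Nos-PT/Llama-Carvalho-PT-GL")  # hypothetical ID

lora_config = LoraConfig(
    r=16,                                 # illustrative adapter rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # typical Llama attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices train
# Train with transformers.Trainer or a custom loop as usual.
```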
Training Details
The model was trained with Hugging Face Transformers and PyTorch, using DeepSpeed for training efficiency. The plain-text corpus prioritized Galician (232M tokens) and Portuguese (250M tokens), alongside Spanish and English; the instruction data likewise favored Galician (26.7M tokens) and Portuguese (44M tokens).
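The card does not publish the actual training configuration, so the following is only a sketch of how a Transformers-plus-DeepSpeed continual-pretraining run is typically wired; every hyperparameter, the inline ZeRO config, and the tiny stand-in corpus are assumptions for illustration.

```python
# Illustrative continual-pretraining wiring with Transformers + DeepSpeed.
# None of these values come from the model card; they only show the shape
# of such a setup. Launch with the DeepSpeed launcher: `deepspeed train.py`.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_id = "meta-llama/Llama-3.1-8B"  # the stated starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_id)

# Tiny stand-in for the real 540M-token multilingual corpus.
texts = ["O galego é unha lingua romance.", "O português é falado em Portugal."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="carvalho-cpt",      # hypothetical output path
    per_device_train_batch_size=4,  # illustrative
    gradient_accumulation_steps=8,  # illustrative
    learning_rate=1e-5,             # illustrative
    bf16=True,
    deepspeed={                     # minimal inline ZeRO-2 config
        "zero_optimization": {"stage": 2},
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "bf16": {"enabled": True},
    },
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```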
Evaluation
Initial evaluations on the Open Portuguese LLM Leaderboard show an average score of 60.06, with strong results on tasks such as Assin2 RTE (89.30) and HateBR Binary (82.83).
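The Open Portuguese LLM Leaderboard is built on a Portuguese fork of EleutherAI's lm-evaluation-harness, so results of this kind can in principle be reproduced locally. The sketch below uses the harness's Python API; the task identifier and hub ID are assumptions, so verify the exact names against the leaderboard's task registry.

```python
# Hedged reproduction sketch with EleutherAI's lm-evaluation-harness.
# Task and model identifiers are assumptions; verify them before running.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Nos-PT/Llama-Carvalho-PT-GL",  # hypothetical hub ID
    tasks=["assin2_rte"],  # assumed task name for Assin2 RTE
)
print(results["results"])  # per-task metrics
```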
Good For
- Applications requiring high-quality text generation in Galician and Portuguese.
- Developers looking for a base model to fine-tune for specific tasks in these languages.
- Research and development in multilingual NLP, particularly for Iberian Romance languages.