Latxa: An Open Language Model for Basque
Latxa is a family of large language models developed by the HiTZ Research Center and the IXA research group to address the limitations of existing LLMs for low-resource languages such as Basque. This particular model, latxa-7b-v1.2, is a 7-billion-parameter variant based on Meta's Llama 2 architecture.
Key Capabilities & Features
- Basque Language Specialization: Continued pretraining on a new, high-quality Basque corpus of 4.3 million documents (4.2 billion tokens) gives the model strong command of Basque.
- Performance: Outperforms all previous open models for Basque by a significant margin, and its language proficiency and understanding on Basque-specific tasks are competitive with GPT-4 Turbo.
- Architecture: Inherits the Llama 2 architecture, providing a robust foundation.
- Open Availability: The Latxa models, together with the new pretraining corpus and evaluation datasets, are publicly available under open licenses, fostering research on LLMs for low-resource languages.
- Multilingual Context: While primarily focused on Basque, the training data also included 500K English documents from the Pile dataset to prevent catastrophic forgetting.
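The English-mixing step above can be sketched as a simple document sampler. This is an illustrative sketch only: the 5% mixing probability and the `mixed_stream` helper are assumptions for demonstration, not the actual Latxa training recipe.

```python
import random

def mixed_stream(basque_docs, english_docs, english_fraction=0.05, seed=0):
    """Interleave a small share of English documents into a Basque stream.

    Illustrative only: Latxa mixed 500K Pile documents into its Basque
    corpus to prevent catastrophic forgetting; the exact sampling scheme
    used in training is an assumption here.
    """
    rng = random.Random(seed)
    mixed, ei = [], 0
    for doc in basque_docs:
        # Occasionally emit an English document before the next Basque one.
        while ei < len(english_docs) and rng.random() < english_fraction:
            mixed.append(english_docs[ei])
            ei += 1
        mixed.append(doc)
    return mixed

# Toy corpora standing in for the real datasets.
stream = mixed_stream([f"eu_{i}" for i in range(1000)],
                      [f"en_{i}" for i in range(100)])
```

The Basque documents keep their original order, with English documents sprinkled in at roughly the requested fraction.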
Intended Use Cases
- Basque Language Processing: Ideal for tasks requiring deep understanding and generation in Basque.
- Further Fine-tuning: As a pre-trained LLM, it is suitable for further fine-tuning on specific Basque-centric applications or tasks.
- Research on Low-Resource Languages: Provides a valuable resource for researchers exploring methods to build LLMs for languages with limited digital resources.
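For the generation and fine-tuning use cases above, the model can be loaded with the Hugging Face `transformers` library. A minimal sketch follows; the hub id `HiTZ/latxa-7b-v1.2` and the `generate_basque` helper are assumptions to verify against the actual model page, and since the model is a base LLM, plain text continuation (not chat prompting) is the appropriate usage.

```python
def generate_basque(prompt: str, model_id: str = "HiTZ/latxa-7b-v1.2",
                    max_new_tokens: int = 64) -> str:
    """Continue a Basque prompt with latxa-7b-v1.2.

    Assumes the `transformers` library and the hub id above (an assumption,
    check the model page). Imports are kept inside the function so the
    sketch can be defined without the library installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Greedy decoding; the base model performs plain continuation, not chat.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                do_sample=False)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Note that loading the full 7B model requires a GPU with sufficient memory (or quantization); this sketch omits those details.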
Limitations
- Language Specificity: Performance is not guaranteed for languages other than Basque.
- No Instruction Fine-tuning: The model is a base pretrained LLM; it is not instruction-tuned or designed as a chat assistant, so direct instruction following or conversational use is not recommended without further fine-tuning.