GAIA: An Open Language Model for Brazilian Portuguese
GAIA (Gemma-3-Gaia-PT-BR-4b-it) is a 4.3-billion-parameter language model developed specifically for Brazilian Portuguese. It was created through a collaboration between the Center of Excellence in Artificial Intelligence (CEIA-UFG), the Brazilian Association of AI (ABRIA), Nama, Amadeus AI, and Google DeepMind. The model is based on google/gemma-3-4b-pt and underwent continuous pre-training on a 13-billion-token corpus of high-quality Portuguese data, including scientific articles and Wikipedia.
Key Capabilities
- Brazilian Portuguese Specialization: Deep understanding and generation of text in Brazilian Portuguese.
- Instruction Following: Designed to follow instructions for chat, question answering, and content generation.
- Robust Foundation: Serves as a strong base model for fine-tuning on specific Portuguese NLP tasks.
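As a sketch of direct use, the model can be queried through the Hugging Face transformers text-generation pipeline. The model id below is assumed from the card's name (CEIA-UFG/Gemma-3-Gaia-PT-BR-4b-it), and the chat-message format follows the convention used by instruction-tuned Gemma models; check the model's repository for the exact identifier.

```python
# Minimal sketch of chatting with GAIA via transformers.
# Assumption: the Hub id "CEIA-UFG/Gemma-3-Gaia-PT-BR-4b-it" (hypothetical here).

def build_chat(user_message: str) -> list[dict]:
    """Wrap a single user turn in the chat format expected by
    instruction-tuned Gemma-style models."""
    return [{"role": "user", "content": user_message}]

if __name__ == "__main__":
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model="CEIA-UFG/Gemma-3-Gaia-PT-BR-4b-it",  # assumed model id
    )
    messages = build_chat("Explique o que é aprendizado de máquina em uma frase.")
    out = pipe(messages, max_new_tokens=128)
    print(out[0]["generated_text"])
```

The heavy model download runs only under `__main__`, so the chat-formatting helper can be reused (e.g. for batching prompts) without loading weights.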
Performance Highlights
GAIA demonstrates competitive performance against the google/gemma-3-4b-it baseline, with a notable improvement on the ENEM 2024 benchmark (0.7000 vs. 0.6556). Its development involved a weight merging technique to restore instruction-following capabilities after continuous pre-training, as detailed in the paper "Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs".
Good for
- Direct use in chat, summarization, and creative content generation in Portuguese.
- Fine-tuning for sentiment analysis, RAG systems, document classification, and specialized chatbots in Portuguese.