Canarim-7B: A Portuguese-Specialized LLM
Canarim-7B is a 7-billion-parameter large language model developed by Maicon Domingues and specialized for Portuguese. It was pretrained on 16 billion tokens from the Portuguese subset of the CommonCrawl 2023-23 snapshot, using the LLaMA2-7B architecture as its foundation.
Key Capabilities
- Portuguese Language Specialization: Optimized for understanding and generating text in Portuguese, making it highly suitable for applications targeting Portuguese-speaking users.
- Robust Architecture: Inherits the efficient and reliable architecture of LLaMA2-7B.
- Diverse Pretraining Data: Trained on a wide range of Portuguese text, enhancing its ability to handle various contexts and nuances.
- Few-shot Learning: As a pretrained (base) model, it performs best when the prompt includes a few examples of the desired output, rather than in zero-shot scenarios.
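The few-shot pattern described above can be sketched as a simple prompt builder: labeled examples are concatenated before the new input so the model can infer the task format. This is a minimal illustration; the example texts, labels, and the "Texto"/"Sentimento" field names are illustrative choices, not a format prescribed by the model.

```python
def build_few_shot_prompt(examples, query, task="Sentimento"):
    """Assemble a few-shot prompt: labeled examples followed by the unlabeled query."""
    parts = []
    for text, label in examples:
        parts.append(f"Texto: {text}\n{task}: {label}")
    # The final block ends with the task label so the model completes it.
    parts.append(f"Texto: {query}\n{task}:")
    return "\n\n".join(parts)

# Illustrative Portuguese sentiment examples (not from the model's training data).
examples = [
    ("Adorei o atendimento, muito rápido!", "positivo"),
    ("O produto chegou quebrado.", "negativo"),
]
prompt = build_few_shot_prompt(examples, "A entrega atrasou uma semana.")
print(prompt)
```

The resulting string would then be passed to the model's generation API (for example, a Hugging Face `transformers` text-generation pipeline), with the completion after the final "Sentimento:" taken as the prediction.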
Good For
- Natural Language Understanding (NLU): Effective for tasks like sentiment analysis, topic classification, and entity recognition in Portuguese, especially with relevant examples.
- Natural Language Generation (NLG): Capable of generating coherent and contextually appropriate Portuguese text for content creation or chatbots, with improved results when given style or format examples.
- Language Translation: Suitable for high-quality translation involving Portuguese, particularly when examples are included during fine-tuning or inference.
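For translation at inference time, the same idea applies: a few source-target pairs precede the sentence to translate. A minimal sketch, with illustrative sentence pairs and language labels chosen for this example:

```python
def build_translation_prompt(pairs, sentence, src="Inglês", tgt="Português"):
    """Build a few-shot translation prompt from (source, target) example pairs."""
    blocks = [f"{src}: {s}\n{tgt}: {t}" for s, t in pairs]
    # End with the untranslated sentence so the model supplies the translation.
    blocks.append(f"{src}: {sentence}\n{tgt}:")
    return "\n\n".join(blocks)

# Illustrative English-to-Portuguese pairs.
pairs = [
    ("The weather is nice today.", "O tempo está agradável hoje."),
    ("I would like a cup of coffee.", "Eu gostaria de uma xícara de café."),
]
translation_prompt = build_translation_prompt(pairs, "Where is the train station?")
print(translation_prompt)
```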
Performance Highlights
Evaluations on the Open PT LLM Leaderboard show an average score of 47.36, with notable results on tasks such as HateBR (78.48) and ASSIN2 RTE (71.96). On the Open LLM Leaderboard, it achieved an average of 48.63, including 77.52 on HellaSwag (10-shot) and 71.43 on Winogrande (5-shot).