maritaca-ai/sabia-7b

Public · 7B · FP8 · 4096 · Hugging Face
Overview

Sabiá-7B: A Portuguese-Optimized LLaMA-1 Model

Sabiá-7B is a 7-billion-parameter auto-regressive language model developed by Maritaca AI, built on the LLaMA-1-7B architecture and tokenizer. Starting from the LLaMA-1-7B weights, it was further pretrained on the Portuguese subset of ClueWeb22 (roughly 7 billion tokens) for an additional 10 billion training tokens, about 1.4 epochs of the corpus, making it highly specialized for the Portuguese language.

Key Capabilities & Characteristics

  • Portuguese Language Focus: Specifically designed and trained for Portuguese, demonstrating strong performance on Portuguese benchmarks like Poeta, where it outperforms LLaMA-1-7B and LLaMA-2-7B.
  • LLaMA-1 Architecture: Utilizes the LLaMA-1-7B architecture and tokenizer, providing a familiar base for developers.
  • Few-shot Learning: Recommended for few-shot tasks rather than zero-shot, as it was trained solely on a language-modeling objective without instruction-tuning (see the prompt sketch after this list).
  • Text-only: Accepts and generates text-only input and output.
  • Research Use Only: Licensed under the same restrictions as LLaMA-1, limiting its use to research purposes.
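
Because Sabiá-7B is a plain language model with no instruction-tuning, prompts work best when they embed a few worked examples of the task. The sketch below shows one possible few-shot prompt format; the task, labels, and example reviews are purely illustrative and are not taken from the model card.

```python
# Illustrative few-shot prompt for a Portuguese sentiment-classification task.
# The format (a short instruction, a few input/label pairs, then a final
# unlabeled input) is a common pattern; the examples themselves are made up.
few_shot_prompt = """Classifique a resenha como positiva ou negativa.

Resenha: Adorei o filme, a atuação foi excelente.
Classe: positiva

Resenha: O produto chegou quebrado e o suporte não respondeu.
Classe: negativa

Resenha: A comida estava deliciosa e o atendimento foi ótimo.
Classe:"""
```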

Performance Highlights

Sabiá-7B shows superior performance on Portuguese benchmarks compared to its LLaMA counterparts. On the Poeta benchmark, for instance, Sabiá-7B achieved a Normalized Preferred Metric (NPM) of 48.5, surpassing LLaMA-1-7B (33.0) and LLaMA-2-7B (43.7). While optimized for Portuguese, its performance on English datasets remains comparable to LLaMA-1-7B.

Usage Recommendation

This model is ideal for researchers and developers working on Portuguese natural language processing tasks who require a robust base model for few-shot applications. Users should be aware of its research-only license and the need for few-shot prompting due to the lack of instruction-tuning.
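
For reference, a minimal usage sketch with the Hugging Face transformers library is shown below. It assumes transformers, torch, and accelerate are installed and that enough memory is available for a 7B checkpoint; the dtype, device placement, and decoding settings are illustrative choices rather than values prescribed by the model card.

```python
# Minimal sketch: load Sabiá-7B and complete a few-shot prompt.
# Assumes `transformers`, `torch`, and `accelerate` are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "maritaca-ai/sabia-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # illustrative; use float32 on CPU
    device_map="auto",
)

# A short few-shot prompt in the style sketched earlier (examples are made up).
prompt = (
    "Resenha: Adorei o filme, a atuação foi excelente.\n"
    "Classe: positiva\n\n"
    "Resenha: O produto chegou quebrado e o suporte não respondeu.\n"
    "Classe: negativa\n\n"
    "Resenha: A comida estava deliciosa e o atendimento foi ótimo.\n"
    "Classe:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding keeps the completion close to the in-context examples.
output_ids = model.generate(
    **inputs,
    max_new_tokens=10,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)

# Print only the newly generated continuation, not the echoed prompt.
completion = tokenizer.decode(
    output_ids[0, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(completion)
```

Because this is a base language model, it will often keep generating further "Resenha:/Classe:" pairs after answering; limiting max_new_tokens or trimming the output at the first newline keeps only the label.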