Tucano-1b1: A Portuguese-Native Foundational LLM
Tucano-1b1 is a 1.1-billion-parameter, Transformer-based causal language model developed by TucanoBR. It is part of the Tucano series, a family of models natively pretrained in Portuguese. The model was trained on GigaVerbo, a deduplicated corpus of 200 billion tokens of Portuguese text, giving it broad coverage of the language.
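The model can be loaded with the Hugging Face transformers library. A minimal sketch, assuming the checkpoint is published on the Hub under the TucanoBR/Tucano-1b1 identifier inferred from the names above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hub identifier assumed from the model and organization names above
model_id = "TucanoBR/Tucano-1b1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```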
Key Capabilities & Features
- Native Portuguese Pretraining: Specifically designed and trained for the Portuguese language, unlike many multilingual models.
- Foundational Model: Intended for research and development, providing a controlled setting for experiments and a base for fine-tuning.
- Causal Language Modeling: Pretrained with a causal (next-token prediction) objective, making it suitable for text generation tasks in Portuguese (see the generation sketch after this list).
- Context Length: Supports a context length of 2048 tokens.
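To illustrate the generation workflow named above, here is a minimal sketch using the transformers pipeline API. The prompt and sampling parameters are illustrative, and the Hub identifier is assumed as before:

```python
from transformers import pipeline

# Hub identifier assumed from the model and organization names above
generator = pipeline("text-generation", model="TucanoBR/Tucano-1b1")

# Illustrative Portuguese prompt; the prompt plus max_new_tokens must fit
# within the model's 2048-token context window
output = generator(
    "A capital do Brasil é",
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
)
print(output[0]["generated_text"])
```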
Intended Use Cases
- Research & Development: Ideal for academic and industrial research involving Portuguese language modeling.
- Comparative Experiments: Useful for studying the effects of native Portuguese pretraining on benchmark performance.
- Fine-tuning: Can be adapted and fine-tuned for specific downstream applications in Portuguese, provided users conduct their own risk and bias assessments (a minimal sketch follows this list).
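As a sketch of the fine-tuning path mentioned above, the following uses the standard Hugging Face Trainer for causal-LM fine-tuning. The corpus file name (meu_corpus.txt) and the hyperparameters are placeholders, not values from the Tucano authors:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "TucanoBR/Tucano-1b1"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The data collator needs a pad token; fall back to EOS if none is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Hypothetical local corpus: one Portuguese document per line
dataset = load_dataset("text", data_files={"train": "meu_corpus.txt"})

def tokenize(batch):
    # Truncate to the model's 2048-token context window
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="tucano-1b1-finetuned",
        per_device_train_batch_size=1,
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    # mlm=False selects the causal (next-token) objective
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Batch size, epoch count, and sequence length should be tuned to the target task and available hardware; the values above are only meant to make the sketch runnable.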
Limitations
It is important to note that Tucano-1b1 is not intended for direct deployment as an out-of-the-box product. It has not been fine-tuned for downstream tasks, and it was trained exclusively on Portuguese text, so it should not be expected to perform well in other languages. Like other large language models, it is prone to hallucinations, can reproduce biases from its training data, and may generate unreliable code or repetitive output.