Tucano2-qwen-1.5B-Base: A Specialized Portuguese LLM
Polygl0t/Tucano2-qwen-1.5B-Base is a 1.5-billion-parameter decoder-only transformer model, continually pretrained from Qwen3-1.7B-Base. It is part of the Polygl0t initiative, which focuses on advancing language models for low-resource languages, specifically Portuguese. The model uses the same tokenizer as Tucano2-0.6B-Base, with its token embeddings transplanted via Orthogonal Matching Pursuit to improve sensitivity to Portuguese lexical, morphological, and orthographic properties.
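The model card does not spell out the transplantation recipe itself. One common formulation of OMP-based embedding transplantation, sketched below with synthetic data, expresses each token that exists only in the new vocabulary as a sparse combination of "anchor" tokens shared by both vocabularies, then reuses those sparse coefficients over the anchors' embeddings in the other model's embedding space. The dimensions, anchor count, and sparsity level here are placeholders, not values from the Tucano2 recipe.

```python
# Hedged sketch of OMP-based token-embedding transplantation (not the exact
# Tucano2 procedure). A token present only in the new vocabulary is approximated
# as a sparse combination of anchor tokens shared by both vocabularies; the
# coefficients found in one embedding space are reused in the other.
# All data below is synthetic; shapes and counts are placeholders.
import numpy as np
from sklearn.linear_model import orthogonal_mp

rng = np.random.default_rng(0)
dim_donor, dim_base = 512, 1024   # embedding widths of the two models (placeholders)
num_anchors = 2000                # tokens shared by both vocabularies (placeholder)

anchors_donor = rng.standard_normal((num_anchors, dim_donor))  # anchor embeddings, donor space
anchors_base = rng.standard_normal((num_anchors, dim_base))    # anchor embeddings, base space
new_token_donor = rng.standard_normal(dim_donor)               # new token's embedding, donor space

# Solve: new_token_donor ≈ anchors_donor.T @ coef, with at most 32 nonzero coefficients.
coef = orthogonal_mp(anchors_donor.T, new_token_donor, n_nonzero_coefs=32)

# Reuse the sparse coefficients over the same anchors in the base model's space.
new_token_base = anchors_base.T @ coef
print(new_token_base.shape)  # (1024,) -> embedding row for the transplanted token
```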
Continually pretrained on approximately 50 billion tokens, Tucano2-qwen-1.5B-Base demonstrates state-of-the-art performance on various Portuguese language benchmarks. All development data, source code, and training recipes for the Tucano2 series are openly released, making the model's development fully reproducible.
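As a standard causal language model on the Hugging Face Hub, it can be loaded with the `transformers` library. The sketch below assumes the published repository ID and, optionally, a GPU with bfloat16 support; the prompt and sampling settings are illustrative, not prescribed by the model card.

```python
# Minimal sketch: loading the base model for Portuguese text continuation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Polygl0t/Tucano2-qwen-1.5B-Base"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32,
).to(device)

# Base (non-instruct) model: prompt with plain Portuguese text for it to continue.
prompt = "A capital do Brasil é"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```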
Key Capabilities
- Portuguese Language Specialization: Optimized for high performance on Portuguese-language tasks.
- Continual Pretraining: Benefits from extensive continual pretraining on approximately 50 billion tokens, improving upon its Qwen3-1.7B base model.
- Reproducible Research: Provides open data, source code, and recipes for full reproducibility of its development.
- Comparative Experimentation: Checkpoints saved during training enable controlled comparative experiments on the effects of continual pretraining (see the checkpoint-loading sketch after this list).
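Intermediate checkpoints published on the Hub can typically be selected with the `revision` argument of `from_pretrained`. The revision names below are placeholders; the actual branch or tag names depend on how the Tucano2 checkpoints are published.

```python
# Hypothetical sketch: iterating over intermediate checkpoints via Hub revisions.
# "step-10000" and "step-20000" are placeholder revision names, not confirmed tags.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Polygl0t/Tucano2-qwen-1.5B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)

for revision in ["step-10000", "step-20000", "main"]:  # placeholder revision names
    model = AutoModelForCausalLM.from_pretrained(model_id, revision=revision)
    # ... run your Portuguese evaluation suite here and log results per checkpoint ...
    print(revision, sum(p.numel() for p in model.parameters()))
```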
Good For
- Research and Development: Ideal as a foundation model for Portuguese language modeling research.
- Fine-tuning: Suitable for adaptation to specific downstream applications in Portuguese through fine-tuning, provided risk and bias assessments are conducted first (a minimal sketch follows this list).
- Benchmarking: Useful for evaluating the impact of continual pretraining on model performance across various benchmarks.
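A minimal fine-tuning sketch, assuming the `datasets` and `peft` libraries and a hypothetical local Portuguese corpus file (`meu_corpus_pt.txt`); the LoRA targets and hyperparameters are illustrative defaults, not recommendations from the Tucano2 authors.

```python
# Illustrative parameter-efficient fine-tuning (LoRA) sketch on a Portuguese corpus.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "Polygl0t/Tucano2-qwen-1.5B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # fall back to EOS if no pad token is set
model = AutoModelForCausalLM.from_pretrained(model_id)

# Attach LoRA adapters; q_proj/v_proj targeting is a common default for Qwen-style blocks.
model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16,
                                         target_modules=["q_proj", "v_proj"]))

# Placeholder corpus: one Portuguese document per line in a local text file.
dataset = load_dataset("text", data_files={"train": "meu_corpus_pt.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tucano2-pt-finetune", per_device_train_batch_size=2,
                           num_train_epochs=1, learning_rate=2e-4, logging_steps=50),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```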