Polygl0t/Tucano2-qwen-0.5B-Base
Polygl0t/Tucano2-qwen-0.5B-Base is a 0.5-billion-parameter decoder-only Transformer model that Polygl0t continually pretrained from Qwen3-0.6B-Base. It is optimized for Portuguese and uses token embedding transplantation to increase its sensitivity to the language's lexical and morphological properties. The model achieves state-of-the-art performance on Portuguese benchmarks and is intended for research and development in Portuguese language modeling.
Tucano2-qwen-0.5B-Base: A Specialized Portuguese LLM
Polygl0t/Tucano2-qwen-0.5B-Base is a 0.5-billion-parameter decoder-only Transformer model and part of the Polygl0t initiative to advance language models for low-resource languages. It was continually pretrained from Qwen3-0.6B-Base on approximately 50 billion tokens, with its tokenizer adapted via Orthogonal Matching Pursuit to better handle the lexical, morphological, and orthographic characteristics of Portuguese.
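The card does not spell out how Orthogonal Matching Pursuit (OMP) is applied during tokenizer adaptation. As a rough illustration of the general technique only, and not of the Polygl0t recipe, the sketch below uses OMP to approximate a target vector as a sparse combination of rows of a dictionary matrix. This is the kind of step one might use to express a new token's embedding in terms of a few existing embeddings. The function name `omp` and the parameters `dictionary`, `target`, and `k` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def omp(dictionary, target, k):
    """Greedy Orthogonal Matching Pursuit (generic sketch).

    dictionary: (n_atoms, dim) array, one candidate vector per row.
    target:     (dim,) vector to approximate.
    k:          maximum number of atoms to select.
    Returns (selected_indices, coefficients).
    """
    target = np.asarray(target, dtype=float)
    residual = target.copy()
    selected = []
    coeffs = np.zeros(0)
    norms = np.linalg.norm(dictionary, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for empty atoms
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        scores = np.abs(dictionary @ residual) / norms
        if selected:
            scores[selected] = -np.inf  # never pick the same atom twice
        selected.append(int(np.argmax(scores)))
        # Re-fit all selected atoms jointly (the "orthogonal" step).
        basis = dictionary[selected].T  # shape (dim, n_selected)
        coeffs, *_ = np.linalg.lstsq(basis, target, rcond=None)
        residual = target - basis @ coeffs
        if np.linalg.norm(residual) < 1e-10:
            break
    return selected, coeffs
```

Calling `omp(existing_embeddings, new_vector, k=8)` would return the indices of up to eight existing rows plus the weights that best reconstruct `new_vector` from them; the value of `k` here is arbitrary.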
Key Capabilities & Features
- Portuguese Language Specialization: Achieves state-of-the-art performance across several benchmarks designed for Portuguese language models, significantly outperforming its base model, Qwen3-0.6B-Base, on both 'Easy Set' and 'Hard Set' evaluations.
- Reproducible Training: All data, source code, and recipes used for its development are open and fully reproducible, promoting transparency and further research.
- Continual Pretraining: Continually pretrained from Qwen3-0.6B-Base on approximately 50 billion tokens to improve its understanding and generation of Portuguese.
- Small Footprint: With 0.5 billion parameters and a 4,096-token context length, it offers a compact solution for Portuguese NLP tasks.
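A minimal sketch of loading the model and generating text with the Hugging Face `transformers` library. It assumes the checkpoint is hosted on the Hugging Face Hub under the repository id in the title, which the card does not state explicitly. The helper names `clip_prompt` and `generate_text` are hypothetical. The prompt is clipped so that prompt plus generated tokens fit within the 4,096-token context length.

```python
MODEL_ID = "Polygl0t/Tucano2-qwen-0.5B-Base"  # assumed Hub repository id
MAX_CONTEXT = 4096  # context length stated in the card

def clip_prompt(token_ids, max_new_tokens, max_context=MAX_CONTEXT):
    """Keep the most recent prompt tokens so prompt + generation fits the window."""
    budget = max_context - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than the context length")
    return token_ids[-budget:]

def generate_text(prompt, max_new_tokens=64):
    # Imported lazily so clip_prompt works without these packages installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    ids = clip_prompt(tokenizer(prompt)["input_ids"], max_new_tokens)
    input_ids = torch.tensor([ids])
    output = model.generate(
        input_ids,
        attention_mask=torch.ones_like(input_ids),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Because this is a base model, a call such as `generate_text("A capital do Brasil é")` continues the text rather than following an instruction.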
Intended Use Cases
- Research and Development: Primarily serves as a foundation for research and development in Portuguese language modeling.
- Comparative Experiments: Checkpoints saved during training provide a controlled setting for comparative experiments on the effects of continual pretraining.
- Fine-tuning Base: Can be fine-tuned and adapted for deployment in specific applications, provided users conduct their own risk and bias assessments.
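For fine-tuning, one common data preparation step for causal language models is to concatenate tokenized documents and cut them into fixed-length blocks. The sketch below shows that step under stated assumptions: it is a generic pattern, not the procedure used by Polygl0t, and the function name `pack_into_blocks` is hypothetical. The `eos_id` value should come from the model's tokenizer, and the default block size reuses the 4,096-token context length from the card.

```python
def pack_into_blocks(tokenized_docs, eos_id, block_size=4096):
    """Concatenate documents, separated by eos_id, and cut into equal blocks.

    tokenized_docs: iterable of lists of token ids.
    Returns a list of blocks, each exactly block_size tokens long.
    The trailing remainder that does not fill a block is dropped.
    """
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eos_id)  # mark the document boundary
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]
```

Each returned block can serve as both input and label sequence for next-token prediction; whether dropping the remainder is acceptable depends on the size of the dataset.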
Limitations
- Not for Direct Deployment: Not intended as an out-of-the-box product for human-facing interactions.
- Portuguese Only: Unsuitable for text generation tasks in other languages.
- No Downstream Fine-tuning: Has not been fine-tuned for specific downstream tasks, so practical applications require further adaptation.
- Common LLM Issues: Subject to hallucinations, biases, toxicity, repetition, and verbosity, similar to other large language models trained on web data.
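Repetition is one of the listed issues that is easy to check automatically. As a rough, hypothetical heuristic (not part of the model card or its evaluation suite), the sketch below measures the fraction of word n-grams in a generated text that occur more than once; the function name `repeated_ngram_fraction` and the default `n=3` are illustrative choices.

```python
from collections import Counter

def repeated_ngram_fraction(text, n=3):
    """Fraction of word n-grams that belong to an n-gram seen more than once."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n words
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)
```

A value close to 1.0 suggests the output is looping, while a value near 0.0 suggests little verbatim repetition. Any threshold for flagging outputs would need to be tuned on real generations.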