Overview
Polygl0t/Tucano2-qwen-3.7B-Base is a 3.7-billion-parameter decoder-only transformer model, part of the Polygl0t initiative to advance language models for low-resource languages. It was continually pretrained from Qwen3-4B-Base on approximately 50 billion tokens, with its tokenizer adapted using Orthogonal Matching Pursuit (OMP) to better capture Portuguese linguistic properties. All data, source code, and training recipes for the Tucano2 series are open and fully reproducible.
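To illustrate the tokenizer-adaptation idea, the sketch below shows how Orthogonal Matching Pursuit can initialize the embedding of a new token as a sparse combination of existing source-model embeddings. This is a minimal, hypothetical example using scikit-learn with toy random data; the dimensions, the n_nonzero_coefs budget, and the mean-of-subword-pieces target are illustrative assumptions, not the project's actual recipe.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

# Toy stand-in for the source model's embedding matrix (e.g. Qwen3's).
rng = np.random.default_rng(0)
d_model, src_vocab = 32, 200
source_embeddings = rng.normal(size=(src_vocab, d_model))

# Hypothetical target vector for a new Portuguese token: here, the mean of
# the embeddings of the source-tokenizer pieces it replaces (one common
# initialization heuristic; an assumption in this sketch).
new_token_target = source_embeddings[[3, 17, 42]].mean(axis=0)

# OMP finds a sparse combination of source embeddings approximating the
# target: new_token_target ~= source_embeddings.T @ coef, few nonzero coefs.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=8)
omp.fit(source_embeddings.T, new_token_target)  # columns = source embeddings

# Sparse reconstruction used as the new token's initial embedding.
new_embedding = source_embeddings.T @ omp.coef_
print(np.count_nonzero(omp.coef_))  # number of source embeddings used (<= 8)
```

Initializing new-token embeddings from a sparse mix of existing ones (rather than randomly) is what lets continual pretraining start from representations the base model already understands.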
Key Capabilities
- Portuguese Language Specialization: Optimized for the lexical, morphological, and orthographic characteristics of Portuguese.
- Strong Benchmark Performance: Achieves a Total Avg. NPM score of 59.21, outperforming its base model Qwen3-4B-Base (57.86) and Qwen2.5-7B (57.97) on Portuguese-specific benchmarks.
- Research Foundation: Designed to serve as a base for research and development, with checkpoints available for comparative experiments on continual pretraining effects.
- Reproducible Training: All training data, source code, and configurations are publicly available.
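For readers unfamiliar with the benchmark figures above, NPM (Normalized Preferred Metric) rescales each task score so that random-chance performance maps to 0 and a perfect score to 100, then averages across tasks. The sketch below shows this common definition; the formula and the task baselines used are assumptions for illustration, not the leaderboard's exact code.

```python
def npm(score: float, random_baseline: float) -> float:
    """Rescale a 0-100 score so the random baseline maps to 0, perfect to 100."""
    return 100.0 * (score - random_baseline) / (100.0 - random_baseline)

# Hypothetical example: chance-level on a 4-option multiple-choice task
# (25% random baseline) yields an NPM of 0.
print(npm(25.0, 25.0))   # → 0.0
print(npm(100.0, 25.0))  # → 100.0

# Average NPM over a hypothetical mix of tasks with different baselines.
tasks = [(62.5, 25.0), (80.0, 50.0), (40.0, 20.0)]  # (score, baseline) pairs
total_avg_npm = sum(npm(s, b) for s, b in tasks) / len(tasks)
print(total_avg_npm)  # → 45.0
```

Averaging normalized rather than raw scores keeps tasks with high chance baselines from inflating the aggregate.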
Intended Use Cases
- Research and Development: Ideal for exploring and advancing Portuguese language modeling.
- Comparative Experiments: Useful for studying the impact of continual pretraining on model performance.
- Fine-tuning and Adaptation: Can be fine-tuned for specific deployment scenarios under the Apache 2.0 license, with users advised to conduct their own risk assessments.
Limitations
- Not for Direct Deployment: Not intended as an out-of-the-box product for human-facing interactions.
- Portuguese Only: Unsuitable for text generation tasks in other languages.
- Base Model: Has not been fine-tuned for downstream tasks and may exhibit common LLM issues like hallucinations, biases, and repetition.