Polygl0t/Tucano2-qwen-0.5B-Base

Text Generation | Concurrency Cost: 1 | Model Size: 0.8B | Quant: BF16 | Ctx Length: 32k | Published: Dec 27, 2025 | License: apache-2.0 | Architecture: Transformer | Open Weights | Warm

Polygl0t/Tucano2-qwen-0.5B-Base is a 0.5 billion parameter decoder-only Transformer model, continually pretrained from Qwen3-0.6B-Base by Polygl0t. Optimized for Portuguese, it uses token embedding transplantation to enhance sensitivity to the language's lexical and morphological properties. This model achieves state-of-the-art performance on Portuguese benchmarks and is intended for research and development in Portuguese language modeling.

Tucano2-qwen-0.5B-Base: A Specialized Portuguese LLM

Polygl0t/Tucano2-qwen-0.5B-Base is a 0.5 billion parameter decoder-only Transformer model, part of the Polygl0t initiative to advance language models for low-resource languages. It was continually pretrained from Qwen3-0.6B-Base on approximately 50 billion tokens. Its tokenizer was adapted through token embedding transplantation via Orthogonal Matching Pursuit, so that the model better handles the lexical, morphological, and orthographic characteristics of Portuguese.
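
The card does not spell out how Orthogonal Matching Pursuit (OMP) is applied, so the following is only a minimal sketch of one common formulation, assuming NumPy is available. A new token's embedding in a donor space is approximated as a sparse combination of anchor tokens shared by both vocabularies, and the same coefficients are then applied to the base model's anchor embeddings. The function names, the donor/base split, and the default sparsity `k` are illustrative assumptions, not the Polygl0t implementation.

```python
import numpy as np


def omp(dictionary, target, k):
    """Greedy Orthogonal Matching Pursuit.

    Approximates `target` (shape: dim) as a combination of at most `k`
    rows of `dictionary` (shape: n_atoms x dim).
    Returns the chosen row indices and their coefficients.
    """
    dictionary = np.asarray(dictionary, dtype=float)
    target = np.asarray(target, dtype=float)
    norms = np.maximum(np.linalg.norm(dictionary, axis=1), 1e-12)
    residual = target.copy()
    selected = []
    coeffs = np.zeros(0)
    for _ in range(min(k, dictionary.shape[0])):
        # Pick the unused atom most correlated with the current residual.
        scores = np.abs(dictionary @ residual) / norms
        if selected:
            scores[selected] = -np.inf
        selected.append(int(np.argmax(scores)))
        # "Orthogonal" step: refit all chosen coefficients jointly.
        atoms = dictionary[selected].T  # (dim, n_selected)
        coeffs, *_ = np.linalg.lstsq(atoms, target, rcond=None)
        residual = target - atoms @ coeffs
        if np.linalg.norm(residual) < 1e-10:
            break  # target is already reconstructed
    return selected, coeffs


def transplant_embedding(donor_anchors, base_anchors, donor_new_token, k=8):
    """Hypothetical helper: map one new token into the base embedding space.

    Row i of `donor_anchors` and row i of `base_anchors` must describe the
    same shared token (an assumption of this sketch).
    """
    idx, coeffs = omp(donor_anchors, donor_new_token, k)
    return np.asarray(base_anchors, dtype=float)[idx].T @ coeffs
```

The design point is that the sparse coefficients are computed in one embedding space and reused in the other, so no gradient updates are needed to initialize the new token's vector; continual pretraining can then refine it.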

Key Capabilities & Features

  • Portuguese Language Specialization: Achieves state-of-the-art performance across several benchmarks designed for Portuguese language models, significantly outperforming its base model, Qwen3-0.6B-Base, on both 'Easy Set' and 'Hard Set' evaluations.
  • Reproducible Training: All data, source code, and recipes used for its development are open and fully reproducible, promoting transparency and further research.
  • Continual Pretraining: Continually pretrained from Qwen3-0.6B-Base on approximately 50 billion tokens, which improves its understanding and generation of Portuguese.
  • Small Footprint: With 0.5 billion parameters and a 4,096-token context length, it offers a compact solution for Portuguese NLP tasks.
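
For orientation, here is a minimal sketch of plain text completion with the Hugging Face `transformers` library. It assumes `transformers` and `torch` are installed and that the checkpoint loads through the standard `AutoModelForCausalLM` path. The helper `fit_to_context`, the 4,096-token default (taken from the figure quoted above), and the sampling settings are illustrative assumptions, not recommendations from the model authors.

```python
MODEL_ID = "Polygl0t/Tucano2-qwen-0.5B-Base"
CONTEXT_LENGTH = 4096  # assumption: the context length quoted in this card


def fit_to_context(token_ids, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Keep the most recent prompt tokens so prompt + generation fits."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    return list(token_ids)[-budget:]


def complete(prompt, max_new_tokens=64):
    """Continue `prompt`; this is a base model, so no chat template is used."""
    # Imported lazily so the helper above stays usable without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    ids = fit_to_context(tokenizer(prompt)["input_ids"], max_new_tokens)
    input_ids = torch.tensor([ids])
    with torch.no_grad():
        output = model.generate(
            input_ids,
            attention_mask=torch.ones_like(input_ids),
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,  # illustrative value, not tuned
            repetition_penalty=1.2,  # illustrative value, not tuned
        )
    # Decode only the newly generated tokens.
    return tokenizer.decode(output[0, len(ids):], skip_special_tokens=True)
```

Calling `complete("A capital do Brasil é")` would download the weights on first use and return a sampled continuation. Because sampling is enabled, the text differs from run to run.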

Intended Use Cases

  • Research and Development: Primarily serves as a foundation for research and development in Portuguese language modeling.
  • Comparative Experiments: Checkpoints saved during training provide a controlled setting for comparative experiments on the effects of continual pretraining.
  • Fine-tuning Base: Can be fine-tuned and adapted for deployment in specific applications, provided users conduct their own risk and bias assessments.
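
One way to set up the comparative experiments mentioned above is to score the same held-out Portuguese text with the base model and with the continually pretrained one. Because the two models use different tokenizers, the sketch below reports bits per UTF-8 byte rather than per-token perplexity. It assumes `transformers` and `torch` are installed, that the text fits in the context window, and that `Qwen/Qwen3-0.6B-Base` is the Hub ID of the base model. All helper names are hypothetical. Intermediate checkpoints could in principle be loaded by passing a `revision` to `from_pretrained`, but the card does not list revision names, so none are shown.

```python
import math


def bits_per_byte(total_nll_nats, text):
    """Convert a summed negative log-likelihood (nats) to bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must not be empty")
    return total_nll_nats / (n_bytes * math.log(2))


def total_nll(model_id, text):
    """Hypothetical helper: summed NLL of `text` under a causal LM from the Hub."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()

    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    # Token t is predicted from positions < t, so shift logits and labels by one.
    # The first token has no preceding context and is therefore not scored.
    log_probs = torch.log_softmax(logits[0, :-1].float(), dim=-1)
    targets = enc["input_ids"][0, 1:]
    return -log_probs.gather(1, targets.unsqueeze(1)).sum().item()


def compare(text):
    """Return bits per byte for the base model and the adapted model."""
    model_ids = ("Qwen/Qwen3-0.6B-Base", "Polygl0t/Tucano2-qwen-0.5B-Base")
    return {m: bits_per_byte(total_nll(m, text), text) for m in model_ids}
```

Lower bits per byte means the model assigns higher probability to the text. A single passage gives only a rough signal, so a real comparison would average over a held-out corpus.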

Limitations

  • Not for Direct Deployment: Not intended as an out-of-the-box product for human-facing interactions.
  • Portuguese Only: Unsuitable for text generation tasks in other languages.
  • No Downstream Fine-tuning: Has not been fine-tuned for specific downstream tasks and requires further adaptation before practical use.
  • Common LLM Issues: Subject to hallucinations, biases, toxicity, repetition, and verbosity, similar to other large language models trained on web data.