Tucano2-qwen-1.5B-Base: A Specialized Portuguese LLM
Polygl0t/Tucano2-qwen-1.5B-Base is a 1.5-billion-parameter decoder-only transformer model, continually pretrained from Qwen3-1.7B-Base. It is part of the Polygl0t initiative, which focuses on advancing language models for low-resource languages, specifically Portuguese. The model uses the same tokenizer as Tucano2-0.6B-Base, with its token embeddings transplanted via Orthogonal Matching Pursuit to improve sensitivity to Portuguese lexical, morphological, and orthographic properties.
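The model card does not spell out the transplantation recipe itself. One common formulation of OMP-based embedding transplantation, sketched below with synthetic data, expresses each token that exists only in the new vocabulary as a sparse combination of "anchor" tokens shared by both vocabularies, then reuses those sparse coefficients over the anchors' embeddings in the other model's embedding space. The dimensions, anchor count, and sparsity level here are placeholders, not values from the Tucano2 recipe.

```python
# Hedged sketch of OMP-based token-embedding transplantation (not the exact
# Tucano2 procedure). A token present only in the new vocabulary is approximated
# as a sparse combination of anchor tokens shared by both vocabularies; the
# coefficients found in one embedding space are reused in the other.
# All data below is synthetic; shapes and counts are placeholders.
import numpy as np
from sklearn.linear_model import orthogonal_mp

rng = np.random.default_rng(0)
dim_donor, dim_base = 512, 1024   # embedding widths of the two models (placeholders)
num_anchors = 2000                # tokens shared by both vocabularies (placeholder)

anchors_donor = rng.standard_normal((num_anchors, dim_donor))  # anchor embeddings, donor space
anchors_base = rng.standard_normal((num_anchors, dim_base))    # anchor embeddings, base space
new_token_donor = rng.standard_normal(dim_donor)               # new token's embedding, donor space

# Solve: new_token_donor ≈ anchors_donor.T @ coef, with at most 32 nonzero coefficients.
coef = orthogonal_mp(anchors_donor.T, new_token_donor, n_nonzero_coefs=32)

# Reuse the sparse coefficients over the same anchors in the base model's space.
new_token_base = anchors_base.T @ coef
print(new_token_base.shape)  # (1024,) -> embedding row for the transplanted token
```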
Continually pretrained on approximately 50 billion tokens, Tucano2-qwen-1.5B-Base demonstrates state-of-the-art performance on various Portuguese language benchmarks. All development data, source code, and training recipes for the Tucano2 series are openly released, making the model's development fully reproducible.
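As a standard causal language model on the Hugging Face Hub, it can be loaded with the `transformers` library. The sketch below assumes the published repository ID and, optionally, a GPU with bfloat16 support; the prompt and sampling settings are illustrative, not prescribed by the model card.

```python
# Minimal sketch: loading the base model for Portuguese text continuation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Polygl0t/Tucano2-qwen-1.5B-Base"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32,
).to(device)

# Base (non-instruct) model: prompt with plain Portuguese text for it to continue.
prompt = "A capital do Brasil é"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```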
Key Capabilities
- Portuguese Language Specialization: Optimized for high performance on Portuguese-language tasks.
- Continual Pretraining: Benefits from extensive continual pretraining on approximately 50 billion tokens, improving upon its Qwen3-1.7B base model.
- Reproducible Research: Provides open data, source code, and recipes for full reproducibility of its development.
- Comparative Experimentation: Checkpoints saved during training enable controlled comparative experiments on the effects of continual pretraining (see the checkpoint-loading sketch after this list).
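Intermediate checkpoints published on the Hub can typically be selected with the `revision` argument of `from_pretrained`. The revision names below are placeholders; the actual branch or tag names depend on how the Tucano2 checkpoints are published.

```python
# Hypothetical sketch: iterating over intermediate checkpoints via Hub revisions.
# "step-10000" and "step-20000" are placeholder revision names, not confirmed tags.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Polygl0t/Tucano2-qwen-1.5B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)

for revision in ["step-10000", "step-20000", "main"]:  # placeholder revision names
    model = AutoModelForCausalLM.from_pretrained(model_id, revision=revision)
    # ... run your Portuguese evaluation suite here and log results per checkpoint ...
    print(revision, sum(p.numel() for p in model.parameters()))
```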
Good For
- Research and Development: Ideal as a foundation model for Portuguese language modeling research.
- Fine-tuning: Suitable for adaptation to specific downstream applications in Portuguese through fine-tuning, provided risk and bias assessments are conducted first (a minimal sketch follows this list).
- Benchmarking: Useful for evaluating the impact of continual pretraining on model performance across various benchmarks.
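A minimal fine-tuning sketch, assuming the `datasets` and `peft` libraries and a hypothetical local Portuguese corpus file (`meu_corpus_pt.txt`); the LoRA targets and hyperparameters are illustrative defaults, not recommendations from the Tucano2 authors.

```python
# Illustrative parameter-efficient fine-tuning (LoRA) sketch on a Portuguese corpus.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "Polygl0t/Tucano2-qwen-1.5B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # fall back to EOS if no pad token is set
model = AutoModelForCausalLM.from_pretrained(model_id)

# Attach LoRA adapters; q_proj/v_proj targeting is a common default for Qwen-style blocks.
model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16,
                                         target_modules=["q_proj", "v_proj"]))

# Placeholder corpus: one Portuguese document per line in a local text file.
dataset = load_dataset("text", data_files={"train": "meu_corpus_pt.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tucano2-pt-finetune", per_device_train_batch_size=2,
                           num_train_epochs=1, learning_rate=2e-4, logging_steps=50),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```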