ismaprasetiyadi/Biawak-8B-Base
Text generation · Concurrency cost: 1 · Model size: 8B · Quant: FP8 · Context length: 32k · Published: Dec 2, 2025 · Architecture: Transformer

Biawak-8B-Base is an 8-billion-parameter causal language model developed by AITF Indonesia, built upon the Qwen-3-8B architecture. This model is specifically adapted through continued pre-training on a curated Indonesian dataset, focusing on Digital Space Protection (PRD) and Digital Talent Pool (DTP) domains. It excels in understanding Indonesian digital policies, cybersecurity, and workforce development, making it suitable for specialized text completion tasks within these areas.


Biawak-8B-Base: A Domain-Specialized Indonesian LLM

Biawak-8B-Base is an 8-billion-parameter Large Language Model (LLM) developed by AITF Indonesia. It is built on the Qwen-3-8B base model through Continued Pre-training (CPT), specifically adapted for Indonesia's strategic focus areas: Digital Space Protection (PRD) and Digital Talent Pool (DTP).

Key Capabilities & Training

  • Domain Specialization: Trained on a 214.2 million token Indonesian dataset, with significant portions dedicated to PRD (42.9%) and DTP (~43.9%) topics, alongside general Indonesian Wikipedia data.
  • Language Focus: Primarily Indonesian, with secondary English support.
  • Base Model: Functions as a base causal language model, designed for text completions and adaptable into chat/instruct variants through further fine-tuning.
  • Training Hardware: Continued pre-training was conducted on NVIDIA A100 80GB GPUs for approximately 36 hours.
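The composition above implies a rough per-domain token budget. A quick sanity-check sketch (the percentages and total come from this card; the helper function itself is illustrative):

```python
# Back-of-envelope token budget implied by the stated dataset composition:
# 214.2M tokens total, 42.9% PRD, ~43.9% DTP, remainder general Wikipedia data.
TOTAL_TOKENS = 214_200_000
SHARES = {"PRD": 0.429, "DTP": 0.439}

def domain_tokens(total: int, shares: dict) -> dict:
    """Approximate per-domain token counts; the leftover share is general data."""
    counts = {name: round(total * frac) for name, frac in shares.items()}
    counts["general"] = total - sum(counts.values())
    return counts

counts = domain_tokens(TOTAL_TOKENS, SHARES)
print(counts)  # PRD and DTP each land near 92-94M tokens; ~28M general
```

So roughly 92M tokens target PRD and 94M target DTP, leaving about 28M tokens of general Indonesian text.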

Intended Use Cases

This model is designed to provide a sovereign, domain-specialized Indonesian foundation model with a strong understanding of:

  • Digital Space Protection (PRD):
    • Policy sentiment analysis
    • Misinformation pattern detection
    • Understanding legal terminology (e.g., UU ITE, UU PDP)
  • Digital Talent Pool (DTP):
    • Skill gap analysis
    • Curriculum drafting assistance
    • Understanding job descriptions and talent profiles
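Because this is a base (non-instruct) checkpoint, it is driven through plain text completion rather than chat turns. A minimal usage sketch with Hugging Face transformers (the prompt helper and the generation settings are illustrative assumptions, not part of the official card):

```python
MODEL_ID = "ismaprasetiyadi/Biawak-8B-Base"  # repo id from this page

def build_prompt(topic: str) -> str:
    # Base models continue text: phrase the input as a document prefix
    # to be completed, not as an instruction or a chat message.
    return f"Ringkasan kebijakan {topic}:\n"

if __name__ == "__main__":
    # Heavyweight part kept under the main guard; requires transformers,
    # torch, and enough GPU memory for an 8B checkpoint.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    inputs = tokenizer(build_prompt("UU PDP"), return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128,
                            do_sample=True, temperature=0.7)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

For chat-style interaction in the PRD/DTP domains, fine-tune an instruct variant first, as the Limitations section below recommends.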

Limitations and Recommendations

As a base model, Biawak-8B-Base requires Supervised Fine-Tuning (SFT) for optimal performance in specific PRD/DTP applications. Users should be aware of potential biases inherited from web data and the possibility of factual hallucinations. Fine-tuning on high-quality instruction datasets and running evaluation benchmarks are recommended before production deployment.