Model Overview

KomdigiUB-8B-Base, developed by Tim 1 AITF, is an 8 billion parameter Indonesian base language model. It is built upon the Qwen3-8B architecture and employs LoRA (Low-Rank Adaptation) with 4-bit quantization for efficient memory and computation. The model's primary language is Indonesian and it is licensed under Apache-2.0.

Key Characteristics & Training

This model is specifically designed for Continued Pre-Training (CPT), focusing on the domain of digital policy and supervision. Its training data, totaling approximately 214 million tokens, is heavily weighted towards:

Digital Talent Policy (DTP): Covering topics like digital occupation, skill trends, and regulations (43.9% of data).
Digital Space Supervision (PRD): Including online gambling, hoaxes, child protection, and related policies (42.9% of data).
Wikipedia ID: Providing general Indonesian knowledge (13.2% of data).

Training was conducted using bf16 mixed precision and 4-bit quantization, with a LoRA rank of 8. Evaluation results show a final training perplexity of ~3.56 and validation perplexity of ~3.55. General benchmarks include MMLU at ~74.20, IndoMMLU at ~65.66, and XCOPA-ID at ~75.80.

Intended Use Cases

KomdigiUB-8B-Base is recommended for:

Domain adaptation in public policy and digital regulation.
Enriching specific Indonesian knowledge within these domains.
Serving as a pre-adaptation step before further instruction tuning or Supervised Fine-Tuning (SFT).

Users are advised to perform additional evaluation before production deployment and to use the Qwen3 chat template for optimal generation. It is not optimized for long-context conversations or high-stakes decision-making without further fine-tuning.

Overview

Model Overview

Key Characteristics & Training

Intended Use Cases

Full Model Card (README)