aitfindonesia/KomdigiUB-8B-Base

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Dec 10, 2025 · License: apache-2.0 · Architecture: Transformer · Open Weights

KomdigiUB-8B-Base by Tim 1 AITF is an 8 billion parameter Indonesian causal language model built on the Qwen3-8B architecture and adapted via Continued Pre-Training (CPT) to the digital policy and oversight domain. Training used LoRA and 4-bit quantization for efficiency and targeted Indonesian public policy and digital regulation contexts. The model reaches a validation perplexity of ~3.55 and scores ~65.66 on IndoMMLU, making it suitable for domain-specific knowledge enrichment and as a pre-adaptation base before further fine-tuning.


Model Overview

KomdigiUB-8B-Base, developed by Tim 1 AITF, is an 8 billion parameter Indonesian causal language model. It is built on the Qwen3-8B architecture and employs LoRA (Low-Rank Adaptation) with 4-bit quantization for efficient training and deployment. The model's primary language is Indonesian, and it is licensed under Apache-2.0.
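The snippet below is a minimal loading sketch, assuming the weights are published on the Hugging Face Hub under the repository name above and that transformers and bitsandbytes are installed; the 4-bit settings mirror the quantization described in this card and should be adjusted to your hardware.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "aitfindonesia/KomdigiUB-8B-Base"

# 4-bit loading, with bf16 compute to match the training precision reported here
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```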

Key Capabilities & Training

This model is the product of Continued Pre-Training (CPT) focused on the domain of digital policy and oversight. Its training data, totaling approximately 214 million tokens, is heavily weighted towards Digital Talent Policy (DTP) and Pengawasan Ruang Digital (PRD), alongside general Indonesian knowledge from Wikipedia. The training procedure used bf16 precision, 4-bit quantization, and an effective batch size of 32 over 1 epoch. Evaluation shows a final validation loss of ~1.264 and a validation perplexity of ~3.55. Benchmarks include ~65.66 on IndoMMLU and ~75.80 on XCOPA-ID.
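The two validation numbers are consistent with each other, since perplexity is simply the exponential of the mean cross-entropy loss:

```python
# Sanity check: perplexity = exp(validation loss).
import math
print(math.exp(1.264))  # ≈ 3.54, in line with the reported ~3.55
```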

Intended Use Cases

  • Domain Adaptation: Ideal for adapting to public policy and digital regulation domains in Indonesia.
  • Knowledge Enrichment: Useful for enriching specific Indonesian knowledge bases.
  • Pre-adaptation: Serves as a strong base for further Instruction Tuning or Supervised Fine-Tuning (SFT) before deployment to end-users.
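For the pre-adaptation use case, a fresh LoRA adapter can be attached to the base model loaded above before SFT. This is a hedged sketch using the peft library; the rank, alpha, and target module names are illustrative assumptions (based on common Qwen-style projection layers), not values confirmed by this card.

```python
from peft import LoraConfig, get_peft_model

# Hypothetical adapter hyperparameters; tune for your SFT dataset.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```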

Because the model has not undergone preference alignment, users should apply the Qwen3 chat template and perform additional fine-tuning before relying on it for chat-oriented instruction following or long-context conversations.
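As a sketch of the recommendation above, and assuming the tokenizer ships the Qwen3 chat template, generation could look like this (the prompt is illustrative only):

```python
# Minimal generation sketch reusing the model and tokenizer loaded above.
messages = [
    # "Explain digital talent policy in Indonesia."
    {"role": "user", "content": "Jelaskan kebijakan talenta digital di Indonesia."}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```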