balgeet/Gurmukh-370M-base
Text Generation · Model size: 0.5B · Quant: BF16 · Context length: 2k · Published: Apr 11, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

Gurmukh-370M-base is a 370-million-parameter GPT-2-architecture causal language model developed by balgeet, trained from scratch on Punjabi text. It is the first openly released GPT-2-scale base model dedicated to Punjabi, supporting both Gurmukhi script and Romanized Punjabi with a 2048-token context length. The model excels at generating Punjabi text, particularly in Gurmukhi script, and serves as a foundation for fine-tuning in Punjabi NLP research and downstream applications.


Gurmukh-370M-base: A Punjabi Language Model

Gurmukh-370M-base is a 370-million-parameter causal language model based on the GPT-2 architecture, developed by balgeet. It is the first openly released model of its scale specifically trained on Punjabi text, supporting both Gurmukhi script and Romanized Punjabi. The model utilizes a custom 64,000-token SentencePiece BPE tokenizer, which achieves high efficiency for Gurmukhi script with a mean fertility of 1.105 tokens per word.
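Fertility here is the mean number of subword tokens produced per whitespace-delimited word; lower means the tokenizer splits words less aggressively. A minimal sketch of the metric, using a hypothetical toy tokenizer in place of the model's actual 64,000-token SentencePiece BPE:

```python
# Token fertility = total subword tokens / total whitespace-delimited words.
# A value near 1.0 (like Gurmukh's 1.105 on Gurmukhi) means most words stay whole.
def mean_fertility(sentences, tokenize):
    words = sum(len(s.split()) for s in sentences)
    tokens = sum(len(tokenize(s)) for s in sentences)
    return tokens / words

# Hypothetical stand-in tokenizer: splits each word into chunks of up to 4 characters.
def toy_tokenize(s):
    return [w[i:i + 4] for w in s.split() for i in range(0, len(w), 4)]

print(mean_fertility(["hello world"], toy_tokenize))  # → 2.0 (4 tokens / 2 words)
```

With the real tokenizer, `tokenize` would be replaced by the model's SentencePiece encoder and `sentences` by a held-out Gurmukhi corpus.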

Key Capabilities & Training

  • Language Focus: Dedicated to Punjabi, trained on approximately 13.8 GB of deduplicated Gurmukhi and Romanized Punjabi text from the Sangraha dataset, totaling 2.5 billion tokens.
  • Architecture: GPT-2 with 24 layers, 1024 hidden size, and 16 attention heads, supporting a 2048-token context length.
  • Performance: Achieves a perplexity of 12.65 on news-domain Punjabi text, indicating strong fluency in formal language generation.
  • Code-Mixing: Naturally handles code-mixed Punjabi (Gurmukhi + English terms) in generation.
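The 370M parameter figure can be roughly reproduced from the listed dimensions. A back-of-envelope sketch, ignoring biases and LayerNorm weights and assuming GPT-2's tied input/output embeddings:

```python
# Architecture from the card: 24 layers, hidden size 1024,
# 64k-token vocabulary, 2048-token context.
d, layers, vocab, ctx = 1024, 24, 64000, 2048

embeddings = vocab * d + ctx * d  # token + learned position embeddings
per_layer = 12 * d * d            # attention (4*d^2) + MLP (8*d^2) weights
total = embeddings + layers * per_layer

print(f"{total / 1e6:.0f}M parameters")  # → 370M parameters
```

The result lands at ~370M, consistent with the model name; the small remainder is the bias and LayerNorm terms omitted above.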

Intended Use Cases

  • Punjabi NLP Research: Ideal for text generation, language understanding, and probing studies in Punjabi.
  • Foundation Model: Designed as a base model for supervised fine-tuning (SFT) to create instruction-following, chat, or question-answering systems.
  • Downstream Tasks: Suitable for fine-tuning on tasks like sentiment analysis, summarization, and Named Entity Recognition (NER) in Punjabi.
  • Voice Pipelines: Can be integrated into spoken Punjabi interfaces when combined with ASR and TTS systems.

Limitations

As a base model, Gurmukh is neither instruction-tuned nor safety-aligned: it may not follow instructions reliably and can produce harmful or factually incorrect text. Performance on conversational or QA-style prompts is weaker because the training corpus is predominantly formal text. Romanized Punjabi generation is also lower quality than Gurmukhi.