Gurmukh-370M-base: A Punjabi Language Model
Gurmukh-370M-base is a 370-million-parameter causal language model based on the GPT-2 architecture, developed by balgeet. It is the first openly released model of its scale specifically trained on Punjabi text, supporting both Gurmukhi script and Romanized Punjabi. The model utilizes a custom 64,000-token SentencePiece BPE tokenizer, which achieves high efficiency for Gurmukhi script with a mean fertility of 1.105 tokens per word.
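The fertility figure quoted above (tokens emitted per word) can be computed for any tokenizer. A minimal sketch of the metric; `demo_tokenize` and the sample strings are illustrative stand-ins, not the model's actual SentencePiece tokenizer:

```python
def mean_fertility(texts, tokenize):
    """Mean number of tokens emitted per whitespace-delimited word.

    `tokenize` is any callable mapping a string to a list of tokens;
    in practice you would pass the model's SentencePiece tokenizer.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    return total_tokens / total_words


# Stand-in tokenizer for illustration: splits each word into
# chunks of up to 4 Unicode code points.
def demo_tokenize(text):
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]


texts = ["ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ", "ਪੰਜਾਬੀ ਭਾਸ਼ਾ"]
print(round(mean_fertility(texts, demo_tokenize), 3))
```

A fertility near 1.0 means the tokenizer rarely splits words, which is what the reported 1.105 for Gurmukhi indicates.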
Key Capabilities & Training
- Language Focus: Dedicated to Punjabi, trained on approximately 13.8 GB of deduplicated Gurmukhi and Romanized Punjabi text from the Sangraha dataset, totaling 2.5 billion tokens.
- Architecture: GPT-2 with 24 layers, 1024 hidden size, and 16 attention heads, supporting a 2048-token context length.
- Performance: Achieves a perplexity of 12.65 on news-domain Punjabi text, indicating proficiency in formal language generation.
- Code-Mixing: Naturally handles code-mixed Punjabi (Gurmukhi + English terms) in generation.
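The perplexity reported above is the exponential of the mean per-token cross-entropy (negative log-likelihood, in nats) on held-out text. A minimal sketch of that relationship; the NLL values shown are illustrative, not real model outputs:

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Illustrative per-token NLLs for a 4-token sequence (not model outputs).
nlls = [2.1, 2.9, 2.4, 2.7]
print(round(perplexity(nlls), 2))
```

Lower perplexity means the model assigns higher probability to the reference text, so 12.65 on news text reflects the formal register dominating the training corpus.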
Intended Use Cases
- Punjabi NLP Research: Ideal for text generation, language understanding, and probing studies in Punjabi.
- Foundation Model: Designed as a base model for supervised fine-tuning (SFT) to create instruction-following, chat, or question-answering systems.
- Downstream Tasks: Suitable for fine-tuning on tasks like sentiment analysis, summarization, and Named Entity Recognition (NER) in Punjabi.
- Voice Pipelines: Can be integrated into spoken Punjabi interfaces when combined with ASR and TTS systems.
Limitations
As a base model, Gurmukh-370M-base is not instruction-tuned or safety-aligned: it may not follow instructions reliably and can produce harmful or factually incorrect text. Performance on conversational or QA-style prompts is weaker because the training corpus is predominantly formal. Romanized Punjabi generation quality is also lower than for Gurmukhi script.