Gurmukh-370M-base: A Punjabi Language Model
Gurmukh-370M-base is a 370-million-parameter causal language model based on the GPT-2 architecture, developed by balgeet. It is the first openly released model of its scale specifically trained on Punjabi text, supporting both Gurmukhi script and Romanized Punjabi. The model utilizes a custom 64,000-token SentencePiece BPE tokenizer, which achieves high efficiency for Gurmukhi script with a mean fertility of 1.105 tokens per word.
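The fertility figure quoted above (tokens emitted per word) can be computed for any tokenizer. A minimal sketch of the metric; `demo_tokenize` and the sample strings are illustrative stand-ins, not the model's actual SentencePiece tokenizer:

```python
def mean_fertility(texts, tokenize):
    """Mean number of tokens emitted per whitespace-delimited word.

    `tokenize` is any callable mapping a string to a list of tokens;
    in practice you would pass the model's SentencePiece tokenizer.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    return total_tokens / total_words


# Stand-in tokenizer for illustration: splits each word into
# chunks of up to 4 Unicode code points.
def demo_tokenize(text):
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]


texts = ["ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ", "ਪੰਜਾਬੀ ਭਾਸ਼ਾ"]
print(round(mean_fertility(texts, demo_tokenize), 3))
```

A fertility near 1.0 means the tokenizer rarely splits words, which is what the reported 1.105 for Gurmukhi indicates.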
Key Capabilities & Training
- Language Focus: Dedicated to Punjabi, trained on approximately 13.8 GB of deduplicated Gurmukhi and Romanized Punjabi text from the Sangraha dataset, totaling 2.5 billion tokens.
- Architecture: GPT-2 with 24 layers, 1024 hidden size, and 16 attention heads, supporting a 2048-token context length.
- Performance: Achieves a perplexity of 12.65 on news-domain Punjabi text, indicating proficiency in formal language generation.
- Code-Mixing: Naturally handles code-mixed Punjabi (Gurmukhi + English terms) in generation.
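The perplexity reported above is the exponential of the mean per-token cross-entropy (negative log-likelihood, in nats) on held-out text. A minimal sketch of that relationship; the NLL values shown are illustrative, not real model outputs:

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Illustrative per-token NLLs for a 4-token sequence (not model outputs).
nlls = [2.1, 2.9, 2.4, 2.7]
print(round(perplexity(nlls), 2))
```

Lower perplexity means the model assigns higher probability to the reference text, so 12.65 on news text reflects the formal register dominating the training corpus.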
Intended Use Cases
- Punjabi NLP Research: Ideal for text generation, language understanding, and probing studies in Punjabi.
- Foundation Model: Designed as a base model for supervised fine-tuning (SFT) to create instruction-following, chat, or question-answering systems.
- Downstream Tasks: Suitable for fine-tuning on tasks like sentiment analysis, summarization, and Named Entity Recognition (NER) in Punjabi.
- Voice Pipelines: Can be integrated into spoken Punjabi interfaces when combined with ASR and TTS systems.
Limitations
As a base model, Gurmukh-370M-base is not instruction-tuned or safety-aligned: it may not follow instructions reliably and can produce harmful or factually incorrect text. Performance on conversational or QA-style prompts is weaker because the training corpus is predominantly formal. Romanized Punjabi generation quality is also lower than for Gurmukhi script.