Qalb-1.0-8B-Instruct: State-of-the-Art Urdu LLM
Qalb-1.0-8B-Instruct is an 8 billion parameter Urdu language model, developed by enstazao, based on the Llama-3.1-8B architecture. It was specifically adapted for Urdu through a two-stage process: continued pre-training on a massive 1.97 billion token Urdu corpus, followed by supervised fine-tuning for instruction following. This model aims to address the gap in low-resource language processing for Urdu, providing fluent, culturally accurate, and context-aware responses that general multilingual models often struggle with.
Key Capabilities
- Deep Urdu Understanding: Trained on diverse Urdu content including news, literature, government documents, and social media.
- Superior Performance: Achieves an overall score of 90.34, outperforming previous state-of-the-art models like Alif-1.0 and LLaMA-3.1 Base in 6 out of 7 benchmark categories for Urdu tasks.
- Reasoning Capable: Demonstrates excellent performance in logical reasoning, mathematical word problems, and commonsense tasks specifically in Urdu.
- Bilingual Proficiency: Maintains strong English language capabilities, making it suitable for translation and code-switching applications.
- Ethical & Safe: Fine-tuned to generate helpful, harmless, and honest content, refusing toxic or misleading outputs.
Ideal Use Cases
- Urdu-centric Applications: Developing chatbots, virtual assistants, or content generation tools specifically for the Urdu language.
- Cross-lingual Tasks: Scenarios requiring translation or code-switching between Urdu and English.
- Research & Development: As a robust baseline or component for further research in low-resource language processing and Urdu NLP.