Model Overview
Tamil-Qwen2.5-7B-Instruct is a specialized large language model developed by Tamil-ai, fine-tuned from the Qwen2.5-7B-Instruct base model. With 7.6 billion parameters, it focuses on enhancing performance for the Tamil language, particularly in instruction-following and linguistic tasks. The model was trained using QLoRA on a comprehensive dataset of 150,000 deduplicated Tamil instruction-response pairs, including data from Tamil Alpaca, Tamil Orca, Tamil Dolly, and specialized morphological drills.
Key Differentiators
- Optimized Tamil Tokenization: Built on Qwen2.5, which demonstrates a 4.62x Tamil token ratio, making it significantly more efficient for processing Tamil compared to other base models like Llama 3.1 (5.8x) or Mistral (7.2x).
- Specialized Training Data: Fine-tuned on a diverse set of Tamil instruction pairs, including specific morphological and grammar QA data, to improve linguistic understanding.
Intended Use Cases
- Tamil Question Answering: Excels at understanding and responding to queries in Tamil.
- Morphological Analysis: Designed for tasks involving the analysis of Tamil word structures.
- Grammar and Linguistics: Suitable for research and applications requiring deep understanding of Tamil grammar.
- Low-Resource Language Research: A valuable tool for exploring and developing LLMs for languages with limited digital resources.
Limitations
- Performance may be reduced for colloquial or slang Tamil due to its primary training on instructional content.
- English language capabilities might be degraded compared to the original Qwen2.5-7B-Instruct base model.