Model Overview
abhinand/gemma-2b-tamil is an experimental 2.6-billion-parameter foundational language model, continually pretrained from Google's Gemma 2B. Developed by Abhinand Balachandran, it adapts Gemma to the Tamil language without expanding the original vocabulary, making it a bilingual model that supports both English and Tamil.
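Because the original Gemma vocabulary is kept intact, the Tamil model's tokenizer should match the base model's. Below is a minimal sketch to verify this, assuming the standard Hugging Face transformers API; the base-model ID google/gemma-2b is an assumption, and both repositories may require Hub authentication.

```python
# Sketch: confirm the Tamil model reuses Gemma's original vocabulary.
# Assumes `transformers` is installed; google/gemma-2b is gated on the
# Hub and may require `huggingface-cli login` first.
from transformers import AutoTokenizer

tamil_tok = AutoTokenizer.from_pretrained("abhinand/gemma-2b-tamil")
base_tok = AutoTokenizer.from_pretrained("google/gemma-2b")

print(len(tamil_tok), len(base_tok))  # expected: identical vocab sizes
assert tamil_tok.vocab_size == base_tok.vocab_size
```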
Key Capabilities & Training
- Bilingual Support: Designed for causal language modeling in both English and Tamil.
- Continual Pretraining: The Gemma base model was continually pretrained on all available Tamil Wikipedia data for 3 epochs.
- Finetuning: Subsequently finetuned on a mix of English and Tamil Alpaca datasets for 5 epochs (the instruction-tuned version is available separately).
- Experimental Nature: This is an alpha release, with potential for improved performance through extended pretraining on larger datasets like CulturaX.
- Training Details: Trained in bfloat16 precision on 4x Nvidia RTX 3090 GPUs (a loading and generation sketch follows this list).
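To match the training precision, the model can be loaded in bfloat16. The following is a minimal generation sketch, assuming the standard transformers causal-LM API; the Tamil prompt and sampling settings are illustrative only.

```python
# Sketch: load the model in bfloat16 (the precision it was trained in)
# and generate a continuation for a Tamil prompt. Assumes a CUDA GPU;
# on CPU, drop device_map and use torch.float32 instead.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "abhinand/gemma-2b-tamil"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "தமிழ்நாடு என்பது"  # "Tamil Nadu is ..." (illustrative prompt)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```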
Performance & Limitations
While experimental, the model shows promise for its 2B-parameter size. Evaluation results from the Open LLM Leaderboard report an average score of 45.13, including HellaSwag (10-shot) at 71.30 and MMLU (5-shot) at 38.21. Note that the model has not undergone detoxification and may generate harmful or offensive content; use it with discretion and supervision.
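The leaderboard figures come from EleutherAI's evaluation harness. Below is a hedged sketch of how comparable numbers could be reproduced locally with lm-evaluation-harness; the task name and few-shot count mirror the shots quoted above, but exact harness versions, task configurations, and scores may differ from the hosted leaderboard.

```python
# Sketch: run a leaderboard-style eval with lm-evaluation-harness
# (pip install lm-eval). The 10-shot setting follows the HellaSwag
# figure quoted above; results may not match the leaderboard exactly.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=abhinand/gemma-2b-tamil,dtype=bfloat16",
    tasks=["hellaswag"],
    num_fewshot=10,
)
print(results["results"]["hellaswag"])
```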