guaran-ia/gntweets-lm
guaran-ia/gntweets-lm is a 9-billion-parameter, Gemma2-based causal language model developed by guaran-ia and fine-tuned on a publicly available corpus of Guarani and Jopara tweets. It is designed for computing the perplexity of Guarani text and serves as a text-quality validation tool rather than as a generative model. It has an 8192-token context length and measures how predictable a text is, and therefore how closely it resembles high-quality Guarani corpora.
GNTweetsLM: Guarani Text Quality Validation Model
GNTweetsLM is a 9-billion-parameter language model developed by guaran-ia, built on the Gemma2ForCausalLM architecture. Unlike typical generative LLMs, it is primarily intended to validate the quality of Guarani text by computing perplexity scores.
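For reference, the standard definition of perplexity for a tokenized document $x_1, \dots, x_N$ under a causal language model $p_\theta$ is given below. The notation is introduced here for illustration and does not come from the model card. Lower values mean the model finds the text more predictable.

$$
\mathrm{PPL}(x_1, \dots, x_N) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\left(x_i \mid x_1, \dots, x_{i-1}\right) \right)
$$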
Key Capabilities
- Perplexity Computation: Designed to calculate the perplexity of Guarani documents, indicating text predictability and similarity to a high-quality reference corpus.
- Guarani and Jopara Expertise: Fine-tuned on a specialized corpus of tweets in Guarani and Jopara (Góngora et al., 2021).
- Full Fine-tuning: All model weights were updated during training, based on princeton-nlp/gemma-2-9b-it-SimPO.
- Long Context Support: Features a maximum context length of 8192 tokens, with a provided method for perplexity calculation over longer texts using a sliding window approach.
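The sliding window idea mentioned above can be sketched roughly as follows. This is not the method shipped with the model, only a minimal illustration of the general technique. The function name `sliding_window_perplexity`, the `token_nll` callback, and the default `stride` are assumptions made for this example. The callback stands in for a forward pass of the model: it receives a window of token ids and returns the negative log-likelihood (in nats) of the last `n_targets` tokens of that window.

```python
import math
from typing import Callable, List, Sequence


def sliding_window_perplexity(
    token_ids: Sequence[int],
    token_nll: Callable[[Sequence[int], int], List[float]],
    max_length: int = 8192,   # context length stated in the model card
    stride: int = 4096,       # hypothetical default, chosen for illustration
) -> float:
    """Perplexity of a long token sequence, scored window by window.

    `token_nll(window, n_targets)` is a hypothetical callback that returns
    the per-token negative log-likelihoods of the last `n_targets` tokens
    of `window`, using the earlier tokens of the window as context only.
    """
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    if not 0 < stride <= max_length:
        raise ValueError("stride must satisfy 0 < stride <= max_length")

    total_nll = 0.0
    n_scored = 0
    prev_end = 0  # index just past the last token already scored

    for begin in range(0, len(token_ids), stride):
        end = min(begin + max_length, len(token_ids))
        # Only tokens that no earlier window has scored count as targets;
        # the overlapping prefix of the window serves as context.
        n_targets = end - prev_end
        nlls = token_nll(token_ids[begin:end], n_targets)
        total_nll += sum(nlls)
        n_scored += len(nlls)
        prev_end = end
        if end == len(token_ids):
            break

    # Perplexity is the exponential of the mean negative log-likelihood.
    return math.exp(total_nll / n_scored)
```

In practice `token_nll` would wrap the model's forward pass and return the per-token cross-entropy for the trailing positions; that wrapper is omitted here because it depends on how the model is loaded. A smaller `stride` gives each scored token more context at the cost of more forward passes.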
Good For
- Guarani Text Quality Assessment: Ideal for researchers and developers needing to programmatically evaluate the quality or naturalness of written Guarani.
- Linguistic Research: Useful for studies involving the Guarani language, particularly in understanding text predictability and corpus characteristics.
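As an illustration of programmatic quality assessment, the sketch below keeps only documents whose perplexity falls at or below a cutoff. The function name `filter_by_perplexity`, the document ids, and the scores and cutoff in the usage lines are all hypothetical; a suitable cutoff would have to be calibrated on text of known quality.

```python
from typing import Dict, List


def filter_by_perplexity(scores: Dict[str, float], max_perplexity: float) -> List[str]:
    """Return ids of documents at or below the cutoff, most predictable first."""
    kept = [doc_id for doc_id, ppl in scores.items() if ppl <= max_perplexity]
    return sorted(kept, key=lambda doc_id: scores[doc_id])


# Hypothetical perplexity scores, e.g. produced by a perplexity routine.
example_scores = {"doc_a": 12.0, "doc_b": 250.0, "doc_c": 40.0}
accepted = filter_by_perplexity(example_scores, max_perplexity=100.0)
```

A fixed cutoff is the simplest choice; ranking documents and keeping a top fraction is an alternative when absolute perplexity values are hard to interpret.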
Limitations
- Not Generative: This model is explicitly not intended for text generation; its utility is confined to perplexity calculation.
- Bias Reflection: May reflect biases present in its training corpus of tweets.