Model Overview
sapinsapin/llama31-8b-balitanlp-cpt is an 8-billion-parameter language model based on the Meta Llama-3.1 architecture. The model has undergone continuous pretraining on Filipino news articles from the LanceBunag/BalitaNLP dataset, adapting the base Llama-3.1 model to better understand and generate Filipino text.
Key Characteristics
- Base Model: Meta Llama-3.1-8B.
- Continuous Pretraining: Focused on Filipino news articles.
- Context Length: Supports a context length of 32768 tokens.
- Training Data: Utilized the BalitaNLP dataset, consisting of Filipino news articles.
- Training Parameters: Trained with a learning rate of 3e-05, a per-device batch size of 1 with gradient accumulation over 8 steps (effective batch size of 8), and 100 training steps.
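Given the hyperparameters above, a quick back-of-the-envelope calculation shows how much data a 100-step run covers. The token count assumes every sequence is packed to the full 32768-token context, so it is an upper bound, not a figure reported by the model authors:

```python
# Training-budget arithmetic from the hyperparameters listed above.
per_device_batch = 1      # batch size per device
grad_accum_steps = 8      # gradient accumulation ("accumulated to 8")
train_steps = 100         # optimization steps
context_length = 32768    # maximum sequence length in tokens

effective_batch = per_device_batch * grad_accum_steps   # 8 sequences per step
sequences_seen = effective_batch * train_steps          # 800 sequences total
max_tokens_seen = sequences_seen * context_length       # upper bound on tokens seen

print(effective_batch, sequences_seen, max_tokens_seen)  # → 8 800 26214400
```

At most roughly 26M tokens, which is a very small continued-pretraining budget relative to the base model's pretraining corpus.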
Intended Use Cases
This model is primarily a continuously pretrained base model, meaning it has learned aspects of the Filipino language but is not yet instruction-tuned for specific tasks. It is suitable for:
- Further Fine-tuning: Serving as a foundation for instruction tuning on downstream tasks in Filipino.
- Filipino Language Understanding: Applications requiring a deeper understanding of Filipino text, particularly in the news domain.
- Research and Development: Exploring the effects of continuous pretraining on specific languages and datasets.
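As a base model, the checkpoint can be loaded with the standard Hugging Face `transformers` text-generation API. The snippet below is a minimal sketch, assuming `transformers` and `torch` are installed and the weights are accessible; the prompt and sampling settings are illustrative choices, not recommendations from the model authors:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "sapinsapin/llama31-8b-balitanlp-cpt"

def generate(prompt: str, max_new_tokens: int = 128) -> str:
    """Complete a Filipino-text prompt with the continuously pretrained model."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.7
    )
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

if __name__ == "__main__":
    # Base models continue text; they do not follow instructions.
    print(generate("MAYNILA - Inanunsyo ng pamahalaan na"))
```

Because the model is not instruction-tuned, it should be prompted as a text-completion model (e.g. with the opening of a news article) rather than with questions or commands.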