HiTZ/gl_Llama-3.1-8B
Text Generation | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Context Length: 32k | Published: Dec 18, 2025 | License: llama3.1 | Architecture: Transformer
HiTZ/gl_Llama-3.1-8B is an 8-billion-parameter Galician (gl) base language model developed by the HiTZ Research Center. Built on the Llama 3.1 architecture, it was further pretrained on approximately 3.5 billion tokens of curated Galician data. It is released as a base model, intended primarily for subsequent fine-tuning, such as instruction tuning or adaptation to domain-specific applications.
HiTZ/gl_Llama-3.1-8B: A Galician Language Base Model
HiTZ/gl_Llama-3.1-8B is an 8-billion-parameter base language model developed by the HiTZ Research Center. It is designed specifically for the Galician language and builds on the Llama 3.1 architecture.
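As a base model, it can be loaded through the standard Hugging Face `transformers` API. The sketch below is a minimal illustration: the prompt, dtype, and sampling settings are assumptions, not values from the model card.

```python
# Minimal sketch: load the base model and sample a Galician continuation.
# Assumes `transformers`, `torch`, and `accelerate` are installed; the dtype,
# prompt, and generation settings are illustrative, not from the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HiTZ/gl_Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 weights fit on a single GPU
    device_map="auto",
)

# A base model continues text rather than following instructions, so prompt
# it with a passage to complete, not a question to answer.
prompt = "A lingua galega"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```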
Key Characteristics
- Language-Specific Pretraining: The model underwent further pretraining on a curated dataset of approximately 3.5 billion Galician tokens, alongside a small English subset to mitigate catastrophic forgetting.
- Base Model Design: Released as a base model, its primary purpose is to serve as a foundation for further fine-tuning, instruction tuning, or domain adaptation, rather than direct out-of-the-box application.
- Training Data: Galician data was sourced from the CorpusNÓS corpus, which includes large-scale web crawls and public administration texts. The English subset was sampled from the FineWeb corpus.
- Training Configuration: Trained with a sequence length of 8,192 tokens and an effective batch size of 256 sequences, using a cosine decay learning rate schedule (see the sketch after this list).
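At a sequence length of 8,192 tokens and an effective batch size of 256 sequences, each optimizer step covers roughly 2.1 million tokens. The cosine decay schedule named above is available off the shelf in `transformers`; in the sketch below the optimizer, warmup steps, and total step count are placeholders, not the values used by HiTZ.

```python
# Sketch of the cosine-decay learning-rate schedule named in the card.
# The optimizer and all step counts are illustrative placeholders.
import torch
from transformers import get_cosine_schedule_with_warmup

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for model parameters
optimizer = torch.optim.AdamW(params, lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,       # placeholder
    num_training_steps=10_000,  # placeholder
)
for step in range(10_000):
    optimizer.step()
    scheduler.step()  # LR rises during warmup, then decays along a cosine curve
```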
Intended Use
This model is ideal for developers and researchers looking to:
- Develop Galician-specific NLP applications requiring a strong language foundation.
- Fine-tune for specialized tasks such as chatbots, summarization, or translation in Galician (see the fine-tuning sketch after this list).
- Experiment with instruction tuning or domain adaptation for low-resource languages, following methodologies like those proposed by Etxaniz et al. (2024).
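For the fine-tuning workflows listed above, a parameter-efficient approach such as LoRA is a common starting point. The sketch below uses the `peft` library; the rank, alpha, and target modules are illustrative defaults, and nothing here is prescribed by the model card.

```python
# Hedged sketch: attach LoRA adapters to the base model for instruction
# tuning. All hyperparameters are illustrative defaults; supply your own
# Galician instruction dataset and training loop.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("HiTZ/gl_Llama-3.1-8B")
lora_config = LoraConfig(
    r=16,             # assumption: a common low-rank setting
    lora_alpha=32,    # assumption: typical scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Llama attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
# From here, train with a standard loop or `transformers.Trainer` on a
# Galician instruction dataset.
```

Training only the low-rank adapters keeps memory requirements modest, which suits the low-resource experimentation described above.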