AIDC-AI/Marco-LLM-ES

Warm
Public
7.6B
FP8
32768
1
Jan 10, 2025
License: apache-2.0
Hugging Face

AIDC-AI's Marco-LLM-ES is a 7.6 billion parameter Transformer-based language model adapted to the languages commonly used in Spain: Catalan, Basque, Galician, and Spanish. It has undergone continued pretraining on approximately 50 billion tokens to strengthen its capabilities in these languages. The model uses an improved tokenizer that adapts to multiple languages, and it is intended as a base for further post-training, such as SFT or RLHF, for specific use cases.

Overview

Marco-LLM-ES is a series of language models developed by AIDC-AI and enhanced for the languages prevalent in Spain: Catalan, Basque, Galician, and Spanish. This 7.6 billion parameter base model has undergone continued pretraining on a 50 billion token dataset, with the goal of improving performance in these languages while remaining competitive on general benchmarks.

Key Capabilities

  • Multilingual Specialization: Optimized for Catalan, Basque, Galician, and Spanish through extensive continued pretraining.
  • Transformer Architecture: Uses a Transformer architecture with SwiGLU activation, QKV bias in attention, and grouped-query attention.
  • Adaptive Tokenizer: Features an improved tokenizer designed for multiple languages.
  • Scalable Series: Part of a larger series ranging from 7B to 72B parameters, including both base and instruction-tuned variants.
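
As a rough illustration of two of the architectural terms above, the sketch below implements a SwiGLU feed-forward step and the query-to-KV head mapping used in grouped-query attention, in plain Python. The function names, toy weights, and head counts are assumptions chosen for illustration; they are not taken from the Marco-LLM-ES code or configuration.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(w, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return matvec(w_down, hidden)


def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Grouped-query attention: consecutive query heads share one KV head."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


# Toy example with illustrative sizes (not the model's real dimensions).
identity = [[1.0, 0.0], [0.0, 1.0]]
output = swiglu_ffn([1.0, 0.0], identity, identity, identity)
shared_kv = kv_head_for_query_head(5, n_q_heads=8, n_kv_heads=2)
```

With 8 query heads and 2 KV heads in this toy setup, each KV head serves a group of 4 query heads, which is how grouped-query attention reduces the size of the key/value cache relative to standard multi-head attention.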

Usage Recommendations

This base model is not recommended for direct text generation. Developers should apply post-training methods such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), or further continued pretraining to adapt it to specific applications.
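
A minimal sketch of a starting point for SFT follows. It assumes the Hugging Face `transformers` library and PyTorch are installed and that the checkpoint loads through the generic `Auto*` classes under the repository ID shown in the page title. The helper names `load_base_model` and `build_sft_labels` are hypothetical. The label helper follows the common convention of masking prompt tokens with -100 so that the loss is computed only on response tokens.

```python
MODEL_ID = "AIDC-AI/Marco-LLM-ES"  # repository ID as given in the page title
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def load_base_model(model_id: str = MODEL_ID):
    """Load the tokenizer and base weights as a starting point for post-training.

    Assumption: the checkpoint is compatible with the generic Auto classes.
    """
    # Imported lazily so this module can be read without the heavy dependency.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    return tokenizer, model


def build_sft_labels(prompt_ids, response_ids):
    """Concatenate prompt and response token IDs and mask the prompt in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Example with made-up token IDs: only the response positions carry labels.
example_inputs, example_labels = build_sft_labels([11, 12, 13], [21, 22])
```

Calling `load_base_model()` downloads the weights, so it is left out of the example run. The resulting `input_ids` and `labels` pairs can then be batched and passed to a training loop or trainer of your choice.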

Performance Highlights

On LaLeaderboard (5-shot), the 7B model achieves an average score of 34.16 across Spanish, Catalan, Basque, and Galician benchmarks, which the evaluations present as improved performance on tasks in these languages.