Marco-LLM-GLO: Bridging Multilingual Performance Gaps
Marco-LLM-GLO is a 7.6-billion-parameter multilingual language model developed by AIDC-AI, built on the Transformer architecture. Its core contribution is extensive continual pretraining on a multilingual corpus exceeding 5 trillion tokens, with a strategic focus on improving performance in low-resource languages while preserving strong capabilities in high-resource languages such as English and Chinese.
Key Capabilities & Features
- Multilingual Training: Trained on a diverse dataset covering 29 languages, including both high-resource (e.g., English, Chinese) and low-resource languages (e.g., Kazakh, Nepali).
- Enhanced Tokenizer: Ships with an improved tokenizer designed to handle multilingual text more effectively and accurately (see the loading sketch after this list).
- Performance: Demonstrates significant improvements in multilingual tasks such as machine translation, question answering, and cross-lingual reasoning compared to other open-source models.
- Post-Training Support: Designed to support various post-training methods like Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) for task-specific and language-specific enhancements.
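Since Marco-LLM-GLO is distributed as a standard causal language model, it can be loaded with the Hugging Face transformers library. The snippet below is a minimal sketch, not confirmed usage from the model card: the Hub path `AIDC-AI/Marco-LLM-GLO`, the dtype settings, and the Kazakh prompt are assumptions to be adjusted for your setup.

```python
# Minimal sketch: load the base model and tokenizer with transformers.
# The model ID below is an assumption; substitute the actual Hub path
# or a local checkpoint directory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIDC-AI/Marco-LLM-GLO"  # assumed Hub path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed precision; requires a recent GPU
    device_map="auto",           # requires the accelerate package
)

# As a base model, it continues text rather than following instructions.
prompt = "Қазақстанның астанасы"  # Kazakh: "The capital of Kazakhstan"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because this is a pretrained base checkpoint, the output is a plain continuation of the prompt, not an instruction-following answer.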
When to Use Marco-LLM-GLO
This base model is primarily intended for further adaptation through post-training methods such as SFT, RLHF, or continued pretraining. It is particularly well-suited for applications requiring robust multilingual understanding and generation, especially in scenarios involving a mix of high and low-resource languages. Developers should fine-tune Marco-LLM-GLO for specific downstream tasks rather than using the base model directly for text generation.
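As one illustration of such adaptation, the sketch below shows a bare-bones SFT loop built on the plain transformers Trainer. It is an assumption-laden example rather than the authors' recipe: the dataset file `my_sft_corpus.jsonl`, its `text` field, and all hyperparameters are placeholders.

```python
# Minimal SFT sketch with the transformers Trainer (one possible route;
# higher-level toolkits such as TRL offer equivalent wrappers).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "AIDC-AI/Marco-LLM-GLO"  # assumed Hub path
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # ensure padding is defined
model = AutoModelForCausalLM.from_pretrained(model_id)

# Placeholder dataset: each record holds a full prompt+response string in "text".
dataset = load_dataset("json", data_files="my_sft_corpus.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="marco-llm-glo-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    # Causal-LM collator: labels are the input ids, shifted inside the model.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The same loop extends to language-specific adaptation by substituting a corpus in the target language; DPO or RLHF stages would follow this SFT step with their own toolkits.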