Overview
AIDC-AI/Marco-LLM-AR-V2 is a 7.6-billion-parameter base language model from the Marco-LLM-AR series, developed by AIDC-AI. The series is designed and fine-tuned for languages prevalent in the Arab world, covering Modern Standard Arabic and a range of regional dialects. The model uses a Transformer architecture with SwiGLU activation, attention QKV bias, and grouped-query attention (GQA), together with an enhanced tokenizer optimized for diverse forms of Arabic.
Key Capabilities
- Arabic Language Specialization: Underwent extensive continued pretraining on approximately 50 billion tokens of Arabic-centric data, significantly improving its performance on Modern Standard Arabic and regional dialects.
- Architectural Foundation: Built on a robust Transformer architecture, ensuring competitive general language understanding while excelling in its specialized domain.
- Dialect Adaptability: Features an improved tokenizer designed to adapt to multiple Arabic dialects and written forms, which is crucial for nuanced language processing.
Usage Recommendations
As a base language model, Marco-LLM-AR-V2 is not intended for direct text generation without further adaptation. Developers are advised to apply post-training methods such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), or additional continued pretraining to tailor the model to specific downstream tasks. This yields better performance and closer alignment with the requirements of the target application.
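As a starting point before any post-training, the checkpoint can be loaded with the Hugging Face transformers library. The sketch below is illustrative, not an official quickstart: the model ID comes from this card, while the Arabic prompt and generation parameters are examples chosen here. Note that as a base model it continues text rather than following instructions.

```python
# Minimal sketch: load the base model and run a text completion.
# Downloading the ~7.6B-parameter weights requires substantial disk
# space and GPU memory; device_map="auto" needs accelerate installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIDC-AI/Marco-LLM-AR-V2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # place layers on available devices
)

# Base-model prompting: give a prefix for the model to continue.
# Example prompt (Arabic for "The capital of Egypt is"); illustrative only.
prompt = "عاصمة مصر هي"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For instruction-following or chat behavior, this loading step would typically be followed by SFT (for example with a trainer library) rather than used for generation as-is, in line with the recommendations above.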