aisingapore/Qwen-SEA-LION-v4-8B-VL
Qwen-SEA-LION-v4-8B-VL is an 8-billion parameter Vision-Language Model developed by AI Singapore, built upon the Qwen3-VL-8B-Instruct architecture. It features a native 256K context window and is specifically fine-tuned for multilingual and multicultural fluency across English and seven key Southeast Asian languages. This model excels in domain adaptation for the SEA region, offering enhanced vision-language capabilities and tool use.
Loading preview...
Qwen-SEA-LION-v4-8B-VL: Southeast Asian Multilingual Vision-Language Model
Qwen-SEA-LION-v4-8B-VL, developed by AI Singapore, is an 8-billion parameter Vision-Language Model (VLM) derived from the Qwen3-VL-8B-Instruct architecture. It is specifically designed for the Southeast Asian (SEA) region, undergoing extensive supervised fine-tuning on approximately 9 million instruction-text pairs to achieve strong domain adaptation.
Key Capabilities
- Multilingual & Multicultural Fluency: Supports English and seven key SEA languages: Burmese, Indonesian, Filipino, Malay, Tamil, Thai, and Vietnamese.
- Vision-Language Integration: Inherits and retains robust vision-language capabilities from its Qwen3-VL base, including Visual Question Answering (VQA) and Image Captioning, despite text-focused fine-tuning.
- Long-Context Multimodal Architecture: Features a native 256K context window, enabling processing of extensive multimodal inputs.
- Edge-Optimized Inference: Designed for resource efficiency, making it suitable for various deployment scenarios.
- Tool Use: Incorporates tool-use functionalities, enhancing its utility for complex tasks.
Evaluation and Performance
The model was evaluated on general language capabilities using the SEA-HELM evaluation benchmark, covering tasks like QA, Sentiment Analysis, and Translation. Instruction-following and multi-turn chat capabilities were assessed with SEA-IFEval and SEA-MTBench, which were localized and translated for SEA languages. Notably, evaluations confirmed the successful retention of high-performance vision-language capabilities, benchmarked against original base models on SEA-specific VQA and image captioning datasets.
Good For
- Applications requiring strong multilingual understanding and generation in Southeast Asian languages.
- Multimodal tasks involving both text and images, particularly within a Southeast Asian cultural context.
- Developers seeking a vision-language model optimized for regional relevance and efficiency.