Qwen-SEA-LION-v4-8B-VL: Southeast Asian Multilingual Vision-Language Model

Qwen-SEA-LION-v4-8B-VL, developed by AI Singapore, is an 8-billion parameter Vision-Language Model (VLM) derived from the Qwen3-VL-8B-Instruct architecture. It is specifically designed for the Southeast Asian (SEA) region, undergoing extensive supervised fine-tuning on approximately 9 million instruction-text pairs to achieve strong domain adaptation.

Key Capabilities

Multilingual & Multicultural Fluency: Supports English and seven key SEA languages: Burmese, Indonesian, Filipino, Malay, Tamil, Thai, and Vietnamese.
Vision-Language Integration: Inherits and retains robust vision-language capabilities from its Qwen3-VL base, including Visual Question Answering (VQA) and Image Captioning, despite text-focused fine-tuning.
Long-Context Multimodal Architecture: Features a native 256K context window, enabling processing of extensive multimodal inputs.
Edge-Optimized Inference: Designed for resource efficiency, making it suitable for various deployment scenarios.
Tool Use: Incorporates tool-use functionalities, enhancing its utility for complex tasks.

Evaluation and Performance

The model was evaluated on general language capabilities using the SEA-HELM evaluation benchmark, covering tasks like QA, Sentiment Analysis, and Translation. Instruction-following and multi-turn chat capabilities were assessed with SEA-IFEval and SEA-MTBench, which were localized and translated for SEA languages. Notably, evaluations confirmed the successful retention of high-performance vision-language capabilities, benchmarked against original base models on SEA-specific VQA and image captioning datasets.

Good For

Applications requiring strong multilingual understanding and generation in Southeast Asian languages.
Multimodal tasks involving both text and images, particularly within a Southeast Asian cultural context.
Developers seeking a vision-language model optimized for regional relevance and efficiency.

Overview

Qwen-SEA-LION-v4-8B-VL: Southeast Asian Multilingual Vision-Language Model

Key Capabilities

Evaluation and Performance

Good For

Full Model Card (README)