aisingapore/Qwen-SEA-LION-v4-4B-VL
Qwen-SEA-LION-v4-4B-VL is a 4-billion parameter Vision-Language Model (VLM) developed by AI Singapore, built upon the Qwen3-VL-4B-Instruct architecture. It features a native 256K context window and is optimized for edge inference. This model is specifically fine-tuned on approximately 9 million instruction-text pairs to achieve multilingual and multicultural fluency across English and seven key Southeast Asian languages: Burmese, Indonesian, Filipino, Malay, Tamil, Thai, and Vietnamese. Its primary strength lies in domain adaptation for the SEA region while retaining strong vision-language capabilities.
Loading preview...
Overview
Qwen-SEA-LION-v4-4B-VL is a 4-billion parameter Vision-Language Model (VLM) developed by AI Singapore, specifically designed for the Southeast Asian (SEA) region. It is built on the Qwen3-VL-4B-Instruct architecture and has undergone extensive supervised fine-tuning (SFT) using approximately 9 million instruction-text pairs. This post-training process instills strong multilingual and multicultural fluency, covering English and seven key SEA languages: Burmese, Indonesian, Filipino, Malay, Tamil, Thai, and Vietnamese.
Key Capabilities
- Multilingual and Multicultural Fluency: Fine-tuned for English and 7 SEA languages, making it highly relevant for regional applications.
- Vision-Language Model (VLM): Inherits enhanced vision-language capabilities from the Qwen3-VL architecture, including Visual Question Answering (VQA) and Image Captioning.
- Long-Context Multimodal Architecture: Features a native 256K context window, supporting complex multimodal inputs.
- Edge-Optimized Inference: Designed for resource-efficient deployment.
- Tool Use: Supports tool use functionalities.
Evaluation and Performance
The model was evaluated on general language capabilities using the SEA-HELM evaluation benchmark, covering tasks like QA, Sentiment Analysis, and Translation. Instruction-following and multi-turn chat capabilities were assessed with SEA-IFEval and SEA-MTBench, respectively. Notably, despite text-only fine-tuning, the model successfully retains the high-performance vision-language capabilities of its base model, as confirmed by evaluations on VQA and Image Captioning tasks using SEA-specific datasets.
Good for
- Applications requiring strong language understanding and generation in Southeast Asian languages.
- Multimodal tasks involving both text and images, particularly within a SEA context.
- Use cases where a long context window for multimodal input is beneficial.
- Deployment in resource-constrained environments due to its edge-optimized design.