aisingapore/Qwen-SEA-LION-v4-4B-VL

VISIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Nov 21, 2025Architecture:Transformer0.0K Cold

Qwen-SEA-LION-v4-4B-VL is a 4-billion parameter Vision-Language Model (VLM) developed by AI Singapore, built upon the Qwen3-VL-4B-Instruct architecture. It features a native 256K context window and is specifically fine-tuned for the Southeast Asia (SEA) region. The model excels in multilingual and multicultural fluency, supporting English and seven key SEA languages: Burmese, Indonesian, Filipino, Malay, Tamil, Thai, and Vietnamese, making it suitable for regional multimodal applications.

Loading preview...

Overview

Qwen-SEA-LION-v4-4B-VL is a 4-billion parameter Vision-Language Model (VLM) developed by AI Singapore, specifically designed for the Southeast Asia (SEA) region. It is built on the Qwen3-VL-4B-Instruct architecture and has undergone extensive supervised fine-tuning (SFT) on approximately 9 million instruction-text pairs to achieve strong domain adaptation.

Key Capabilities

  • Multilingual and Multicultural Fluency: Supports English and seven key SEA languages: Burmese, Indonesian, Filipino, Malay, Tamil, Thai, and Vietnamese.
  • Vision-Language Integration: Inherits enhanced vision-language capabilities and a long-context multimodal architecture from Qwen3-VL, including a native 256K context window.
  • Resource Efficient: Designed for edge-optimized inference, making it suitable for environments with limited resources.
  • Tool Use: Includes capabilities for tool integration.

Evaluation and Performance

The model was evaluated on general language capabilities using the SEA-HELM benchmark, covering tasks like QA, Sentiment Analysis, and Translation. Instruction-following and multi-turn chat were assessed with SEA-IFEval and SEA-MTBench, which were localized and translated for SEA languages. Notably, despite text-focused fine-tuning, the model successfully retained the high-performance vision-language capabilities of its base model, as confirmed by evaluations on Visual Question Answering (VQA) and Image Captioning tasks using SEA-specific datasets.

Limitations

Users should be aware that the model has not been aligned for safety and may exhibit limitations such as hallucination and occasional generation of irrelevant content. It was not tested for robustness against adversarial prompting.