aisingapore/Qwen-SEA-LION-v4-4B-VL

VISIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:Nov 21, 2025License:mitArchitecture:Transformer0.0K Open Weights Cold

Qwen-SEA-LION-v4-4B-VL is a 4-billion parameter Vision-Language Model (VLM) developed by AI Singapore, built upon the Qwen3-VL-4B-Instruct architecture. It features a native 256K context window and is optimized for edge inference. This model is specifically fine-tuned on approximately 9 million instruction-text pairs to achieve multilingual and multicultural fluency across English and seven key Southeast Asian languages: Burmese, Indonesian, Filipino, Malay, Tamil, Thai, and Vietnamese. Its primary strength lies in domain adaptation for the SEA region while retaining strong vision-language capabilities.

Loading preview...

Overview

Qwen-SEA-LION-v4-4B-VL is a 4-billion parameter Vision-Language Model (VLM) developed by AI Singapore, specifically designed for the Southeast Asian (SEA) region. It is built on the Qwen3-VL-4B-Instruct architecture and has undergone extensive supervised fine-tuning (SFT) using approximately 9 million instruction-text pairs. This post-training process instills strong multilingual and multicultural fluency, covering English and seven key SEA languages: Burmese, Indonesian, Filipino, Malay, Tamil, Thai, and Vietnamese.

Key Capabilities

  • Multilingual and Multicultural Fluency: Fine-tuned for English and 7 SEA languages, making it highly relevant for regional applications.
  • Vision-Language Model (VLM): Inherits enhanced vision-language capabilities from the Qwen3-VL architecture, including Visual Question Answering (VQA) and Image Captioning.
  • Long-Context Multimodal Architecture: Features a native 256K context window, supporting complex multimodal inputs.
  • Edge-Optimized Inference: Designed for resource-efficient deployment.
  • Tool Use: Supports tool use functionalities.

Evaluation and Performance

The model was evaluated on general language capabilities using the SEA-HELM evaluation benchmark, covering tasks like QA, Sentiment Analysis, and Translation. Instruction-following and multi-turn chat capabilities were assessed with SEA-IFEval and SEA-MTBench, respectively. Notably, despite text-only fine-tuning, the model successfully retains the high-performance vision-language capabilities of its base model, as confirmed by evaluations on VQA and Image Captioning tasks using SEA-specific datasets.

Good for

  • Applications requiring strong language understanding and generation in Southeast Asian languages.
  • Multimodal tasks involving both text and images, particularly within a SEA context.
  • Use cases where a long context window for multimodal input is beneficial.
  • Deployment in resource-constrained environments due to its edge-optimized design.