SEACrowd/Gemma-SEA-LION-v4-27B-VL

Public · Vision · 27B · FP8 · 32768 · License: gemma

Overview

Gemma-SEA-LION-v4-27B-VL: Vision-Text Model for Southeast Asia

Gemma-SEA-LION-v4-27B-VL is a 27-billion-parameter, instruction-tuned vision-text model developed by SEACrowd and AI Singapore. Built on the Gemma 3 architecture, it inherits that family's 128K context length and strong image and text understanding, including document comprehension, visual Q&A, and image-grounded reasoning. The model also supports function calling and structured outputs for system integration.
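For vision-text inference through Transformers, prompts follow the multimodal chat format. The sketch below shows only the payload construction: the image URL and question are placeholders, and the actual model loading and generation steps are indicated in comments rather than executed.

```python
model_id = "SEACrowd/Gemma-SEA-LION-v4-27B-VL"

# One user turn pairing an image with a question, in the role/content
# structure used by Transformers multimodal chat templates.
messages = [
    {
        "role": "user",
        "content": [
            # Placeholder image URL for illustration only.
            {"type": "image", "url": "https://example.com/hawker_stall.jpg"},
            {"type": "text", "text": "What dish is being prepared here?"},
        ],
    }
]

# With the weights available, this payload would be tokenized via
#   processor = AutoProcessor.from_pretrained(model_id)
#   inputs = processor.apply_chat_template(
#       messages, add_generation_prompt=True, tokenize=True, return_tensors="pt")
# and the answer decoded from the model's generate() output.
```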

Key Capabilities & Differentiators

  • Multilingual Vision-Text Understanding: Post-trained on approximately 540k instruction-image pairs in Burmese, English, Indonesian, Khmer, Lao, Malay, Mandarin, Tagalog, Tamil, Thai, and Vietnamese.
  • Southeast Asian Task Optimization: Excels at tasks specific to the Southeast Asian region, demonstrating performance comparable to larger closed models and outperforming other open models under 200 billion parameters as of October 2025.
  • Comprehensive Vision-Text Features: Capable of visual question answering, image captioning, and image-grounded reasoning.
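As a sketch of the function-calling support mentioned above, a tool can be described with a JSON-Schema-style declaration that the model targets when emitting a structured call. The tool name, description, and parameters below are invented for illustration:

```python
import json

# Hypothetical tool declaration; the name and parameters are invented examples.
currency_tool = {
    "name": "get_exchange_rate",
    "description": "Look up the current exchange rate between two currencies.",
    "parameters": {
        "type": "object",
        "properties": {
            "base": {"type": "string", "description": "ISO 4217 code, e.g. SGD"},
            "quote": {"type": "string", "description": "ISO 4217 code, e.g. THB"},
        },
        "required": ["base", "quote"],
    },
}

# The declaration is serialized into the prompt (e.g. via a chat template's
# tools argument); the model then emits a structured call such as:
expected_call = {"name": "get_exchange_rate",
                 "arguments": {"base": "SGD", "quote": "THB"}}
print(json.dumps(expected_call))
```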

Use Cases & Limitations

This model is particularly well-suited to applications requiring deep cultural and visual understanding in Southeast Asian contexts. It has been evaluated on VQA tasks (MARVL, CVQA, WorldCuisines) and image captioning (XM3600), with a focus on SEA examples. While strong on vision-text tasks, its text-only capabilities are comparable to those of its base model, Gemma-SEA-LION-v4-27B-IT, with no significant improvement in that area. Users should also be aware of potential hallucinations; the model has not undergone safety alignment, so downstream applications should add their own safety fine-tuning or guardrails.