mashriram/Sarvam-1-VL-4B-Instruct-VLLM

VISIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 4, 2026License:apache-2.0Architecture:Transformer Open Weights Cold

mashriram/Sarvam-1-VL-4B-Instruct-VLLM is a 4 billion parameter vision-language instruction-tuned model, fine-tuned from Qwen3-VL-4B-Instruct. It specializes in multilingual document understanding, translation, and visual question answering across 22 Indic languages and English. Optimized for VLLM inference, this model offers a balance of accuracy and speed for document-centric visual language tasks.

Loading preview...

Sarvam-1-VL-4B-Instruct-VLLM Overview

This model, developed by mashriram, is a 4 billion parameter vision-language instruction-tuned model, specifically the recommended version for inference. It is built upon the Qwen3-VL-4B-Instruct base model and enhanced through LoRA fine-tuning over 2,000 steps, resulting in a merged 16-bit model.

Key Capabilities & Training

Sarvam-1-VL-4B-Instruct-VLLM is trained on a diverse set of four datasets, focusing on:

  • Multilingual Translation: 40% of its training data covers translation between 22 Indic languages and English (BPCC dataset).
  • Instruction Following: 20% of training involves instruction following across 11 language pairs (Pralekha dataset).
  • Document Understanding: 30% is dedicated to document layout analysis (IndicDLP dataset).
  • Visual Question Answering (VQA): 10% focuses on VQA tasks, particularly for documents (DocVQA dataset).

This specialized training enables the model to handle complex visual and linguistic tasks, especially those involving documents and multiple languages. It supports a wide array of languages including Assamese, Bengali, Hindi, Kannada, Tamil, Telugu, Urdu, and English, among others.

Performance & Usage

Optimized for VLLM serving, the model requires approximately 8-9GB of VRAM (fp16) and aims for a balanced performance in terms of accuracy and inference speed. Developers can integrate it using the transformers library for tasks like document translation from images, as demonstrated in the provided usage examples. The model is released under the Apache 2.0 License.