Name: mashriram/Sarvam-1-VL-4B-Instruct-VLLM API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: mashriram

Sarvam-1-VL-4B-Instruct-VLLM Overview

This model, developed by mashriram, is a 4-billion-parameter instruction-tuned vision-language model and is the variant recommended for inference. It is built on the Qwen3-VL-4B-Instruct base model and fine-tuned with LoRA for 2,000 steps, after which the adapter was merged to produce a 16-bit model.

Key Capabilities & Training

Sarvam-1-VL-4B-Instruct-VLLM is trained on a mixture of four datasets, each targeting one capability:

Multilingual Translation: 40% of the training data covers translation between 22 Indic languages and English (BPCC dataset).
Instruction Following: 20% covers instruction following across 11 language pairs (Pralekha dataset).
Document Understanding: 30% covers document layout analysis (IndicDLP dataset).
Visual Question Answering (VQA): 10% covers VQA tasks, particularly on documents (DocVQA dataset).
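
As a rough illustration of what the 40/20/30/10 split means in practice, the sketch below draws training examples from the four sources in proportion to the stated percentages. Only the dataset names and percentages come from the description above; the `MIXTURE` dictionary, the `sample_sources` helper, and the sampling scheme itself are assumptions for illustration, not the author's actual data pipeline.

```python
import random
from collections import Counter

# Mixture weights taken from the percentages listed above.
# How the author actually interleaved the datasets is not stated;
# weighted random sampling is an assumption made for this sketch.
MIXTURE = {
    "BPCC": 0.40,      # multilingual translation
    "Pralekha": 0.20,  # instruction following
    "IndicDLP": 0.30,  # document layout analysis
    "DocVQA": 0.10,    # document visual question answering
}


def sample_sources(n, seed=0):
    """Return n dataset names drawn in proportion to MIXTURE (hypothetical helper)."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    total = 10_000
    counts = Counter(sample_sources(total))
    for name in MIXTURE:
        print(f"{name}: {counts[name] / total:.1%}")
```

With enough draws, the observed shares approach the stated percentages, which is a quick way to sanity-check a mixture configuration before training.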

This specialized training enables the model to handle complex visual and linguistic tasks, especially those involving documents and multiple languages. It supports a wide array of languages including Assamese, Bengali, Hindi, Kannada, Tamil, Telugu, Urdu, and English, among others.

Performance & Usage

Optimized for vLLM serving, the model requires approximately 8-9 GB of VRAM in fp16 and aims to balance accuracy against inference speed. Developers can also integrate it through the transformers library for tasks such as translating documents from images, as demonstrated in the provided usage examples. The model is released under the Apache 2.0 License.
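
The 8-9 GB figure is consistent with a back-of-the-envelope estimate: fp16 stores each parameter in 2 bytes, so the weights alone account for most of it. The sketch below works through that arithmetic. It assumes exactly 4 billion parameters and decimal gigabytes (both simplifications), and `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Estimate memory for model weights alone, in decimal GB.

    bytes_per_param=2 corresponds to fp16/bf16. Activations, the KV cache,
    and framework overhead are NOT included in this estimate.
    """
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # Assumption: a nominal 4e9 parameters; the true count may differ slightly.
    weights_gb = weight_memory_gb(4e9)
    print(f"fp16 weights: ~{weights_gb:.1f} GB")
```

Whatever remains between this estimate and the quoted 8-9 GB would go to runtime overhead such as activations and the KV cache, which grows with context length and batch size.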
