Next 4B: Türkiye's First Vision-Language Model

Next 4B is a 4.3-billion-parameter multimodal Vision-Language Model (VLM) developed by Lamapi and built on the Gemma 3 architecture. It is Türkiye's first open-source VLM, designed to handle both text and images efficiently. The model is fine-tuned for robust reasoning and for generating context-aware multimodal outputs.

Key Capabilities

Multimodal Understanding: Processes and generates content from both text and image inputs.
Efficient Deployment: Optimized for low VRAM environments, supporting 8-bit quantization for consumer-grade GPUs.
Multilingual Support: Offers strong capabilities in Turkish, alongside broader multilingual understanding.
Advanced Reasoning: Excels in logical and analytical reasoning tasks across modalities.
Consistent Outputs: Provides reliable and reproducible responses.
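
To put the low-VRAM claim in rough numbers, the sketch below estimates how much memory the weights alone would occupy at 16-bit and 8-bit precision, using the 4.3 billion parameter count stated above. This is a back-of-the-envelope calculation, not a measurement: the helper name weight_memory_gb and the bytes-per-parameter table are illustrative assumptions, and activations, the KV cache, and framework overhead are ignored, so real usage will be higher.

```python
# Rough weight-only memory estimate for a 4.3B-parameter model.
# Assumption: every parameter is stored at the listed width; activations,
# KV cache, and framework overhead are NOT included.

PARAMS = 4.3e9  # parameter count stated in the model description

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "16-bit (fp16/bf16)": 2.0,
    "8-bit (int8)": 1.0,
}


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for precision, width in BYTES_PER_PARAM.items():
    print(f"{precision}: ~{weight_memory_gb(PARAMS, width):.1f} GB")
```

Under these assumptions, 8-bit storage halves the weight footprint relative to 16-bit, which is what makes GPUs with modest VRAM a plausible target.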

Performance Highlights

Next 4B demonstrates competitive performance, particularly in mathematical reasoning, achieving 82.7% on GSM8K and 70.5% on MATH. Its MMLU (5-shot) score is 84.6%. Its smaller counterpart, Next 1B, shows even higher scores on MMLU and GSM8K, a strong result within the tiny model category.

Good For

Image Captioning and Multimodal QA: Generating descriptions for images and answering questions based on visual and textual context.
Text Generation and Reasoning: Creating coherent text and performing analytical tasks.
Creative Storytelling: Developing narratives that integrate both visual and textual elements.
Low-Resource Applications: Ideal for deployment on consumer-grade GPUs due to its efficiency and 8-bit quantization support.
Research and Development: An open-source solution for exploring multimodal AI, especially with a focus on Turkish language contexts.
