unsloth/gemma-3-4b-pt

Vision · 4.3B params · BF16 · 32768 context · License: gemma

Overview

Gemma 3 4B Pre-trained Model by Google DeepMind

This model is a 4.3 billion parameter variant from the Gemma 3 family, developed by Google DeepMind. It is a lightweight, state-of-the-art open model built using the same research and technology as the Gemini models. Gemma 3 models are multimodal, capable of processing both text and image inputs to generate text outputs.

Key Capabilities

  • Multimodal Input: Accepts text strings and images, with each image normalized to 896x896 resolution and encoded to 256 tokens (see the usage sketch after this list).
  • Extensive Context Window: Features a total input context of 128K tokens for the 4B, 12B, and 27B sizes, and an output context of 8192 tokens.
  • Multilingual Support: Trained on data including content in over 140 languages.
  • Diverse Task Performance: Well-suited for text generation and image understanding tasks such as question answering, summarization, and reasoning.
  • Efficient Deployment: Its relatively small size makes it suitable for deployment in environments with limited resources like laptops, desktops, or cloud infrastructure.
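
To make the multimodal flow concrete, here is a minimal inference sketch using Hugging Face transformers. The class names (Gemma3ForConditionalGeneration, AutoProcessor) and the <start_of_image> placeholder follow the Gemma 3 integration in recent transformers releases (4.50+), so verify them against your installed version; the image URL is a placeholder, and because this is the pre-trained checkpoint the prompt is framed as a raw completion rather than a chat turn.

```python
# Minimal multimodal sketch, assuming the Gemma 3 integration in
# recent transformers releases; verify class names for your version.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "unsloth/gemma-3-4b-pt"
processor = AutoProcessor.from_pretrained(model_id)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

# The processor resizes and normalizes the image to 896x896 and expands
# <start_of_image> into the 256 image tokens described above.
url = "https://example.com/photo.jpg"  # placeholder image URL
image = Image.open(requests.get(url, stream=True).raw)
prompt = "<start_of_image> in this image, there is"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(generation[0][input_len:], skip_special_tokens=True))
```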

Training and Performance

The 4B model was trained on 4 trillion tokens, encompassing web documents, code, mathematics, and images. It demonstrates strong performance across various benchmarks:

  • Reasoning & Factuality: Achieves 77.2 on HellaSwag (10-shot) and 50.9 on BIG-Bench Hard (few-shot).
  • STEM & Code: Scores 59.6 on MMLU (5-shot) and 36.0 on HumanEval (0-shot).
  • Multilingual: Reaches 34.7 on MGSM and 57.0 on Global-MMLU-Lite.
  • Multimodal: Attains 102 on COCOcap and 72.8 on DocVQA (val).

Intended Usage

This model is designed for a wide range of applications, including content creation (text generation, chatbots), research and education (NLP/VLM research, language learning tools), and image data extraction.
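
As an illustration of the text-only, content-creation path, here is a minimal completion-style sketch. It assumes the same transformers classes as above; the prompt and sampling parameters are purely illustrative.

```python
# Text-only sketch, assuming the Gemma 3 classes in recent transformers.
import torch
from transformers import AutoTokenizer, Gemma3ForConditionalGeneration

model_id = "unsloth/gemma-3-4b-pt"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

# As a pre-trained (base) checkpoint, the model continues text rather
# than following instructions, so tasks are framed as completions.
prompt = "A short summary of the water cycle:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=48, do_sample=True, top_p=0.95)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Because this is the base (pt) checkpoint rather than the instruction-tuned (it) variant, it has no chat template; for chatbot-style applications, the instruction-tuned release is the usual starting point.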