Overview
Gemma 3 4B Pre-trained Model by Google DeepMind
This model is a 4.3-billion-parameter variant from the Gemma 3 family, developed by Google DeepMind. It is a lightweight, state-of-the-art open model built using the same research and technology as the Gemini models. Gemma 3 models are multimodal, capable of processing both text and image inputs to generate text outputs.
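For orientation, below is a minimal multimodal inference sketch in Python. It assumes the checkpoint is published on Hugging Face as `google/gemma-3-4b-pt`, that the installed `transformers` release includes Gemma 3 support, and that `example.jpg` is a placeholder image path; the `<start_of_image>` placeholder is the token the processor expands into image embeddings. Treat this as an illustration under those assumptions, not the official quickstart.

```python
# Minimal multimodal inference sketch.
# Assumptions: Hub id "google/gemma-3-4b-pt", a transformers release with
# Gemma 3 support, and a placeholder image file "example.jpg".
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "google/gemma-3-4b-pt"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The pre-trained (non-instruction-tuned) checkpoint behaves as a completion
# model: the image placeholder token is interleaved with plain text and the
# model continues the prompt.
image = Image.open("example.jpg")
prompt = "<start_of_image> In this image, there is"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```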
Key Capabilities
- Multimodal Input: Accepts text strings and images (normalized to 896x896 resolution, encoded to 256 tokens each).
- Extensive Context Window: Features a total input context of 128K tokens for the 4B, 12B, and 27B sizes, and an output context of 8192 tokens (see the token-budget sketch after this list).
- Multilingual Support: Trained on data including content in over 140 languages.
- Diverse Task Performance: Well-suited for text generation and image understanding tasks such as question answering, summarization, and reasoning.
- Efficient Deployment: Its relatively small size makes it suitable for deployment in environments with limited resources, such as laptops, desktops, or your own cloud infrastructure.
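To make the figures in the list above concrete, the sketch below does the simple token-budget arithmetic implied by the 256-tokens-per-image encoding and the 128K input context. Reading "128K" as 128 × 1024 = 131,072 tokens is an assumption for illustration, not a published constant.

```python
# Token-budget arithmetic for mixed image + text prompts.
# Assumption: "128K" is read as 128 * 1024 = 131,072 tokens; each image
# costs a flat 256 tokens after normalization to 896x896.
MAX_INPUT_TOKENS = 128 * 1024   # total input context (4B, 12B, 27B sizes)
TOKENS_PER_IMAGE = 256          # tokens per encoded image
MAX_OUTPUT_TOKENS = 8192        # output context

def remaining_text_budget(num_images: int) -> int:
    """Tokens left for text once the attached images are accounted for."""
    return MAX_INPUT_TOKENS - num_images * TOKENS_PER_IMAGE

# Example: a prompt with 4 images still leaves room for 130,048 text tokens.
print(remaining_text_budget(4))
```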
Training and Performance
The 4B model was trained on 4 trillion tokens, encompassing web documents, code, mathematics, and images. It demonstrates strong performance across various benchmarks:
- Reasoning & Factuality: Achieves 77.2 on HellaSwag (10-shot) and 50.9 on BIG-Bench Hard (few-shot).
- STEM & Code: Scores 59.6 on MMLU (5-shot) and 36.0 on HumanEval (0-shot).
- Multilingual: Reaches 34.7 on MGSM and 57.0 on Global-MMLU-Lite.
- Multimodal: Attains 102 on COCOcap and 72.8 on DocVQA (val).
Intended Usage
This model is designed for a wide range of applications, including content creation (text generation, chatbots), research and education (NLP/VLM research, language learning tools), and image data extraction.
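As a concrete example of the content-creation case, the sketch below runs the same (assumed) `google/gemma-3-4b-pt` checkpoint on a text-only prompt. Because this is the pre-trained rather than the instruction-tuned variant, the prompt is phrased as text to be continued, not as an instruction.

```python
# Text-only completion sketch for the content-creation use case.
# Assumptions as before: Hub id "google/gemma-3-4b-pt" and a transformers
# release with Gemma 3 support.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "google/gemma-3-4b-pt"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The base model continues whatever text it is given; phrasing the prompt as
# the start of the desired output works better than issuing instructions.
prompt = "A short poem about the first rain of autumn:\n"
inputs = processor(text=prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    output = model.generate(
        **inputs, max_new_tokens=80, do_sample=True, temperature=0.8, top_p=0.95
    )
print(processor.decode(output[0], skip_special_tokens=True))
```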