Overview
Google DeepMind's Gemma 3 models are a family of lightweight, open multimodal models built from the same research and technology as the Gemini models. This 27-billion-parameter instruction-tuned variant processes both text and image inputs and generates text outputs. It features a substantial 128K-token context window and offers multilingual support across over 140 languages.
Key Capabilities
- Multimodal Processing: Handles text and image input (images normalized to 896×896 resolution and encoded to 256 tokens each) and generates text output.
- Extensive Context Window: Supports a total input context of 128K tokens, enabling processing of longer and more complex inputs.
- Multilingual Support: Trained on data including content in over 140 languages, enhancing its global applicability.
- Reasoning and Factuality: Demonstrates strong performance across various reasoning and factuality benchmarks, including 85.6 on HellaSwag (10-shot) and 85.5 on TriviaQA (5-shot).
- STEM and Code: Achieves 78.6 on MMLU (5-shot), 82.6 on GSM8K (8-shot), and 48.8 on HumanEval (0-shot), indicating proficiency in STEM tasks as well as code generation.
- Multimodal Benchmarks: Scores 116 on COCOcap and 85.6 on DocVQA (val), showcasing its image understanding and analysis capabilities.
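To make the figures above concrete, here is a minimal sketch of how the fixed per-image token cost interacts with the context window. It assumes "128K" means 128 × 1024 tokens; the helper name is ours, and this is a back-of-the-envelope budget, not an official tokenizer calculation.

```python
# Budget sketch based on the figures quoted above: each image is
# normalized to 896x896 and encoded to a flat 256 tokens, and the
# total input context is 128K tokens (assumed here to be 128 * 1024).

CONTEXT_WINDOW = 128 * 1024   # 131072 tokens of total input context
TOKENS_PER_IMAGE = 256        # every image costs a fixed 256 tokens

def remaining_text_budget(num_images: int, text_tokens_used: int = 0) -> int:
    """Tokens left for text after accounting for images in the prompt."""
    used = num_images * TOKENS_PER_IMAGE + text_tokens_used
    return CONTEXT_WINDOW - used

# e.g. a prompt carrying 10 images still leaves
# 131072 - 2560 = 128512 tokens for text
print(remaining_text_budget(10))
```

Because the image cost is flat regardless of image complexity, even image-heavy prompts consume the context window slowly: 100 images use only 25,600 tokens, under 20% of the window.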
Good For
- Content Creation: Generating creative text formats, marketing copy, and email drafts.
- Conversational AI: Powering chatbots, virtual assistants, and interactive applications.
- Text Summarization: Creating concise summaries of documents, research papers, or reports.
- Image Data Extraction: Extracting, interpreting, and summarizing visual data for use in text communications.
- Research and Education: Serving as a foundation for VLM and NLP research, language learning tools, and knowledge exploration.
- Resource-Limited Deployment: Its optimized size makes it suitable for deployment on devices with limited computational resources, such as laptops and desktops.