Overview
Gemma 3: Multimodal Foundation Models by Google
Gemma 3 is a family of open, lightweight, state-of-the-art multimodal models from Google, built on the same research and technology as the Gemini models. `google/gemma-3-27b-pt` is the 27-billion-parameter pre-trained model in this family; it accepts both text and image inputs and generates text output.
Key Capabilities & Features
- Multimodal Input: Processes text strings and images (normalized to 896x896 resolution, encoded to 256 tokens each).
- Large Context Window: Supports a total input context of 128K tokens, enough to process long documents and multi-image prompts in full.
- Multilingual Support: Trained on data in over 140 languages, enhancing its global applicability.
- Diverse Training Data: Pre-trained on 14 trillion tokens, including web documents, code, mathematics, and images, ensuring broad knowledge and reasoning abilities.
- Strong Performance: Achieves competitive benchmark results across reasoning, STEM, code, and multimodal tasks, including 85.6 on HellaSwag, 78.6 on MMLU, and 85.6 on DocVQA.
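The figures above imply a simple token-budgeting calculation: with a 128K-token context and each 896x896 image encoded to a fixed 256 tokens, you can work out how much room remains for text in a mixed prompt. The sketch below illustrates this arithmetic; the function name is illustrative and not part of any official Gemma API.

```python
# Sketch: budgeting Gemma 3's 128K-token input context when mixing text
# and images, using the figures stated above. Assumes 128K = 128 * 1024
# tokens; the function name is illustrative, not an official API.

CONTEXT_WINDOW = 128 * 1024   # 131,072 tokens of total input context
TOKENS_PER_IMAGE = 256        # each 896x896 image encodes to 256 tokens

def remaining_text_budget(num_images: int) -> int:
    """Tokens left for text after reserving space for num_images images."""
    used = num_images * TOKENS_PER_IMAGE
    if used > CONTEXT_WINDOW:
        raise ValueError("images alone exceed the context window")
    return CONTEXT_WINDOW - used

# A prompt with 10 images reserves 2,560 tokens, leaving 128,512 for text.
print(remaining_text_budget(10))
```

Because the per-image cost is fixed, even image-heavy prompts consume the context predictably, which makes this kind of up-front budgeting straightforward.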
Good For
- Content Creation: Generating creative text formats, marketing copy, and email drafts.
- Conversational AI: Powering chatbots and virtual assistants.
- Text Summarization: Creating concise summaries of documents and research papers.
- Image Understanding: Extracting, interpreting, and summarizing visual data for use in text-based communication.
- Research & Education: Serving as a foundation for VLM and NLP research, and supporting language learning tools.