Overview
axolotl-mirrors/gemma-3-4b-pt is a 4.3-billion-parameter pre-trained variant from Google DeepMind's Gemma 3 family of open models. These models are multimodal, accepting both text and image inputs (images are normalized to 896x896 resolution and encoded to 256 tokens each) and producing text output. The 4B model supports a 128K-token input context window and can generate up to 8192 output tokens. It is built on the same research and technology as the Gemini models, offering open weights and multilingual coverage of over 140 languages.
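Because each image consumes a fixed number of input tokens, it is easy to budget how much of the context window remains for text. A minimal sketch, using only the 256-tokens-per-image figure from the card above; the function name and the example context size are illustrative, not part of any API:

```python
# Sketch: budgeting input tokens when mixing image and text inputs.
# The only model-specific figure used here is from the card above:
# each 896x896 image is encoded to a fixed 256 tokens.
IMAGE_TOKENS = 256

def remaining_text_budget(context_window: int, n_images: int) -> int:
    """Tokens left for text after reserving space for n_images images."""
    used = n_images * IMAGE_TOKENS
    if used > context_window:
        raise ValueError("images alone exceed the context window")
    return context_window - used

# Four images reserve 1024 tokens, whatever the deployment's context limit is.
print(remaining_text_budget(8192, 4))  # -> 7168
```

The same arithmetic applies at any context size; only the constant per-image cost matters.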
Key Capabilities
- Multimodal Understanding: Processes both text and image inputs for tasks like image content analysis and visual data extraction.
- Text Generation: Capable of generating diverse text formats, including answers, summaries, and creative content.
- Extensive Context: Features a 128K-token input context window, beneficial for long documents and complex tasks.
- Multilingual Support: Trained on data in over 140 languages, enhancing its global applicability.
- Resource-Efficient Deployment: Its relatively small size makes it suitable for deployment on devices with limited resources, such as laptops and desktops.
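The resource-efficiency claim can be made concrete with back-of-the-envelope weight-memory arithmetic. A rough sketch, assuming only the 4.3B parameter count from this card and standard bytes-per-parameter for common dtypes; it ignores KV cache, activations, and framework overhead, which add more in practice:

```python
# Rough weight-only memory estimate per dtype. Real deployments need
# additional memory for the KV cache, activations, and runtime overhead.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Gigabytes needed to hold the raw weights in the given dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

params = 4.3e9  # parameter count stated in the card above
for dtype in ("float32", "bfloat16", "int8", "int4"):
    print(f"{dtype:>8}: ~{weight_memory_gb(params, dtype):.1f} GB")
```

At bfloat16 the weights alone need roughly 8.6 GB, which is why quantized variants are the usual route onto consumer laptops and desktops.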
Training and Performance
The 4B model was trained on 4 trillion tokens spanning web documents, code, mathematics, and images. This diverse mix is intended to strengthen areas such as logical reasoning and programming-language understanding. Reported benchmarks show strong performance in reasoning (e.g., 77.2 on HellaSwag), STEM (e.g., 59.6 on MMLU), and multimodal tasks (e.g., a CIDEr score of 102 on COCO Captions), and Google also reports improvements over previous Gemma generations in safety and factuality evaluations.