dxv2k/gemma-4-E4B-it-merged

  • Tags: Vision
  • Concurrency Cost: 1
  • Model Size: 7.9B
  • Quant: FP8
  • Ctx Length: 32k
  • Tool Calling: Supported
  • Published: Jun 25, 2026
  • License: gemma
  • Architecture: Transformer

dxv2k/gemma-4-E4B-it-merged is a 7.9-billion-parameter language model built on Google's Gemma-4-E4B-it, with a LoRA adapter merged into its base weights. It is configured for instruction-following tasks and requires a GPU with at least 24 GB of VRAM for inference. It is aimed at developers who want a Gemma 4 variant with the LoRA fine-tune already integrated, for conversational and text-generation applications.

Model Overview

This model, dxv2k/gemma-4-E4B-it-merged, is a 7.9-billion-parameter variant of google/gemma-4-E4B-it. A LoRA adapter has been merged directly into the base weights. The adapter targets the q/k/v/o attention projections and the gate/up/down MLP projections of the language model, with rank r=16 and alpha=32, and is intended to improve performance on instruction-following tasks.
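Merging a LoRA adapter folds its low-rank update into each targeted weight matrix, so the merged model produces the same outputs as base-plus-adapter (up to floating-point rounding) without loading the adapter separately. The sketch below illustrates this on toy matrices in pure Python. It assumes standard LoRA scaling, where the update B·A is multiplied by alpha/r (2 for this card's r=16, alpha=32), as in PEFT's default configuration; the helper names, matrix sizes, and values are hypothetical and far smaller than Gemma's real projections.

```python
# Minimal sketch: merging a LoRA update into a base weight matrix.
# Assumes standard LoRA scaling (alpha / r, PEFT's default without rsLoRA).
# The toy matrices below are hypothetical; real projections are much larger.


def matmul(a, b):
    """Multiply matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in m]


def merge_lora(w, a, b, alpha, r):
    """Return W + (alpha / r) * (B @ A).

    w: base weight, shape (d_out, d_in)
    a: LoRA A (down-projection), shape (r, d_in)
    b: LoRA B (up-projection), shape (d_out, r)
    """
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Toy example: r=2, alpha=4 gives the same alpha/r ratio (2) as r=16, alpha=32.
W = [[1, 0, 0], [0, 1, 0]]  # d_out=2, d_in=3
A = [[1, 2, 0], [0, 1, 1]]  # r=2, d_in=3
B = [[1, 0], [0, 1]]        # d_out=2, r=2
x = [1, 1, 1]

merged = merge_lora(W, A, B, alpha=4, r=2)

# Unmerged path: base output plus the scaled adapter output computed separately.
scale = 4 / 2
base_out = matvec(W, x)
adapter_out = matvec(B, matvec(A, x))
unmerged = [y + scale * z for y, z in zip(base_out, adapter_out)]

print(merged)             # [[3.0, 4.0, 0.0], [0.0, 3.0, 2.0]]
print(matvec(merged, x))  # [7.0, 5.0]
print(unmerged)           # [7.0, 5.0]
```

Because the merge is a one-time addition to the weights, inference afterward needs no extra matrix multiplications for the adapter, which is the practical benefit of shipping a merged checkpoint.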

Key Characteristics

  • Architecture: Based on Google's Gemma-4-E4B-it.
  • Parameter Count: 7.9 billion parameters.
  • Precision: Uses bf16 (bfloat16) for its weights.
  • Context Length: Supports a context length of 32768 tokens.
  • LoRA Integration: The LoRA adapter is already merged into the weights, so its fine-tuning applies without loading a separate adapter at inference time.
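The parameter count and bf16 precision listed above are enough to estimate the weight footprint: bytes per parameter times parameter count. This is a back-of-the-envelope sketch, not a measurement; it counts weights only and ignores activations, the KV cache, and framework overhead. The FP8 row assumes 1 byte per parameter, matching the "Quant: FP8" entry in the listing metadata.

```python
# Back-of-the-envelope weight memory for the 7.9B-parameter checkpoint.
# Counts weights only; activations, KV cache and framework overhead are excluded.

PARAMS = 7.9e9  # parameter count from the model card

BYTES_PER_PARAM = {
    "bf16": 2,  # bfloat16: 16 bits per weight
    "fp8": 1,   # 8-bit float, as listed under "Quant" in the metadata
}


def weight_gb(params, dtype):
    """Weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_gb(PARAMS, dtype):.1f} GB")
# bf16: 15.8 GB
# fp8: 7.9 GB
```

The bf16 result agrees with the card's figure of roughly 16 GB (about 14.7 GiB in binary units); the gap to the 24 GB recommendation is presumably headroom for runtime memory.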

Deployment and Usage

  • Inference Requirements: Requires a GPU with at least 24 GB of VRAM; the bf16 weights alone take roughly 16 GB, and the rest presumably leaves room for activations and the KV cache.
  • Custom Handler: Uses a custom handler.py for inference, because the Gemma 4 architecture is new and needs architecture-specific handling.
  • Dependencies: Requires transformers>=5.12.1 for compatibility.
  • Local Use: Provides Python examples for local inference with the transformers library, showing how to apply the chat template and generate responses.
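Because the card requires transformers>=5.12.1, a small pre-flight check can fail fast with a clear message rather than failing later during model loading. This is a hypothetical helper, not part of the repository: the function names are illustrative, it relies only on the standard library's importlib.metadata, and it simplifies version parsing by dropping pre-release suffixes. For strict PEP 440 ordering, packaging.version.Version is the usual tool.

```python
# Hypothetical pre-flight check for the card's transformers>=5.12.1 requirement.
# Standard library only; parsing is simplified (pre-release suffixes dropped).
import re
from importlib.metadata import PackageNotFoundError, version

MIN_TRANSFORMERS = (5, 12, 1)


def release_tuple(ver: str) -> tuple:
    """Leading numeric release of a version string, e.g. '5.13.0rc1' -> (5, 13, 0).

    Simplification: '5.12.1.dev0' becomes (5, 12, 1), although PEP 440
    orders it before 5.12.1.
    """
    parts = []
    for segment in ver.split("."):
        match = re.match(r"\d+", segment)
        if match is None:
            break  # non-numeric segment such as 'dev0' ends the release
        parts.append(int(match.group()))
        if match.end() != len(segment):
            break  # suffix such as 'rc1' ends the release
    return tuple(parts)


def _pad(t: tuple, width: int) -> tuple:
    """Right-pad with zeros so '5.12' compares as 5.12.0."""
    return t + (0,) * (width - len(t))


def meets_minimum(ver: str, minimum: tuple = MIN_TRANSFORMERS) -> bool:
    """True if ver's release is >= minimum (lexicographic tuple comparison)."""
    rel = release_tuple(ver)
    width = max(len(rel), len(minimum))
    return _pad(rel, width) >= _pad(minimum, width)


def check_transformers() -> str:
    """Return the installed transformers version, or raise if missing/too old."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        raise RuntimeError("transformers is not installed") from None
    if not meets_minimum(installed):
        need = ".".join(map(str, MIN_TRANSFORMERS))
        raise RuntimeError(f"transformers {installed} found; >={need} required")
    return installed
```

Calling check_transformers() at the top of an inference script, before any model loading, raises a RuntimeError if the installed version is missing or older than the card's minimum.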