THChou1220/gemma-4-e4b-kinetics54K_FFT

VISION | Concurrency Cost: 1 | Model Size: 7.9B | Quant: FP8 | Ctx Length: 32k | Tool Calling: Supported | Published: Jun 27, 2026 | Architecture: Transformer | Cold

THChou1220/gemma-4-e4b-kinetics54K_FFT is a 7.9 billion parameter Gemma-4-e4b model, fully fine-tuned on AI-generated video data derived from Kinetics. It targets video-related instruction following: analyzing video content and generating text responses about it. It was trained on 54,618 video instruction examples.

Model Overview

THChou1220/gemma-4-e4b-kinetics54K_FFT is a 7.9 billion parameter model based on google/gemma-4-e4b-it. It was trained with full fine-tuning rather than LoRA, on a dataset of AI-generated video data. Training targeted video understanding and instruction following.

Key Training Details

  • Dataset: Trained on bear7011/gemma-4-e4b-kinetics_54K, comprising 54,618 video instruction examples.
  • Methodology: Full fine-tuning in bfloat16 precision for 1 epoch (1,366 global steps).
  • Hardware & Optimization: Trained on 4 GPUs using DeepSpeed ZeRO-3, with optimizer state and parameters offloaded to CPU to reduce GPU memory use.
  • Configuration: AdamW optimizer with a learning rate of 5e-6 for both the main model and the projector/image encoder components. Gradient checkpointing was enabled, with a maximum sequence length of 3072 tokens.
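
The figures in the list above can be cross-checked with a little arithmetic. The sketch below is illustrative and not the author's training script: it assumes every example is seen once per epoch and the final partial batch is kept. Under those assumptions, 54,618 examples and 1,366 steps imply a global batch of 40 samples. The split of that batch into a per-GPU micro-batch of 1 and 10 accumulation steps is hypothetical, as are the `pin_memory` values. The DeepSpeed keys themselves are standard configuration keys.

```python
import json
import math

# Figures stated in the model card.
NUM_EXAMPLES = 54_618
NUM_GPUS = 4
REPORTED_STEPS = 1_366

# Hypothetical split of the global batch; the card does not state these.
MICRO_BATCH_PER_GPU = 1
GRAD_ACCUM_STEPS = 10


def steps_per_epoch(num_examples, micro_batch, grad_accum, num_gpus):
    """Optimizer steps for one pass over the data, keeping the last partial batch."""
    global_batch = micro_batch * grad_accum * num_gpus
    return math.ceil(num_examples / global_batch)


# ZeRO-3 with CPU offload of optimizer state and parameters, bf16 enabled.
# Values other than the stage, offload targets, and learning rate are assumptions.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": MICRO_BATCH_PER_GPU,
    "gradient_accumulation_steps": GRAD_ACCUM_STEPS,
    "optimizer": {"type": "AdamW", "params": {"lr": 5e-6}},
}

if __name__ == "__main__":
    steps = steps_per_epoch(
        NUM_EXAMPLES, MICRO_BATCH_PER_GPU, GRAD_ACCUM_STEPS, NUM_GPUS
    )
    print(steps, steps == REPORTED_STEPS)
    print(json.dumps(ds_config, indent=2))
```

Gradient checkpointing is not part of this dictionary; it is normally switched on in the training framework rather than in the DeepSpeed config.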

Primary Use Case

This model is intended for applications that take video content plus an instruction and return text. Because it was trained on video-derived instruction data, candidate tasks include video analysis, recognizing actions within videos, and generating text from video prompts.
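
Video inputs usually have to be reduced to a handful of frames before they reach the model. The following is a minimal sketch of one common approach, uniform frame sampling under a token budget. It is not taken from the model card: the helper names, the 256 tokens per frame, and the 512 tokens reserved for text are all hypothetical, and the 3072 limit is the training sequence length stated above, not a serving limit.

```python
def frame_budget(max_seq_len, tokens_per_frame, text_tokens):
    """Number of frames that fit next to the text within the sequence limit."""
    return max(0, (max_seq_len - text_tokens) // tokens_per_frame)


def sample_frame_indices(total_frames, max_frames):
    """Return up to max_frames indices spread evenly across a clip."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        return list(range(total_frames))
    # Split the clip into max_frames equal segments and take each midpoint.
    segment = total_frames / max_frames
    return [int(segment * i + segment / 2) for i in range(max_frames)]


if __name__ == "__main__":
    # Hypothetical values: 256 tokens per frame, 512 tokens of instruction text.
    budget = frame_budget(max_seq_len=3072, tokens_per_frame=256, text_tokens=512)
    print(budget)                             # 10
    print(sample_frame_indices(300, budget))  # [15, 45, 75, ..., 285]
```

Taking segment midpoints rather than starting at frame 0 avoids over-weighting the opening of the clip, which matters for short action clips where the action sits in the middle.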