chancharikm/sft_caption_generation_20260222_ep6_lr3e5_qwen3-vl-8b

  • Modality: Vision
  • Concurrency Cost: 1
  • Model Size: 8B
  • Quantization: FP8
  • Context Length: 32k
  • Published: Mar 31, 2026
  • License: apache-2.0
  • Architecture: Transformer
  • Tags: Open Weights, Cold

The chancharikm/sft_caption_generation_20260222_ep6_lr3e5_qwen3-vl-8b is an 8 billion parameter vision-language model, fine-tuned from Qwen/Qwen3-VL-8B-Instruct. This model specializes in image caption generation, leveraging its base architecture's 32K token context length. It is optimized for tasks requiring descriptive text output based on visual input.


Overview

This model, chancharikm/sft_caption_generation_20260222_ep6_lr3e5_qwen3-vl-8b, is a fine-tuned iteration of the Qwen3-VL-8B-Instruct base model, developed by chancharikm. It features 8 billion parameters and maintains the original model's substantial 32,768 token context length, making it suitable for processing extensive visual and textual inputs.

Key Capabilities

  • Image Caption Generation: The model has been specifically fine-tuned on the sft_caption_generation_20260222 dataset, indicating its primary strength in generating descriptive captions for images.
  • Vision-Language Understanding: Inherits the multimodal capabilities of the Qwen3-VL-8B-Instruct architecture, allowing it to interpret visual information and produce relevant textual outputs.
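A captioning call can be sketched with the generic `transformers` image-text API. This is a sketch under assumptions: the exact chat template and model class for Qwen3-VL checkpoints may differ (check the base model's card), and the inference branch requires a GPU and the downloaded weights, so it is gated behind an environment variable here.

```python
import os

# Fine-tuned checkpoint discussed in this card.
MODEL_ID = "chancharikm/sft_caption_generation_20260222_ep6_lr3e5_qwen3-vl-8b"

def build_caption_request(image_path: str,
                          prompt: str = "Describe this image in one sentence."):
    """Build the chat-style multimodal messages payload for one image.

    The {"type": "image"} / {"type": "text"} content format follows the
    common transformers chat-template convention (an assumption for this
    specific checkpoint).
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]

if os.environ.get("RUN_VLM_INFERENCE"):  # needs a GPU and the model weights
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, device_map="auto")

    messages = build_caption_request("photo.jpg")
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    # Decode only the newly generated tokens.
    print(processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```

The message-building helper is plain Python and can be reused with any chat-template-based runtime; only the gated branch depends on the assumed model class.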

Training Details

The model underwent supervised fine-tuning (SFT) with a learning rate of 3e-05 over 6 epochs. Training utilized a distributed setup across 8 GPUs, with a total batch size of 128 (achieved with gradient accumulation steps of 2). The optimizer used was adamw_torch_fused with a cosine learning rate scheduler and a warmup ratio of 0.05.
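The hyperparameters above imply a per-device batch size of 8 (128 / (8 GPUs × 2 accumulation steps)). The arithmetic, plus the shape of a cosine schedule with linear warmup, can be sketched as follows; the schedule function is a generic approximation of what cosine schedulers (e.g. in `transformers`) compute, not this run's exact code.

```python
import math

# Effective batch size from the training setup described above.
PER_DEVICE_BATCH = 8      # implied by the other three values
NUM_GPUS = 8
GRAD_ACCUM_STEPS = 2
TOTAL_BATCH = PER_DEVICE_BATCH * NUM_GPUS * GRAD_ACCUM_STEPS  # 128

def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 3e-5, warmup_ratio: float = 0.05) -> float:
    """Cosine decay with linear warmup over the first warmup_ratio of steps."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With warmup_ratio 0.05 the learning rate ramps to 3e-05 over the first 5% of steps, then decays to zero on a cosine curve.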

Intended Use Cases

This model is best suited for applications requiring automated, high-quality image descriptions, such as:

  • Content accessibility (e.g., generating alt text for images).
  • Automated content moderation or tagging.
  • Enhancing searchability of image databases through descriptive captions.
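For the alt-text use case, generated captions must be escaped before embedding in HTML. A minimal sketch, where `caption_fn` is a hypothetical stand-in for a call to the captioning model (any callable mapping an image path to a string works):

```python
import html

def alt_text_img_tag(image_src: str, caption_fn) -> str:
    """Render an <img> tag whose alt text is a model-generated caption.

    caption_fn: callable(image_src) -> str, e.g. a wrapper around the
    captioning model described in this card (hypothetical here).
    """
    caption = caption_fn(image_src).strip()
    # Escape quotes so the caption cannot break out of the attribute.
    return (f'<img src="{html.escape(image_src, quote=True)}" '
            f'alt="{html.escape(caption, quote=True)}">')

# Usage with a stub captioner in place of the real model:
tag = alt_text_img_tag("cat.jpg", lambda _: 'A cat on a "sunny" windowsill')
```

The same escaping pattern applies when feeding captions into tagging or search-index pipelines that store HTML.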