chancharikm/sft_caption_generation_20260222_ep3_lr3e5_qwen3-vl-8b_cam_ready

  • Modality: Vision
  • Model Size: 8B
  • Quantization: FP8
  • Context Length: 32k
  • Published: Mar 27, 2026
  • License: apache-2.0
  • Architecture: Transformer (open weights)

The chancharikm/sft_caption_generation_20260222_ep3_lr3e5_qwen3-vl-8b_cam_ready model is an 8-billion-parameter fine-tune of Qwen3-VL-8B-Instruct, developed by chancharikm. This vision-language model was produced by supervised fine-tuning (SFT) for caption generation: it uses the Qwen3-VL architecture to process visual inputs and generate descriptive text, making it suitable for image understanding applications.


Model Overview

This model, chancharikm/sft_caption_generation_20260222_ep3_lr3e5_qwen3-vl-8b_cam_ready, is a specialized fine-tuned version of the Qwen/Qwen3-VL-8B-Instruct base model. It has 8 billion parameters and is designed for vision-language tasks, specifically focusing on caption generation.

Key Characteristics

  • Base Model: Fine-tuned from Qwen/Qwen3-VL-8B-Instruct, indicating strong multimodal capabilities.
  • Task Focus: Optimized through supervised fine-tuning (SFT) for generating descriptive captions from visual inputs.
  • Training Details:
    • Learning Rate: 3e-05
    • Batch Size: 8 (train and eval)
    • Epochs: 3.0
    • Optimizer: adamw_torch_fused
    • Scheduler: cosine with 0.05 warmup ratio
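The reported hyperparameters map directly onto Hugging Face `transformers.TrainingArguments` field names. The sketch below is illustrative, not the author's actual training script; any settings not listed on the card (e.g. `output_dir`) are assumptions.

```python
# SFT hyperparameters from the model card, expressed as TrainingArguments
# keyword arguments (the field names below are the real transformers ones).
SFT_ARGS = {
    "learning_rate": 3e-5,                 # Learning Rate: 3e-05
    "per_device_train_batch_size": 8,      # Batch Size: 8 (train)
    "per_device_eval_batch_size": 8,       # Batch Size: 8 (eval)
    "num_train_epochs": 3.0,               # Epochs: 3.0
    "optim": "adamw_torch_fused",          # Optimizer
    "lr_scheduler_type": "cosine",         # Scheduler
    "warmup_ratio": 0.05,                  # Warmup ratio
}

# In a training script these would be passed straight through, e.g.:
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="out", **SFT_ARGS)  # output_dir assumed
```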

Intended Use Cases

This model is primarily intended for applications requiring the generation of textual descriptions or captions for images. Its fine-tuning process suggests a focus on accuracy and relevance in visual content summarization.
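A minimal captioning sketch with `transformers` is shown below. The class names and chat-message format follow the usual pattern for the Qwen-VL family, but the exact recommended usage (processor class, generation settings, prompt) is an assumption; check the model card on the Hub before relying on it.

```python
# Hedged sketch: single-image captioning with this checkpoint via transformers.
# The image URL, prompt text, and max_new_tokens value are placeholders.
MODEL_ID = "chancharikm/sft_caption_generation_20260222_ep3_lr3e5_qwen3-vl-8b_cam_ready"


def build_caption_messages(image_url: str, prompt: str = "Describe this image.") -> list:
    """Build a chat-format message list for a single-image captioning request."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def caption_image(image_url: str) -> str:
    """Load the model and generate a caption for one image (assumed API)."""
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, device_map="auto")

    inputs = processor.apply_chat_template(
        build_caption_messages(image_url),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=128)
    # Drop the prompt tokens so only the newly generated caption is decoded.
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]


if __name__ == "__main__":
    print(caption_image("https://example.com/photo.jpg"))  # placeholder URL
```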