cemoss17/nutrient-gram-qwen-3-vl-2b

VISIONConcurrency Cost:1Model Size:2BQuant:BF16Ctx Length:32kPublished:Nov 5, 2025Architecture:Transformer0.0K Cold

The cemoss17/nutrient-gram-qwen-3-vl-2b is a 2 billion parameter vision-language model, fine-tuned from Qwen/Qwen3-VL-2B-Instruct. This model leverages a 32768 token context length and is specifically adapted for instruction-following tasks involving both text and visual inputs. It is designed for applications requiring multimodal understanding and generation, building upon the Qwen3-VL architecture.

Loading preview...

Overview

This model, cemoss17/nutrient-gram-qwen-3-vl-2b, is a fine-tuned variant of the Qwen3-VL-2B-Instruct architecture, developed by cemoss17. It is a 2 billion parameter vision-language model, meaning it can process and generate content based on both textual and visual inputs. The model was trained using the TRL (Transformer Reinforcement Learning) library, indicating a focus on instruction-following capabilities.

Key Capabilities

  • Multimodal Understanding: Processes both text and visual information, enabling it to respond to queries that combine these modalities.
  • Instruction Following: Fine-tuned to accurately follow instructions provided in natural language.
  • Text Generation: Capable of generating coherent and contextually relevant text based on given prompts.

Training Details

The model underwent a Supervised Fine-Tuning (SFT) process. The training utilized specific versions of key frameworks:

  • TRL: 0.24.0
  • Transformers: 4.57.1
  • Pytorch: 2.8.0
  • Datasets: 4.4.1
  • Tokenizers: 0.22.1

Good For

  • Applications requiring a compact yet capable vision-language model.
  • Instruction-based tasks where the model needs to interpret and respond to user commands.
  • Scenarios benefiting from a model fine-tuned for specific instruction-following behaviors.