cemoss17/nutrient-gram-qwen-3-vl-2b
The cemoss17/nutrient-gram-qwen-3-vl-2b is a 2 billion parameter vision-language model, fine-tuned from Qwen/Qwen3-VL-2B-Instruct. This model leverages a 32768 token context length and is specifically adapted for instruction-following tasks involving both text and visual inputs. It is designed for applications requiring multimodal understanding and generation, building upon the Qwen3-VL architecture.
Loading preview...
Overview
This model, cemoss17/nutrient-gram-qwen-3-vl-2b, is a fine-tuned variant of the Qwen3-VL-2B-Instruct architecture, developed by cemoss17. It is a 2 billion parameter vision-language model, meaning it can process and generate content based on both textual and visual inputs. The model was trained using the TRL (Transformer Reinforcement Learning) library, indicating a focus on instruction-following capabilities.
Key Capabilities
- Multimodal Understanding: Processes both text and visual information, enabling it to respond to queries that combine these modalities.
- Instruction Following: Fine-tuned to accurately follow instructions provided in natural language.
- Text Generation: Capable of generating coherent and contextually relevant text based on given prompts.
Training Details
The model underwent a Supervised Fine-Tuning (SFT) process. The training utilized specific versions of key frameworks:
- TRL: 0.24.0
- Transformers: 4.57.1
- Pytorch: 2.8.0
- Datasets: 4.4.1
- Tokenizers: 0.22.1
Good For
- Applications requiring a compact yet capable vision-language model.
- Instruction-based tasks where the model needs to interpret and respond to user commands.
- Scenarios benefiting from a model fine-tuned for specific instruction-following behaviors.