InferenceIllusionist/Mistral-RealworldQA-v0.2-7b-SFT

Text Generation | Concurrency Cost: 1 | Model Size: 7B | Quant: FP8 | Ctx Length: 4k | Published: Apr 18, 2024 | License: apache-2.0 | Architecture: Transformer | Open Weights | Warm

InferenceIllusionist's Mistral-RealworldQA-v0.2-7b-SFT is a 7 billion parameter Mistral-based vision-language model, fine-tuned to reduce hallucinations in Visual Question Answering (VQA) tasks. It specializes in concise image captions and factual answers to visual queries, with a 4096-token context length. It is intended for use cases that need short, factual image descriptions in a drier, less conversational tone.


Model Overview

InferenceIllusionist's Mistral-RealworldQA-v0.2-7b-SFT is a 7 billion parameter vision-language model built on the Mistral-7B-v0.2 base. It is the second iteration in a series of experiments on fine-tuning for image captioning, and it aims to significantly reduce hallucinations in Visual Question Answering (VQA) tasks. It was fine-tuned on the RealWorldQA dataset, originally provided by the xAI team.

Key Capabilities & Characteristics

  • Reduced Hallucinations: Specifically designed to minimize inaccurate outputs when answering questions about images.
  • Concise Output: Provides shorter, less verbose responses to image-related queries, which suits applications with tight token budgets.
  • Vision Functionality: Requires an additional mmproj file for vision capabilities. Quantized (197 MB) and unquantized (596 MB) versions are available, trading quality against VRAM use.
  • Drier Tone: Lacks the conversational prose of other models, offering a more direct and factual communication style.
  • Alpaca Prompt Format: Optimized for best results when using the Alpaca prompt format.
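
As a minimal sketch of the Alpaca prompt format mentioned above, the snippet below wraps a caption request or visual question in the standard no-input Alpaca template. The template wording follows the original Stanford Alpaca release and is assumed, not confirmed, to match what this model saw during fine-tuning. The helper name `build_alpaca_prompt` and the example instructions are hypothetical, and attaching the image itself (via the mmproj file) is left to whatever runtime loads the model.

```python
# Standard Alpaca template (no-input variant). The wording comes from the
# original Stanford Alpaca release; its exact match with this model's
# training prompts is an assumption.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_alpaca_prompt(instruction: str) -> str:
    """Wrap a caption request or visual question in the Alpaca template.

    Hypothetical helper for illustration; not part of the model's tooling.
    """
    instruction = instruction.strip()
    if not instruction:
        raise ValueError("instruction must not be empty")
    return ALPACA_TEMPLATE.format(instruction=instruction)


# Example instructions for the two use cases (illustrative only).
caption_prompt = build_alpaca_prompt("Describe this image in one sentence.")
vqa_prompt = build_alpaca_prompt("How many cars are visible in the image?")
print(vqa_prompt)
```

The prompt ends right after the `### Response:` header so that the model's completion begins on the following line; the resulting string is passed as the text prompt alongside the image.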

Use Cases

This model is best suited for applications requiring:

  • Image Captioning: Generating brief and accurate descriptions for images.
  • Factual VQA: Answering direct questions about visual content where conciseness and accuracy are prioritized over conversational fluency.

Technical Details

The model was fine-tuned from mistral-community/Mistral-7B-v0.2, using Unsloth and Hugging Face's TRL library for faster training.