nnethercott/llava-v1.5-7b-hf-vicuna

Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Context Length: 4K · Published: Feb 23, 2024 · License: llama2 · Architecture: Transformer · Open Weights

nnethercott/llava-v1.5-7b-hf-vicuna is a 7 billion parameter vision-language model, fine-tuned from LLaMA/Vicuna, designed for multimodal instruction-following tasks. It integrates visual understanding with language generation, so it can process and respond to queries that combine text and images. The model is intended primarily for LLM benchmarking, providing a baseline for evaluating multimodal AI performance.


Model Overview

nnethercott/llava-v1.5-7b-hf-vicuna is a 7 billion parameter vision-language model (VLM) derived from llava-hf/llava-1.5-7b-hf. It is an auto-regressive language model built on the transformer architecture and fine-tuned from LLaMA/Vicuna. The model is published primarily to support LLM benchmarking, particularly for tasks that require multimodal understanding.
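
Because the model follows the llava-hf checkpoint layout, it can be loaded with the standard Hugging Face transformers LLaVA classes. Below is a minimal inference sketch, assuming a GPU with enough memory for the 7B weights; the prompt layout follows the LLaVA-1.5 convention, and the image URL and generation settings are placeholders rather than part of the model card.

```python
# Minimal inference sketch (assumes the standard transformers LLaVA API;
# image URL and generation settings are illustrative only).
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "nnethercott/llava-v1.5-7b-hf-vicuna"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision to fit the 7B weights on a single GPU
    device_map="auto",           # requires the accelerate package
)

# LLaVA-1.5 checkpoints in the HF format expect the <image> placeholder
# inside a USER/ASSISTANT turn.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```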

Key Capabilities

  • Multimodal Instruction Following: LLaVA is trained to follow instructions that involve both text and images, enabling it to understand and generate responses in a multimodal context.
  • Vision-Language Integration: It combines the power of large language models with visual processing capabilities, allowing for richer interactions.
  • Benchmarking Foundation: The model is provided to support comprehensive evaluation of multimodal AI systems.

Training Details

The model was fine-tuned on a mixture of datasets:

  • 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
  • 158K GPT-generated multimodal instruction-following data.
  • 450K academic-task-oriented VQA data mixture.
  • 40K ShareGPT data.

Performance Highlights

Evaluated on the Open LLM Leaderboard, nnethercott/llava-v1.5-7b-hf-vicuna achieved an average score of 52.28. Notable scores include:

  • HellaSwag (10-Shot): 76.09
  • AI2 Reasoning Challenge (25-Shot): 52.65
  • MMLU (5-Shot): 51.68
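
These leaderboard tasks are text-only and are typically run with EleutherAI's lm-evaluation-harness. The sketch below shows how one of them could be re-run locally, assuming `pip install lm-eval`; the exact harness revision and task configuration used by the Open LLM Leaderboard may differ, so local scores will only approximate the numbers above.

```python
# Sketch: re-run the HellaSwag 10-shot evaluation with lm-evaluation-harness.
# The leaderboard's exact harness version and task configs may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=nnethercott/llava-v1.5-7b-hf-vicuna",
    tasks=["hellaswag"],
    num_fewshot=10,   # the leaderboard uses 25-shot for ARC and 5-shot for MMLU
    batch_size=8,
)
print(results["results"]["hellaswag"])
```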

Good for

  • Multimodal AI Research: Ideal for researchers and developers exploring vision-language models and their applications.
  • Benchmarking: Suitable for evaluating the performance of LLMs on tasks that require understanding both visual and textual inputs.
  • Instruction-Following Tasks: Can be used for tasks where the model needs to interpret and act upon multimodal instructions.