nnethercott/llava-v1.5-7b_vicuna

Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Ctx Length: 4K · Published: Feb 25, 2024 · License: llama2 · Architecture: Transformer · Open Weights · Cold

nnethercott/llava-v1.5-7b_vicuna is a 7 billion parameter LLaVA model, fine-tuned from liuhaotian/llava-v1.5-7b, designed for multimodal instruction-following tasks. Built on the LLaMA/Vicuna architecture, this auto-regressive language model integrates vision capabilities. It is intended primarily for LLM benchmarking and for applications that require understanding and generating responses from combined image and text inputs.


nnethercott/llava-v1.5-7b_vicuna: A Multimodal Instruction-Following Model

This model is a 7 billion parameter LLaVA variant, fine-tuned from liuhaotian/llava-v1.5-7b, specifically prepared for LLM benchmarking. It is an auto-regressive language model built on the transformer architecture, leveraging the LLaMA/Vicuna base.

Key Capabilities

  • Multimodal Instruction Following: Trained on GPT-generated multimodal instruction-following data, enabling it to process and respond to instructions involving both text and images.
  • Vision Integration: Inherits LLaVA's ability to understand visual inputs, making it suitable for tasks like Visual Question Answering (VQA).
  • Training Data: Fine-tuned using a diverse dataset including 558K filtered image-text pairs, 158K GPT-generated multimodal instruction data, 450K academic-task-oriented VQA data, and 40K ShareGPT data.
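LLaVA-1.5 checkpoints built on Vicuna conventionally format conversations with the Vicuna v1 template, using an `<image>` placeholder to mark where the vision features are spliced into the token stream. A minimal sketch of building such a prompt (the system message and role tags are assumptions taken from upstream LLaVA-1.5 conventions, not stated by this card):

```python
# Sketch: a Vicuna-v1-style multimodal prompt as used by upstream
# LLaVA-1.5. The system message and role tags are assumptions based
# on the LLaVA reference code, not on this model card.

SYSTEM = (
    "A chat between a curious human and an artificial intelligence "
    "assistant. The assistant gives helpful, detailed, and polite "
    "answers to the human's questions."
)

def build_prompt(question: str, with_image: bool = True) -> str:
    """Format one user turn; '<image>' marks where vision features go."""
    user = f"<image>\n{question}" if with_image else question
    return f"{SYSTEM} USER: {user} ASSISTANT:"

print(build_prompt("What is shown in this picture?"))
```

The model's reply is then generated as a continuation after the trailing `ASSISTANT:` tag; text-only turns simply omit the `<image>` placeholder.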

Benchmarking Performance

Evaluated on the Open LLM Leaderboard, the model achieved an average score of 52.28. Notable scores include 76.09 on HellaSwag (10-shot) and 72.06 on Winogrande (5-shot), indicating proficiency in common sense reasoning. Its MMLU (5-shot) score is 51.68, and GSM8k (5-shot) is 15.31.
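The Open LLM Leaderboard (v1) average is a mean over six tasks: ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8k. Since only four scores are quoted above, the combined contribution of the two unlisted tasks can be backed out from the average; a quick arithmetic check, assuming the standard six-task mean:

```python
# Back out the combined score of the two unlisted tasks (ARC and
# TruthfulQA) from the reported six-task leaderboard average,
# given the four scores quoted above.
average = 52.28  # reported Open LLM Leaderboard average
listed = {
    "HellaSwag (10-shot)": 76.09,
    "Winogrande (5-shot)": 72.06,
    "MMLU (5-shot)":       51.68,
    "GSM8k (5-shot)":      15.31,
}
num_tasks = 6  # leaderboard v1 evaluates six benchmarks
remaining = average * num_tasks - sum(listed.values())
print(f"ARC + TruthfulQA combined: {remaining:.2f}")  # → 98.54
```

This is only a consistency check on the reported average, not a substitute for the per-task leaderboard entries.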

Good For

  • LLM Benchmarking: Ideal for evaluating multimodal capabilities and instruction following in a 7B parameter model.
  • Research and Development: Useful for exploring multimodal AI applications based on the LLaVA architecture.
  • Multimodal Understanding Tasks: Applicable for tasks that require processing and generating responses from combined visual and textual information.