Model Overview
nnethercott/llava-v1.5-7b-hf-vicuna is a 7-billion-parameter vision-language model (VLM) derived from llava-hf/llava-1.5-7b-hf. It is an auto-regressive transformer language model fine-tuned from LLaMA/Vicuna. The model is released primarily to support LLM benchmarking, particularly on tasks that require multimodal understanding.
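As a LLaVA-1.5 derivative, the model expects the Vicuna-style chat template, in which an `<image>` placeholder marks where the processor injects visual tokens. A minimal sketch of the prompt construction (the helper name is illustrative, not part of the model's API):

```python
def build_llava_prompt(question: str) -> str:
    """Build a Vicuna-style LLaVA-1.5 prompt string.

    The <image> placeholder is expanded into visual tokens by the
    LLaVA processor at inference time; the model then generates the
    assistant turn after "ASSISTANT:".
    """
    return f"USER: <image>\n{question} ASSISTANT:"

prompt = build_llava_prompt("What is shown in this image?")
print(prompt)
```

At inference time this string would be passed, together with a PIL image, to the matching Hugging Face processor for the checkpoint.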
Key Capabilities
- Multimodal Instruction Following: LLaVA is trained to follow instructions that involve both text and images, enabling it to understand and generate responses in a multimodal context.
- Vision-Language Integration: It pairs a Vicuna-based language model with a vision encoder, projecting visual features into the language model's embedding space so that images and text can be reasoned over jointly.
- Benchmarking Foundation: The model is provided to support comprehensive evaluation of multimodal AI systems.
Training Details
The model was fine-tuned on a data mixture comprising:
- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 450K academic-task-oriented VQA data mixture.
- 40K ShareGPT data.
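Taken together, the mixture above comes to roughly 1.2M fine-tuning examples. A quick tally (source names abbreviated for readability):

```python
# Approximate sizes of the fine-tuning data sources, in thousands of examples,
# as listed in the model card above.
mixture_k = {
    "LAION/CC/SBU (BLIP-captioned pairs)": 558,
    "GPT-generated instruction data": 158,
    "Academic VQA mixture": 450,
    "ShareGPT": 40,
}

total_k = sum(mixture_k.values())
print(f"~{total_k}K examples total")  # roughly 1.2M examples
```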
Performance Highlights
Evaluated on the Open LLM Leaderboard, nnethercott/llava-v1.5-7b-hf-vicuna achieved an average score of 52.28 across the leaderboard's benchmark suite. Individual scores include:
- HellaSwag (10-Shot): 76.09
- AI2 Reasoning Challenge (25-Shot): 52.65
- MMLU (5-Shot): 51.68
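The leaderboard average is an unweighted mean over the full benchmark suite, not just the tasks listed here. For illustration, the mean of only the three listed scores (a subset, so it differs from the reported full-suite average of 52.28) can be computed as:

```python
# Scores for the three benchmarks listed above (a subset of the full suite).
listed_scores = {
    "HellaSwag (10-shot)": 76.09,
    "ARC (25-shot)": 52.65,
    "MMLU (5-shot)": 51.68,
}

# Unweighted (macro) mean over the listed subset only.
subset_mean = sum(listed_scores.values()) / len(listed_scores)
print(f"{subset_mean:.2f}")  # 60.14 for the subset vs. 52.28 for the full suite
```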
Good for
- Multimodal AI Research: Ideal for researchers and developers exploring vision-language models and their applications.
- Benchmarking: Suitable for evaluating the performance of LLMs on tasks that require understanding both visual and textual inputs.
- Instruction-Following Tasks: Can be used for tasks where the model needs to interpret and act upon multimodal instructions.