nnethercott/llava-v1.5-7b_vicuna: A Multimodal Instruction-Following Model
This model is a 7-billion-parameter LLaVA variant, fine-tuned from liuhaotian/llava-v1.5-7b and packaged for LLM benchmarking. It is an auto-regressive transformer language model built on the LLaMA/Vicuna base.
Key Capabilities
- Multimodal Instruction Following: Trained on GPT-generated multimodal instruction-following data, enabling it to process and respond to instructions involving both text and images.
- Vision Integration: Inherits LLaVA's ability to understand visual inputs, making it suitable for tasks like Visual Question Answering (VQA).
- Training Data: Fine-tuned using a diverse dataset including 558K filtered image-text pairs, 158K GPT-generated multimodal instruction data, 450K academic-task-oriented VQA data, and 40K ShareGPT data.
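For instruction following at inference time, LLaVA v1.5 models expect a Vicuna-style conversation template with an `<image>` placeholder marking where image tokens are inserted. The sketch below shows that format; the system message matches the template used in the upstream LLaVA repository, but treat its exact wording as an assumption for this particular fine-tune.

```python
# Sketch of the Vicuna-style single-turn prompt used by LLaVA v1.5.
# The system message is the upstream default; verify it against this
# fine-tune before relying on it.
DEFAULT_SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    "questions."
)

def build_llava_prompt(question: str, system: str = DEFAULT_SYSTEM) -> str:
    """Format a single-turn VQA prompt; <image> marks the image-token slot."""
    return f"{system} USER: <image>\n{question} ASSISTANT:"

prompt = build_llava_prompt("What is shown in this image?")
print(prompt)
```

At inference, this prompt string would typically be passed, together with the image, to the model's processor and then to `generate`; the model completes the text after `ASSISTANT:`.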
Benchmarking Performance
Evaluated on the Open LLM Leaderboard, the model achieved an average score of 52.28. Individual results:
- HellaSwag (10-shot): 76.09
- Winogrande (5-shot): 72.06
- MMLU (5-shot): 51.68
- GSM8k (5-shot): 15.31

The strong HellaSwag and Winogrande results indicate solid commonsense reasoning.
Good For
- LLM Benchmarking: Ideal for evaluating multimodal capabilities and instruction following in a 7B parameter model.
- Research and Development: Useful for exploring multimodal AI applications based on the LLaVA architecture.
- Multimodal Understanding Tasks: Suited to tasks that combine visual and textual inputs into a single generated response.