xtuner/llava-llama-3-8b-v1_1
  • Task: Text Generation
  • Source: Hugging Face
  • Model Size: 8B
  • Quantization: FP8
  • Context Length: 8k
  • Architecture: Transformer
  • Concurrency Cost: 1
  • Published: Apr 22, 2024

The xtuner/llava-llama-3-8b-v1_1 model is an 8-billion-parameter LLaVA-based vision-language model developed by XTuner. It combines Meta-Llama-3-8B-Instruct as its language backbone with the CLIP-ViT-Large-patch14-336 vision encoder, and was fine-tuned on the ShareGPT4V-PT and InternVL-SFT datasets. The model is designed for multimodal tasks, integrating visual understanding with language capabilities, and shows improved performance on visual question answering and other multimodal benchmarks.


Overview

xtuner/llava-llama-3-8b-v1_1 is a multimodal large language model (MLLM) developed by XTuner, built upon the LLaVA architecture. It leverages the robust Meta-Llama-3-8B-Instruct as its language backbone and CLIP-ViT-Large-patch14-336 for visual encoding. The model underwent fine-tuning using a combination of the ShareGPT4V-PT and InternVL-SFT datasets, enhancing its ability to understand and generate responses based on both image and text inputs.

Key Capabilities & Differentiators

  • Vision-Language Integration: Seamlessly processes and understands information from both images and text.
  • Enhanced Training Data: Utilizes larger and more diverse pretraining (ShareGPT4V-PT, 1.2M samples) and fine-tuning (InternVL-SFT, 1.2M samples) datasets compared to its predecessor, LLaVA-Llama-3-8B.
  • Performance Improvements: Demonstrates notable gains across several multimodal benchmarks, including MMBench Test (EN/CN), CCBench Dev, AI2D Test, and MMStar, indicating improved visual reasoning and comprehension.
  • LLaVA Format: Provided in XTuner's LLaVA format, with additional Hugging Face and GGUF format models available for broader compatibility.

Use Cases

This model is particularly well-suited for applications requiring:

  • Visual Question Answering (VQA): Answering questions about the content of images.
  • Image Captioning: Generating descriptive text for images.
  • Multimodal Chatbots: Creating interactive agents that can understand and respond to visual cues.
  • Multimodal Reasoning: Tasks that involve complex reasoning over visual and textual information.
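For VQA-style use, the input interleaves an image placeholder with the question text. The sketch below assembles a single-turn prompt using Llama 3 style special tokens and the `<image>` placeholder common to LLaVA models; the exact template is an assumption for illustration, and in practice it should come from the model's tokenizer/processor rather than being hard-coded:

```python
# Hypothetical sketch: build a LLaVA-style single-turn prompt for a Llama 3
# backbone. The special tokens follow the Llama 3 chat template; the real
# template should be taken from the model's processor, so treat this as an
# illustration of the input structure, not the authoritative format.

def build_llava_prompt(question: str) -> str:
    """Return a single-turn VQA prompt with an <image> placeholder."""
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        "<image>\n"  # the projected vision-encoder output is spliced in here
        f"{question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_llava_prompt("What is shown in this picture?")
print(prompt)
```

The same structure extends to multi-turn chat by appending further user/assistant blocks, which is how the model serves as the core of a multimodal chatbot.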