xtuner/llava-llama-3-8b

Status: Warm
Visibility: Public
Parameters: 8B
Quantization: FP8
Context length: 8192
Released: Apr 22, 2024
Source: Hugging Face
Overview

xtuner/llava-llama-3-8b: A Multimodal LLaVA Model

xtuner/llava-llama-3-8b is a multimodal large language model (LLM) developed by XTuner. It pairs Meta-Llama-3-8B-Instruct with the CLIP-ViT-Large-patch14-336 visual encoder and is fine-tuned on the LLaVA-Pretrain and LLaVA-Instruct datasets, enabling it to understand and generate responses grounded in both text and images.

Key Capabilities & Features

  • Multimodal Understanding: Combines the reasoning abilities of Llama 3 with visual perception from CLIP, allowing it to interpret images and answer questions about their content.
  • Enhanced Performance: Demonstrates improved scores across a range of multimodal benchmarks, including MMBench, CCBench, MMMU, SEED-IMG, and GQA, when compared to its predecessor, LLaVA-v1.5-7B.
  • XTuner LLaVA Format: Released in XTuner's native LLaVA format, with companion checkpoints in the HuggingFace LLaVA format for use with the transformers library.
  • Flexible Fine-tuning Strategy: Uses a "Full LLM, LoRA ViT" recipe: the language model is fully fine-tuned while the vision encoder is adapted with lightweight LoRA layers, balancing output quality against training cost.
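Since the model builds on Meta-Llama-3-8B-Instruct, queries follow the Llama-3 chat template, with an image placeholder where the processor splices in vision tokens. A minimal sketch of that prompt layout (the exact special tokens and the `<image>` placeholder convention are assumptions drawn from the Llama-3-Instruct template and common LLaVA processors; check the model card for the canonical format):

```python
# Hypothetical sketch of a LLaVA-Llama-3 single-turn VQA prompt.
# "<image>" marks where the image-patch embeddings are inserted by the
# multimodal processor; the surrounding tokens follow the Llama-3 chat template.
def build_llava_prompt(question: str) -> str:
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"<image>\n{question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_llava_prompt("What is shown in this image?")
print(prompt)
```

With the HuggingFace-format checkpoints, a string like this (plus the image) would typically be passed to a `transformers` processor and a LLaVA-class model for generation; the XTuner-format weights are instead consumed by XTuner's own chat tooling.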

Good for

  • Visual Question Answering (VQA): Excels at tasks requiring the model to answer questions based on provided images.
  • Multimodal Chatbots: Ideal for applications where conversational agents need to understand and respond to queries involving visual context.
  • Image Captioning & Analysis: Can be used for generating descriptions of images or extracting information from visual data.
  • Research and Development: Provides a strong baseline for further research into multimodal AI, especially within the LLaVA framework.