xtuner/llava-llama-3-8b-v1_1
  • Task: Text Generation
  • Source: Hugging Face
  • Model Size: 8B
  • Quantization: FP8
  • Context Length: 8k
  • Architecture: Transformer
  • Concurrency Cost: 1
  • Published: Apr 22, 2024

The xtuner/llava-llama-3-8b-v1_1 model is an 8-billion-parameter LLaVA-based vision-language model developed by XTuner. It combines Meta-Llama-3-8B-Instruct as its language backbone with the CLIP-ViT-Large-patch14-336 vision encoder, and was fine-tuned on the ShareGPT4V-PT and InternVL-SFT datasets. The model is designed for multimodal tasks, integrating visual understanding with language capabilities, and shows improved performance on visual question answering and other multimodal benchmarks.


Overview

xtuner/llava-llama-3-8b-v1_1 is a multimodal large language model (MLLM) developed by XTuner, built upon the LLaVA architecture. It leverages the robust Meta-Llama-3-8B-Instruct as its language backbone and CLIP-ViT-Large-patch14-336 for visual encoding. The model underwent fine-tuning using a combination of the ShareGPT4V-PT and InternVL-SFT datasets, enhancing its ability to understand and generate responses based on both image and text inputs.

Key Capabilities & Differentiators

  • Vision-Language Integration: Seamlessly processes and understands information from both images and text.
  • Enhanced Training Data: Utilizes larger and more diverse pretraining (ShareGPT4V-PT, 1.2M samples) and fine-tuning (InternVL-SFT, 1.2M samples) datasets compared to its predecessor, LLaVA-Llama-3-8B.
  • Performance Improvements: Demonstrates notable gains across several multimodal benchmarks, including MMBench Test (EN/CN), CCBench Dev, AI2D Test, and MMStar, indicating improved visual reasoning and comprehension.
  • LLaVA Format: Provided in XTuner's LLaVA format, with additional Hugging Face and GGUF format models available for broader compatibility.

Use Cases

This model is particularly well-suited for applications requiring:

  • Visual Question Answering (VQA): Answering questions about the content of images.
  • Image Captioning: Generating descriptive text for images.
  • Multimodal Chatbots: Creating interactive agents that can understand and respond to visual cues.
  • Multimodal Reasoning: Tasks that involve complex reasoning over visual and textual information.
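For VQA-style use, the input interleaves an image placeholder with the question text. The sketch below assembles a single-turn prompt using Llama 3 style special tokens and the `<image>` placeholder common to LLaVA models; the exact template is an assumption for illustration, and in practice it should come from the model's tokenizer/processor rather than being hard-coded:

```python
# Hypothetical sketch: build a LLaVA-style single-turn prompt for a Llama 3
# backbone. The special tokens follow the Llama 3 chat template; the real
# template should be taken from the model's processor, so treat this as an
# illustration of the input structure, not the authoritative format.

def build_llava_prompt(question: str) -> str:
    """Return a single-turn VQA prompt with an <image> placeholder."""
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        "<image>\n"  # the projected vision-encoder output is spliced in here
        f"{question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_llava_prompt("What is shown in this picture?")
print(prompt)
```

The same structure extends to multi-turn chat by appending further user/assistant blocks, which is how the model serves as the core of a multimodal chatbot.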