Zhengyi/LLaMA-Mesh

Parameters: 8B
Precision: FP8
Context length: 32768 tokens
License: llama3.1
Overview

LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models

LLaMA-Mesh is an 8-billion-parameter model, fine-tuned from Meta's Llama 3.1, that builds 3D mesh generation directly into a large language model. Developed by Zhengyi Wang et al., with the fine-tuning carried out at NVIDIA, it sidesteps the usual difficulty of feeding 3D data to an LLM by writing vertex coordinates and face definitions out as plain text, so meshes fit the existing tokenizer without any vocabulary expansion. As a result, the model can draw on the spatial knowledge already present in textual sources and generate 3D meshes in conversation.
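To make the "mesh as plain text" idea concrete, here is a minimal sketch of one way to serialize a mesh. It assumes an OBJ-style layout ("v x y z" for vertices, "f i j k" for faces with 1-based indices); the card itself only says that coordinates and faces are written as text, so the exact format, the function name mesh_to_text, and the tetrahedron data are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Vertex = Tuple[float, float, float]
Face = Tuple[int, int, int]  # 0-based vertex indices


def mesh_to_text(vertices: Sequence[Vertex], faces: Sequence[Face]) -> str:
    """Serialize a triangle mesh as OBJ-style text lines.

    Assumption: an OBJ-like layout is used here for illustration only.
    """
    lines: List[str] = []
    for x, y, z in vertices:
        # ":g" drops trailing zeros, so 1.0 is written as "1".
        lines.append(f"v {x:g} {y:g} {z:g}")
    for a, b, c in faces:
        # OBJ face indices are 1-based, so shift each index by one.
        lines.append(f"f {a + 1} {b + 1} {c + 1}")
    return "\n".join(lines)


# Hypothetical example data: a tetrahedron with 4 vertices and 4 faces.
tetra_vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
tetra_faces = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]

mesh_text = mesh_to_text(tetra_vertices, tetra_faces)
print(mesh_text)
```

Because the result is an ordinary string, it can be placed in a prompt or a training target like any other text, which is the property the approach relies on.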

Key Capabilities

  • Unified Text and 3D Mesh Processing: Represents 3D mesh data (vertex coordinates, face definitions) as plain text, allowing LLMs to process both modalities within a single framework.
  • Conversational 3D Generation: Enables generating 3D meshes from text prompts and producing interleaved text and 3D mesh outputs.
  • 3D Mesh Understanding: Interprets 3D meshes supplied as text input, in addition to generating them.
  • Performance: Achieves 3D mesh generation quality on par with models trained from scratch, while preserving strong text generation capabilities.
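Since the model can return prose and mesh data interleaved in one reply, a client has to separate the two. The sketch below shows one possible way to do that, assuming the mesh portion uses OBJ-style "v" and "f" lines; the function name extract_mesh and the sample reply are hypothetical and are not taken from the model's actual output.

```python
from typing import List, Tuple

Vertex = Tuple[float, float, float]
Face = Tuple[int, int, int]


def extract_mesh(response: str) -> Tuple[List[Vertex], List[Face], List[str]]:
    """Split a reply into vertices, 0-based faces, and remaining prose lines."""
    vertices: List[Vertex] = []
    faces: List[Face] = []
    prose: List[str] = []
    for line in response.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] in ("v", "f"):
            try:
                if parts[0] == "v":
                    x, y, z = (float(p) for p in parts[1:])
                    vertices.append((x, y, z))
                else:
                    # Convert 1-based OBJ indices to 0-based indices.
                    a, b, c = (int(p) - 1 for p in parts[1:])
                    faces.append((a, b, c))
                continue
            except ValueError:
                # Looked like mesh data but was not numeric: treat as prose.
                pass
        if line.strip():
            prose.append(line.strip())
    return vertices, faces, prose


# Hypothetical reply mixing text and a one-triangle mesh.
reply = (
    "Here is a simple triangle:\n"
    "v 0 0 0\n"
    "v 1 0 0\n"
    "v 0 1 0\n"
    "f 1 2 3\n"
    "Let me know if you want changes."
)
verts, tris, text_lines = extract_mesh(reply)
print(len(verts), len(tris), text_lines)
```

Falling back to prose when a line fails to parse keeps the splitter tolerant of sentences that happen to start with "v" or "f".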

Training and Data

The model is trained with supervised fine-tuning on a subset of the Objaverse dataset: 30,000 meshes with fewer than 500 faces each, converted into text strings. This training teaches the model the spatial knowledge needed to generate 3D meshes in a text-based format.
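The data preparation described above can be sketched roughly as follows: keep only meshes under the face limit and pair each mesh string with a text prompt. This is an illustration under assumptions, not the authors' pipeline; the OBJ-style input, the prompt template, the caption keys, and the names count_faces and build_examples are all hypothetical.

```python
from typing import Dict, List

MAX_FACES = 500  # the card states "fewer than 500 faces"


def count_faces(obj_text: str) -> int:
    """Count face lines in OBJ-style text (assumed format)."""
    return sum(1 for line in obj_text.splitlines() if line.startswith("f "))


def build_examples(meshes: Dict[str, str], max_faces: int = MAX_FACES) -> List[Dict[str, str]]:
    """Turn {caption: mesh_text} into prompt/response pairs, dropping large meshes."""
    examples: List[Dict[str, str]] = []
    for caption, obj_text in meshes.items():
        if count_faces(obj_text) < max_faces:
            examples.append(
                {
                    # Hypothetical prompt template, not taken from the card.
                    "prompt": f"Create a 3D model of {caption}.",
                    "response": obj_text,
                }
            )
    return examples


# Two synthetic meshes: one tiny, one with exactly 500 faces (filtered out).
header = ["v 0 0 0", "v 1 0 0", "v 0 1 0"]
small_mesh = "\n".join(header + ["f 1 2 3"])
dense_mesh = "\n".join(header + ["f 1 2 3"] * 500)

examples = build_examples({"a triangle": small_mesh, "a dense blob": dense_mesh})
print(len(examples), examples[0]["prompt"])
```

Filtering on face count keeps each serialized mesh short, which matters when every vertex and face consumes tokens from the model's context window.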

Licensing

LLaMA-Mesh is distributed under the NSCLv1 License for non-commercial use and incorporates components of Llama 3.1 technology, subject to the Llama 3.1 Community License Agreement.