facebook/layerskip-llama3-8B
Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Context Length: 8k · Published: Sep 7, 2024 · License: FAIR · Architecture: Transformer

facebook/layerskip-llama3-8B is an 8 billion parameter Llama 3 model from Facebook, continually pretrained with LayerSkip. It is designed for self-speculative decoding: tokens are drafted by exiting early from the model's first layers and then verified by the remaining layers. This yields significant speedups in token generation over standard autoregressive decoding, making the model well suited to applications requiring high-throughput text generation.


Overview

LayerSkip's continual pretraining makes the model's earlier layers produce usable predictions on their own. This enables self-speculative decoding: a cheap early-exit pass through a subset of layers drafts candidate tokens, and a single pass through the full model verifies them, accepting every draft token that matches what the full model would have generated. Because the draft and the verifier share one set of weights, token generation is faster without compromising output quality.
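The draft-then-verify loop can be sketched in miniature. This is a toy illustration, not the LayerSkip implementation: the "full model" and the "early-exit draft" are stand-in functions over integer tokens, with the draft occasionally disagreeing with the full model. The key property is that the output is identical to plain greedy decoding with the full model, only produced with fewer full-model calls when drafts are accepted.

```python
# Toy sketch of self-speculative decoding (illustration only).
# full_model_next / draft_next are hypothetical stand-ins for a full
# forward pass and an early-exit forward pass, respectively.

def full_model_next(tokens):
    # Stand-in for a pass through all layers: a deterministic rule.
    return (tokens[-1] * 31 + 7) % 100

def draft_next(tokens):
    # Stand-in for an early-exit pass: cheaper, and usually (but not
    # always) agreeing with the full model.
    guess = (tokens[-1] * 31 + 7) % 100
    return guess if tokens[-1] % 5 else (guess + 1) % 100

def self_speculative_decode(prompt, n_new, k=4):
    """Draft k tokens with the cheap early-exit pass, then verify them
    in order against the full model; keep the verified prefix plus one
    corrected token on the first mismatch."""
    tokens = list(prompt)
    target = len(prompt) + n_new
    while len(tokens) < target:
        # 1) Draft phase: propose k tokens with the early-exit model.
        draft, ctx = [], tokens[:]
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify phase: accept matching drafts, stop at first mismatch.
        for t in draft:
            expected = full_model_next(tokens)
            if t == expected:
                tokens.append(t)         # draft accepted
            else:
                tokens.append(expected)  # draft rejected: use full model's token
                break                    # restart drafting from here
            if len(tokens) == target:
                break
    return tokens[len(prompt):]

print(self_speculative_decode([3], 6))  # identical to greedy full-model decoding
```

In the real model, the verify phase checks all k drafts in one batched forward pass, which is where the speedup comes from.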

Key Capabilities

  • Self-Speculative Decoding: Utilizes earlier layers to propose tokens and later layers to validate, significantly boosting inference speed.
  • Optimized Performance: Benchmarks report self-speculative decoding speedups of up to 1.8x over autoregressive decoding on some tasks (e.g., 47.43 tokens/sec vs. 31.84 tokens/sec, roughly 1.5x, on an A100 GPU).
  • Memory Efficiency: Optimized implementations (available in the LayerSkip codebase and gpt-fast integration) re-use weights and KV cache of earlier layers, avoiding extra memory consumption.
  • Integration: Usable via Hugging Face Transformers (with a draft-model approach), the dedicated LayerSkip codebase, and a specialized branch of PyTorch's gpt-fast that adds further optimizations such as torch.compile() and quantization.

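The memory-efficiency point above can also be sketched: the early-exit draft "model" is just a view over the first k layers of the full model, so no layer weights are duplicated (and in the real implementation the draft's KV cache is likewise reused by the verifier). The classes below are hypothetical stand-ins, not the actual model structure.

```python
# Toy sketch of the weight sharing behind LayerSkip's early-exit draft
# (illustration only; Layer/TinyModel are hypothetical stand-ins).

class Layer:
    """Stand-in for a transformer block holding (shared) weights."""
    def __init__(self, scale):
        self.scale = scale

    def forward(self, x):
        return x * self.scale

class TinyModel:
    def __init__(self, layers):
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

full = TinyModel([Layer(s) for s in (2, 3, 5, 7)])

# The draft model reuses the *same* layer objects as the full model,
# not copies, so it consumes no extra weight memory.
early_exit = 2
draft = TinyModel(full.layers[:early_exit])

print(full.forward(1))    # passes through all four layers
print(draft.forward(1))   # exits after the first two layers
print(all(a is b for a, b in zip(draft.layers, full.layers)))  # shared objects
```

Wrapping the same layer objects, rather than cloning them, is what the gpt-fast integration's "re-use weights and KV cache" claim amounts to at the structural level.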
Good For

  • High-throughput text generation: Ideal for applications where rapid token generation is critical.
  • Research and development in efficient inference: Provides a practical implementation of self-speculative decoding for experimentation and integration.
  • Developers seeking faster Llama 3 deployments: Offers a performance-optimized variant of the Llama 3 8B model.