facebook/layerskip-llama3-8B
Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Context Length: 8k · Published: Sep 7, 2024 · License: FAIR · Architecture: Transformer

facebook/layerskip-llama3-8B is an 8 billion parameter Llama 3 model from Facebook, continually pretrained with LayerSkip. It is designed for self-speculative decoding: tokens are drafted by exiting early from the model's first layers and then verified by the remaining layers. This yields significant speedups in token generation over standard autoregressive decoding, making the model well suited to applications requiring high-throughput text generation.


Overview

LayerSkip's continual pretraining makes the model's earlier layers produce usable predictions on their own. This enables self-speculative decoding: a cheap early-exit pass through a subset of layers drafts candidate tokens, and a single pass through the full model verifies them, accepting every draft token that matches what the full model would have generated. Because the draft and the verifier share one set of weights, token generation is faster without compromising output quality.
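The draft-then-verify loop can be sketched in miniature. This is a toy illustration, not the LayerSkip implementation: the "full model" and the "early-exit draft" are stand-in functions over integer tokens, with the draft occasionally disagreeing with the full model. The key property is that the output is identical to plain greedy decoding with the full model, only produced with fewer full-model calls when drafts are accepted.

```python
# Toy sketch of self-speculative decoding (illustration only).
# full_model_next / draft_next are hypothetical stand-ins for a full
# forward pass and an early-exit forward pass, respectively.

def full_model_next(tokens):
    # Stand-in for a pass through all layers: a deterministic rule.
    return (tokens[-1] * 31 + 7) % 100

def draft_next(tokens):
    # Stand-in for an early-exit pass: cheaper, and usually (but not
    # always) agreeing with the full model.
    guess = (tokens[-1] * 31 + 7) % 100
    return guess if tokens[-1] % 5 else (guess + 1) % 100

def self_speculative_decode(prompt, n_new, k=4):
    """Draft k tokens with the cheap early-exit pass, then verify them
    in order against the full model; keep the verified prefix plus one
    corrected token on the first mismatch."""
    tokens = list(prompt)
    target = len(prompt) + n_new
    while len(tokens) < target:
        # 1) Draft phase: propose k tokens with the early-exit model.
        draft, ctx = [], tokens[:]
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify phase: accept matching drafts, stop at first mismatch.
        for t in draft:
            expected = full_model_next(tokens)
            if t == expected:
                tokens.append(t)         # draft accepted
            else:
                tokens.append(expected)  # draft rejected: use full model's token
                break                    # restart drafting from here
            if len(tokens) == target:
                break
    return tokens[len(prompt):]

print(self_speculative_decode([3], 6))  # identical to greedy full-model decoding
```

In the real model, the verify phase checks all k drafts in one batched forward pass, which is where the speedup comes from.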

Key Capabilities

  • Self-Speculative Decoding: Utilizes earlier layers to propose tokens and later layers to validate, significantly boosting inference speed.
  • Optimized Performance: Benchmarks report self-speculative decoding speedups of up to 1.8x over autoregressive decoding on some tasks (e.g., 47.43 tokens/sec vs. 31.84 tokens/sec, roughly 1.5x, on an A100 GPU).
  • Memory Efficiency: Optimized implementations (available in the LayerSkip codebase and gpt-fast integration) re-use weights and KV cache of earlier layers, avoiding extra memory consumption.
  • Integration: Usable via Hugging Face Transformers (with a draft-model approach), the dedicated LayerSkip codebase, and a specialized branch of PyTorch's gpt-fast that adds further optimizations such as torch.compile() and quantization.

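The memory-efficiency point above can also be sketched: the early-exit draft "model" is just a view over the first k layers of the full model, so no layer weights are duplicated (and in the real implementation the draft's KV cache is likewise reused by the verifier). The classes below are hypothetical stand-ins, not the actual model structure.

```python
# Toy sketch of the weight sharing behind LayerSkip's early-exit draft
# (illustration only; Layer/TinyModel are hypothetical stand-ins).

class Layer:
    """Stand-in for a transformer block holding (shared) weights."""
    def __init__(self, scale):
        self.scale = scale

    def forward(self, x):
        return x * self.scale

class TinyModel:
    def __init__(self, layers):
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

full = TinyModel([Layer(s) for s in (2, 3, 5, 7)])

# The draft model reuses the *same* layer objects as the full model,
# not copies, so it consumes no extra weight memory.
early_exit = 2
draft = TinyModel(full.layers[:early_exit])

print(full.forward(1))    # passes through all four layers
print(draft.forward(1))   # exits after the first two layers
print(all(a is b for a, b in zip(draft.layers, full.layers)))  # shared objects
```

Wrapping the same layer objects, rather than cloning them, is what the gpt-fast integration's "re-use weights and KV cache" claim amounts to at the structural level.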
Good For

  • High-throughput text generation: Ideal for applications where rapid token generation is critical.
  • Research and development in efficient inference: Provides a practical implementation of self-speculative decoding for experimentation and integration.
  • Developers seeking faster Llama 3 deployments: Offers a performance-optimized variant of the Llama 3 8B model.