Overview
The facebook/layerskip-llama2-13B is a 13-billion-parameter Llama 2 model from Facebook, continually pretrained with the LayerSkip recipe. Its core innovation is self-speculative decoding: the model drafts tokens using only its earlier layers, then verifies those drafts with the remaining layers, so no separate draft model is needed. Because most drafted tokens are accepted cheaply, this reduces per-token compute and significantly improves inference speed during generation.
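The draft-then-verify loop described above can be illustrated with a minimal, self-contained sketch. Here the "early-exit" and "full" models are opaque next-token callables (an illustrative assumption; in LayerSkip they share one set of weights, the first E layers versus all layers), and verification is shown token by token rather than in the single batched forward pass a real implementation would use:

```python
def self_speculative_decode(prefix, n_tokens, early_model, full_model, draft_len=4):
    """Greedy self-speculative decoding sketch.

    `early_model` and `full_model` each map a token sequence to the next
    greedy token. The output always equals plain greedy decoding with
    `full_model`; the early model only affects speed, never correctness.
    """
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        # 1. Draft: cheaply propose draft_len tokens with the early-exit model.
        draft = []
        for _ in range(draft_len):
            draft.append(early_model(out + draft))
        # 2. Verify: the full model checks each drafted token. Matching tokens
        #    are accepted; at the first mismatch we emit the full model's own
        #    token and re-draft. (Real implementations verify all drafted
        #    tokens in one batched forward pass and reuse the KV cache.)
        for tok in draft:
            full_tok = full_model(out)
            out.append(full_tok)
            if full_tok != tok or len(out) - len(prefix) >= n_tokens:
                break
    return out[len(prefix):]
```

When the early model agrees with the full model on most tokens, each verification pass accepts several tokens at once, which is where the speedup comes from.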
Key Capabilities
- Self-Speculative Decoding: Uses an early-exit mechanism to accelerate token generation; reported benchmarks show up to 43.64 tokens/sec versus 28.38 tokens/sec for standard autoregressive decoding on an A100 GPU.
- Optimized Implementations: A Hugging Face implementation is provided, and further-optimized versions are available in the dedicated LayerSkip codebase and a gpt-fast branch. These optimized versions avoid extra memory consumption by reusing the model weights and KV cache between the draft and verify stages.
- Llama2 Architecture: Built upon the Llama2 13B foundation, inheriting its general language understanding and generation capabilities.
Good For
- High-Speed Inference: Ideal for applications where rapid text generation is critical, thanks to its self-speculative decoding feature.
- Resource-Efficient Deployment: The optimized LayerSkip and gpt-fast implementations offer memory and computational efficiency for deploying large language models.
- Research and Development: Provides a practical example and codebase for exploring and implementing early exit and speculative decoding techniques in LLMs.