RedHatAI/Sparse-Llama-3.1-8B-2of4

Text generation · Model size: 8B · Quantization: FP8 · Context length: 32k · Published: Nov 20, 2024 · License: llama3.1 · Architecture: Transformer

The RedHatAI/Sparse-Llama-3.1-8B-2of4 model, developed by Neural Magic, is an 8-billion-parameter Llama-3.1 variant optimized with 2:4 semi-structured sparsity. This optimization delivers a 2x reduction in model size and compute requirements while maintaining high accuracy: it achieves 98.37% accuracy recovery on the OpenLLM benchmark and 97.3% on the Mosaic Eval Gauntlet relative to its dense counterpart, making it well suited to efficient, scalable AI deployments.


Sparse-Llama-3.1-8B-2of4: Efficient Llama-3.1 with 2:4 Sparsity

This model is a specialized version of the Llama-3.1-8B architecture, developed by Neural Magic, featuring 2:4 semi-structured sparsity. This optimization significantly reduces the model's size and computational demands by pruning two out of every four weights in its linear operators, leading to a 2x reduction in resource usage.

Key Optimizations and Performance

The model was pruned with an optimized SparseGPT approach via LLM-Compressor, then trained on 13 billion tokens of knowledge distillation using the SquareHead method to recover accuracy. This process keeps the sparse model's performance close to that of its dense equivalent:

  • Accuracy Recovery: Achieves 98.37% accuracy recovery on the OpenLLM benchmark (average score 62.16 vs. 63.19 for dense) and 97.3% on the Mosaic Eval Gauntlet (average score 53.85 vs. 55.34 for dense).
  • Efficiency: The 2:4 sparsity pattern allows for more efficient inference and deployment, making it suitable for cost-sensitive or resource-constrained environments.
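To make the 2:4 pattern concrete, here is a minimal NumPy sketch of magnitude-based 2:4 pruning: in every contiguous group of four weights, the two smallest-magnitude entries are zeroed. This is an illustration of the sparsity pattern only; the actual SparseGPT pruning used for this model also updates the surviving weights to minimize the layer's output error.

```python
import numpy as np

def prune_2of4(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude weights in every contiguous
    group of four. Illustrates the 2:4 semi-structured pattern;
    real pruning (e.g. SparseGPT) also compensates the kept weights."""
    flat = weights.reshape(-1, 4)
    # Indices of the two smallest |w| within each group of four.
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]
    mask = np.ones_like(flat, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (flat * mask).reshape(weights.shape)

w = np.array([[0.9, -0.1, 0.05, -0.8],
              [0.2,  0.3, -0.01,  0.4]])
sparse_w = prune_2of4(w)
# Every group of four now contains exactly two nonzeros, which is
# what lets sparse tensor cores skip half the multiply-accumulates.
```

Because the nonzeros are constrained to fixed-size groups rather than scattered arbitrarily, hardware such as NVIDIA's sparse tensor cores can exploit the pattern directly, which is where the 2x compute reduction comes from.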

Deployment and Use Cases

Designed for efficient deployment, this model is compatible with the vLLM backend, which also exposes an OpenAI-compatible server. It provides a solid foundation for developers looking to:

  • Reduce deployment costs for Llama-3.1-8B applications.
  • Improve inference performance on GPUs.
  • Create highly optimized versions of large language models for enterprise needs without significant accuracy loss.
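As a minimal deployment sketch, the model can be served with vLLM's OpenAI-compatible server and queried over the standard chat completions endpoint. The commands below assume a local GPU host with vLLM installed; the port and generation parameters are illustrative defaults, not values prescribed by the model card.

```shell
# Launch an OpenAI-compatible server for the sparse model
# (requires vLLM installed on a CUDA-capable host).
vllm serve RedHatAI/Sparse-Llama-3.1-8B-2of4 --port 8000

# Query it with the standard OpenAI chat completions API.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Sparse-Llama-3.1-8B-2of4",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```

Because the server speaks the OpenAI API, existing OpenAI client libraries can be pointed at it by changing only the base URL, which keeps migration costs low when swapping the dense model for this sparse variant.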