neuralmagic/SparseLlama-3-8B-pruned_50.2of4
The SparseLlama-3-8B-pruned_50.2of4 model by Neural Magic is an 8-billion-parameter Llama 3 variant that has been pruned to 2:4 (N:M) sparsity using SparseGPT and further retrained with SquareHead knowledge distillation. The model maintains a high level of accuracy relative to the original Llama-3-8B while offering the benefits of semi-structured sparsity for faster and more memory-efficient inference. It is designed for use cases where optimized deployment with reduced computational overhead is critical.
SparseLlama-3-8B-pruned_50.2of4: A Sparsified Llama 3 Model
This model, developed by Neural Magic, is an 8-billion-parameter variant of the Meta-Llama-3-8B architecture. It has undergone a two-stage optimization process to achieve 2:4 (N:M) semi-structured sparsity.
Key Optimization & Capabilities
- Sparsity: The model was initially pruned in one-shot using SparseGPT to achieve 2:4 sparsity.
- Knowledge Distillation: It was then retrained using SquareHead knowledge distillation, maintaining the sparsity mask to recover performance.
- Performance: Despite its sparsity, the model retains most of the original Llama-3-8B's accuracy, with an average accuracy recovery of 97.68% on the Open LLM Leaderboard benchmarks and 94.22% on the Mosaic Eval Gauntlet.
- Inference Optimization: Designed to leverage its semi-structured sparsity for faster inference and lower memory usage, particularly when deployed with specialized runtimes such as nm-vllm.
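To make the 2:4 pattern concrete, here is a minimal NumPy sketch (an illustration, not the SparseGPT algorithm itself, which uses Hessian-based saliency rather than plain magnitude) of magnitude-based 2:4 pruning: in every contiguous group of 4 weights, the 2 smallest-magnitude entries are zeroed, and a checker verifies the resulting mask.

```python
import numpy as np

def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude weights in each contiguous group of 4.

    A toy stand-in for 2:4 pruning; SparseGPT selects weights via a
    second-order (Hessian-based) criterion instead of raw magnitude.
    """
    rows, cols = weights.shape
    assert cols % 4 == 0, "column count must be divisible by the group size 4"
    groups = weights.reshape(rows, cols // 4, 4)
    # Indices of the 2 smallest |w| in each group of 4 -> zero them out.
    idx = np.argsort(np.abs(groups), axis=-1)[..., :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, idx, 0.0, axis=-1)
    return pruned.reshape(rows, cols)

def is_2_of_4_sparse(weights: np.ndarray) -> bool:
    """True iff every contiguous group of 4 has at most 2 nonzero entries."""
    groups = weights.reshape(weights.shape[0], -1, 4)
    return bool(((groups != 0).sum(axis=-1) <= 2).all())

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))       # dense toy weight matrix
w_sparse = prune_2_of_4(w)
print(is_2_of_4_sparse(w), is_2_of_4_sparse(w_sparse))  # False True
```

This semi-structured layout is what allows sparse tensor-core kernels to skip the pruned weights at a fixed, predictable stride, which is why 2:4 sparsity translates into real speedups where fully unstructured 50% sparsity often does not.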
Benchmarks
Compared to the original Meta-Llama-3-8B, this sparse model shows competitive performance:
- Open LLM Leaderboard Average Accuracy: 60.72% (vs. 62.16% for base Llama-3-8B)
- Mosaic Eval Gauntlet Average Accuracy: 51.54% (vs. 54.70% for base Llama-3-8B)
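The recovery percentages quoted earlier follow directly from these numbers; recovery is simply the sparse model's average accuracy divided by the dense baseline's:

```python
# Accuracy recovery = sparse accuracy / dense baseline accuracy,
# using the benchmark averages reported above.
open_llm_recovery = 60.72 / 62.16   # Open LLM Leaderboard
gauntlet_recovery = 51.54 / 54.70   # Mosaic Eval Gauntlet
print(f"{open_llm_recovery:.2%}")   # 97.68%
print(f"{gauntlet_recovery:.2%}")   # 94.22%
```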
Ideal Use Cases
- Efficient Deployment: Suitable for applications requiring optimized inference, reduced memory footprint, and faster execution on compatible hardware.
- Resource-Constrained Environments: Beneficial for scenarios where computational resources are limited, but high-quality language generation is still needed.
- Research in Sparsity: Provides a practical example of applying advanced pruning and distillation techniques to large language models.
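For deployment, a minimal loading sketch with Hugging Face `transformers` is shown below. This is a generic causal-LM loading pattern, not instructions from Neural Magic; note that stock `transformers` will run the checkpoint but will not exploit the 2:4 sparsity for speedups, which requires a sparsity-aware runtime such as nm-vllm on compatible hardware.

```python
# Sketch: load the checkpoint with the standard transformers API.
# Stock transformers executes the sparse weights as dense tensors;
# the sparsity speedup requires a runtime like nm-vllm.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "neuralmagic/SparseLlama-3-8B-pruned_50.2of4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

inputs = tokenizer("Semi-structured sparsity is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```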