neuralmagic/SparseLlama-3-8B-pruned_50.2of4

Source: Hugging Face
Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Architecture: Transformer · Status: Warm

The SparseLlama-3-8B-pruned_50.2of4 model by Neural Magic is an 8 billion parameter Llama 3 variant pruned to 2:4 (N:M) sparsity with SparseGPT and then retrained with SquareHead knowledge distillation. It retains most of the original Llama-3-8B's accuracy while its semi-structured sparsity allows potentially faster and more memory-efficient inference. It is designed for use cases where optimized deployment with reduced computational overhead is critical.
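
For a quick functional check, the checkpoint loads like any other Llama 3 model with Hugging Face transformers. A minimal sketch (note that stock transformers runs dense kernels, so the 2:4 zeros give no speedup here; a sparsity-aware runtime such as nm-vllm is needed for that):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "neuralmagic/SparseLlama-3-8B-pruned_50.2of4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Dense execution: the pruned weights are simply zeros in a regular tensor.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("2:4 semi-structured sparsity means", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```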


SparseLlama-3-8B-pruned_50.2of4: A Sparsified Llama 3 Model

This model, developed by Neural Magic, is an 8 billion parameter variant of the Meta-Llama-3-8B architecture. It has undergone a two-stage optimization process to achieve 2:4 (N:M) semi-structured sparsity.
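
For intuition, 2:4 (N:M) semi-structured sparsity means that in every contiguous group of four weights, exactly two are zero, a pattern NVIDIA's sparse tensor cores can accelerate. The sketch below is a hypothetical magnitude-based illustration of the pattern only; SparseGPT itself selects the mask to minimize layer-wise reconstruction error, not by raw magnitude:

```python
import torch

def prune_2_of_4(weight: torch.Tensor) -> torch.Tensor:
    """Illustrative 2:4 pruning: zero the 2 smallest-magnitude weights
    in every group of 4 (SparseGPT's actual mask selection is error-based)."""
    w = weight.reshape(-1, 4)
    keep = w.abs().topk(2, dim=-1).indices  # 2 largest-magnitude entries per group
    mask = torch.zeros_like(w, dtype=torch.bool).scatter_(-1, keep, True)
    return (w * mask).reshape(weight.shape)

w = torch.randn(8, 8)
pruned = prune_2_of_4(w)
assert (pruned.reshape(-1, 4) == 0).sum(dim=-1).min() >= 2  # >= 2 zeros per group of 4
```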

Key Optimization & Capabilities

  • Sparsity: The model was first pruned in one shot using SparseGPT to 2:4 sparsity, so two of every four consecutive weights are zero (illustrated in the sketch above).
  • Knowledge Distillation: It was then retrained with SquareHead knowledge distillation, keeping the sparsity mask fixed, to recover accuracy (a rough sketch of the per-layer loss follows this list).
  • Performance: Despite the pruning, the model retains most of the original Llama-3-8B's accuracy, with an average accuracy recovery of 97.68% on the Open LLM Leaderboard benchmarks and 94.22% on the Mosaic Eval Gauntlet.
  • Inference Optimization: The semi-structured sparsity enables faster inference and lower memory usage, particularly when deployed with a sparsity-aware runtime such as nm-vllm (see the deployment sketch below).
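
SquareHead comes from Neural Magic's sparse fine-tuning work, which pairs a logit-distillation term with per-layer feature losses between the sparse student and the dense teacher. The sketch below shows one plausible form of the per-layer term, an MSE normalized by the teacher's activation magnitude; the helper name `squarehead_layer_loss` and the exact normalization are assumptions, not the published implementation.

```python
import torch
import torch.nn.functional as F

def squarehead_layer_loss(student_h: torch.Tensor, teacher_h: torch.Tensor) -> torch.Tensor:
    """Hypothetical per-layer feature-distillation term (not the official code).

    MSE between student and teacher hidden states, normalized by the
    teacher's mean squared activation so layers with large activations
    do not dominate the total loss.
    """
    return F.mse_loss(student_h, teacher_h) / (teacher_h.pow(2).mean() + 1e-6)

# The total loss averages this term over all transformer layers and adds
# a standard KL-divergence term on the output logits (omitted here).
```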
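
To realize the speedups in practice, nm-vllm (Neural Magic's fork of vLLM) ships sparsity-aware kernels. A minimal sketch, assuming nm-vllm is installed and that its `sparsity` argument accepts `"semi_structured_sparse_w16a16"` for 2:4 checkpoints (the accepted values may differ across releases, so check your version's documentation):

```python
from vllm import LLM, SamplingParams  # nm-vllm installs under the vllm namespace

# sparsity= is an nm-vllm extension; the value below is assumed from its
# docs for 2:4 semi-structured checkpoints and may vary by release.
llm = LLM(
    model="neuralmagic/SparseLlama-3-8B-pruned_50.2of4",
    sparsity="semi_structured_sparse_w16a16",
)
params = SamplingParams(temperature=0.0, max_tokens=64)
out = llm.generate(["The benefits of 2:4 sparsity are"], params)
print(out[0].outputs[0].text)
```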

Benchmarks

Compared to the original Meta-Llama-3-8B, this sparse model shows competitive performance:

  • Open LLM Leaderboard Average Accuracy: 60.72% (vs. 62.16% for base Llama-3-8B)
  • Mosaic Eval Gauntlet Average Accuracy: 51.54% (vs. 54.70% for base Llama-3-8B)
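
The recovery figures quoted earlier follow directly from these averages: 60.72 / 62.16 ≈ 97.68% on the Open LLM Leaderboard and 51.54 / 54.70 ≈ 94.22% on the Mosaic Eval Gauntlet.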

Ideal Use Cases

  • Efficient Deployment: Suitable for applications requiring optimized inference, reduced memory footprint, and faster execution on compatible hardware.
  • Resource-Constrained Environments: Beneficial for scenarios where computational resources are limited, but high-quality language generation is still needed.
  • Research in Sparsity: Provides a practical example of applying advanced pruning and distillation techniques to large language models.