RedHatAI/Sparse-Llama-3.1-8B-2of4

Text generation · Model size: 8B · Quantization: FP8 · Context length: 32k · Published: Nov 20, 2024 · License: llama3.1 · Architecture: Transformer

The RedHatAI/Sparse-Llama-3.1-8B-2of4 model, developed by Neural Magic, is an 8-billion-parameter Llama-3.1 variant optimized with 2:4 semi-structured sparsity. This optimization delivers a 2x reduction in model size and compute requirements while maintaining high accuracy: it achieves 98.37% accuracy recovery on the OpenLLM benchmark and 97.3% on the Mosaic Eval Gauntlet relative to its dense counterpart, making it well suited to efficient, scalable AI deployments.


Sparse-Llama-3.1-8B-2of4: Efficient Llama-3.1 with 2:4 Sparsity

This model is a specialized version of the Llama-3.1-8B architecture, developed by Neural Magic, featuring 2:4 semi-structured sparsity. This optimization significantly reduces the model's size and computational demands by pruning two out of every four weights in its linear operators, leading to a 2x reduction in resource usage.

Key Optimizations and Performance

The model was pruned with an optimized SparseGPT approach via LLM-Compressor, then trained on 13 billion tokens of knowledge distillation using the SquareHead method to recover accuracy. This process keeps the sparse model's performance close to that of its dense equivalent:

  • Accuracy Recovery: Achieves 98.37% accuracy recovery on the OpenLLM benchmark (average score 62.16 vs. 63.19 for dense) and 97.3% on the Mosaic Eval Gauntlet (average score 53.85 vs. 55.34 for dense).
  • Efficiency: The 2:4 sparsity pattern allows for more efficient inference and deployment, making it suitable for cost-sensitive or resource-constrained environments.
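To make the 2:4 pattern concrete, here is a minimal NumPy sketch of magnitude-based 2:4 pruning: in every contiguous group of four weights, the two smallest-magnitude entries are zeroed. This is an illustration of the sparsity pattern only; the actual SparseGPT pruning used for this model also updates the surviving weights to minimize the layer's output error.

```python
import numpy as np

def prune_2of4(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude weights in every contiguous
    group of four. Illustrates the 2:4 semi-structured pattern;
    real pruning (e.g. SparseGPT) also compensates the kept weights."""
    flat = weights.reshape(-1, 4)
    # Indices of the two smallest |w| within each group of four.
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]
    mask = np.ones_like(flat, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (flat * mask).reshape(weights.shape)

w = np.array([[0.9, -0.1, 0.05, -0.8],
              [0.2,  0.3, -0.01,  0.4]])
sparse_w = prune_2of4(w)
# Every group of four now contains exactly two nonzeros, which is
# what lets sparse tensor cores skip half the multiply-accumulates.
```

Because the nonzeros are constrained to fixed-size groups rather than scattered arbitrarily, hardware such as NVIDIA's sparse tensor cores can exploit the pattern directly, which is where the 2x compute reduction comes from.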

Deployment and Use Cases

Designed for efficient deployment, this model is compatible with the vLLM backend, which also exposes an OpenAI-compatible server. It provides a solid foundation for developers looking to:

  • Reduce deployment costs for Llama-3.1-8B applications.
  • Improve inference performance on GPUs.
  • Create highly optimized versions of large language models for enterprise needs without significant accuracy loss.
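As a minimal deployment sketch, the model can be served with vLLM's OpenAI-compatible server and queried over the standard chat completions endpoint. The commands below assume a local GPU host with vLLM installed; the port and generation parameters are illustrative defaults, not values prescribed by the model card.

```shell
# Launch an OpenAI-compatible server for the sparse model
# (requires vLLM installed on a CUDA-capable host).
vllm serve RedHatAI/Sparse-Llama-3.1-8B-2of4 --port 8000

# Query it with the standard OpenAI chat completions API.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "RedHatAI/Sparse-Llama-3.1-8B-2of4",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```

Because the server speaks the OpenAI API, existing OpenAI client libraries can be pointed at it by changing only the base URL, which keeps migration costs low when swapping the dense model for this sparse variant.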