neuralmagic/SparseLlama-3-8B-pruned_50.2of4

Source: Hugging Face
Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Architecture: Transformer · Status: Warm

The SparseLlama-3-8B-pruned_50.2of4 model by Neural Magic is an 8 billion parameter Llama 3 variant pruned to 2:4 (N:M) sparsity with SparseGPT and then retrained with SquareHead knowledge distillation. It retains most of the original Llama-3-8B's accuracy while its semi-structured sparsity allows potentially faster and more memory-efficient inference. It is designed for use cases where optimized deployment with reduced computational overhead is critical.
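
For a quick functional check, the checkpoint loads like any other Llama 3 model with Hugging Face transformers. A minimal sketch (note that stock transformers runs dense kernels, so the 2:4 zeros give no speedup here; a sparsity-aware runtime such as nm-vllm is needed for that):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "neuralmagic/SparseLlama-3-8B-pruned_50.2of4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Dense execution: the pruned weights are simply zeros in a regular tensor.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("2:4 semi-structured sparsity means", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```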


SparseLlama-3-8B-pruned_50.2of4: A Sparsified Llama 3 Model

This model, developed by Neural Magic, is an 8 billion parameter variant of the Meta-Llama-3-8B architecture. It has undergone a two-stage optimization process to achieve 2:4 (N:M) semi-structured sparsity.
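
For intuition, 2:4 (N:M) semi-structured sparsity means that in every contiguous group of four weights, exactly two are zero, a pattern NVIDIA's sparse tensor cores can accelerate. The sketch below is a hypothetical magnitude-based illustration of the pattern only; SparseGPT itself selects the mask to minimize layer-wise reconstruction error, not by raw magnitude:

```python
import torch

def prune_2_of_4(weight: torch.Tensor) -> torch.Tensor:
    """Illustrative 2:4 pruning: zero the 2 smallest-magnitude weights
    in every group of 4 (SparseGPT's actual mask selection is error-based)."""
    w = weight.reshape(-1, 4)
    keep = w.abs().topk(2, dim=-1).indices  # 2 largest-magnitude entries per group
    mask = torch.zeros_like(w, dtype=torch.bool).scatter_(-1, keep, True)
    return (w * mask).reshape(weight.shape)

w = torch.randn(8, 8)
pruned = prune_2_of_4(w)
assert (pruned.reshape(-1, 4) == 0).sum(dim=-1).min() >= 2  # >= 2 zeros per group of 4
```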

Key Optimization & Capabilities

  • Sparsity: The model was first pruned in one shot using SparseGPT to 2:4 sparsity, so two of every four consecutive weights are zero (illustrated in the sketch above).
  • Knowledge Distillation: It was then retrained with SquareHead knowledge distillation, keeping the sparsity mask fixed, to recover accuracy (a rough sketch of the per-layer loss follows this list).
  • Performance: Despite the pruning, the model retains most of the original Llama-3-8B's accuracy, with an average accuracy recovery of 97.68% on the Open LLM Leaderboard benchmarks and 94.22% on the Mosaic Eval Gauntlet.
  • Inference Optimization: The semi-structured sparsity enables faster inference and lower memory usage, particularly when deployed with a sparsity-aware runtime such as nm-vllm (see the deployment sketch below).
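
SquareHead comes from Neural Magic's sparse fine-tuning work, which pairs a logit-distillation term with per-layer feature losses between the sparse student and the dense teacher. The sketch below shows one plausible form of the per-layer term, an MSE normalized by the teacher's activation magnitude; the helper name `squarehead_layer_loss` and the exact normalization are assumptions, not the published implementation.

```python
import torch
import torch.nn.functional as F

def squarehead_layer_loss(student_h: torch.Tensor, teacher_h: torch.Tensor) -> torch.Tensor:
    """Hypothetical per-layer feature-distillation term (not the official code).

    MSE between student and teacher hidden states, normalized by the
    teacher's mean squared activation so layers with large activations
    do not dominate the total loss.
    """
    return F.mse_loss(student_h, teacher_h) / (teacher_h.pow(2).mean() + 1e-6)

# The total loss averages this term over all transformer layers and adds
# a standard KL-divergence term on the output logits (omitted here).
```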
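
To realize the speedups in practice, nm-vllm (Neural Magic's fork of vLLM) ships sparsity-aware kernels. A minimal sketch, assuming nm-vllm is installed and that its `sparsity` argument accepts `"semi_structured_sparse_w16a16"` for 2:4 checkpoints (the accepted values may differ across releases, so check your version's documentation):

```python
from vllm import LLM, SamplingParams  # nm-vllm installs under the vllm namespace

# sparsity= is an nm-vllm extension; the value below is assumed from its
# docs for 2:4 semi-structured checkpoints and may vary by release.
llm = LLM(
    model="neuralmagic/SparseLlama-3-8B-pruned_50.2of4",
    sparsity="semi_structured_sparse_w16a16",
)
params = SamplingParams(temperature=0.0, max_tokens=64)
out = llm.generate(["The benefits of 2:4 sparsity are"], params)
print(out[0].outputs[0].text)
```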

Benchmarks

Compared to the original Meta-Llama-3-8B, this sparse model shows competitive performance:

  • Open LLM Leaderboard Average Accuracy: 60.72% (vs. 62.16% for base Llama-3-8B)
  • Mosaic Eval Gauntlet Average Accuracy: 51.54% (vs. 54.70% for base Llama-3-8B)
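
The recovery figures quoted earlier follow directly from these averages: 60.72 / 62.16 ≈ 97.68% on the Open LLM Leaderboard and 51.54 / 54.70 ≈ 94.22% on the Mosaic Eval Gauntlet.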

Ideal Use Cases

  • Efficient Deployment: Suitable for applications requiring optimized inference, reduced memory footprint, and faster execution on compatible hardware.
  • Resource-Constrained Environments: Beneficial for scenarios where computational resources are limited, but high-quality language generation is still needed.
  • Research in Sparsity: Provides a practical example of applying advanced pruning and distillation techniques to large language models.