neuralmagic/Sparse-Llama-3.1-8B-ultrachat_200k-2of4
Hugging Face
Text Generation · Model Size: 8B · Quantization: FP8 · Context Length: 32k · License: llama3.1 · Architecture: Transformer

The Neural Magic Sparse-Llama-3.1-8B-ultrachat_200k-2of4 is an 8 billion parameter Llama-3.1-based conversational AI model, optimized with 2:4 sparsity. Fine-tuned on the ultrachat_200k dataset, it achieves a 61.1 AlpacaEval score, demonstrating 98.5% accuracy recovery compared to its dense counterpart. This model is designed for efficient multi-turn conversational applications, leveraging sparsity for optimized deployment.


Model Overview

The Sparse-Llama-3.1-8B-ultrachat_200k-2of4 is a multi-turn conversational AI model developed by Neural Magic. It is based on the Llama-3.1-8B architecture and has been fine-tuned on the ultrachat_200k dataset, featuring a context length of 32768 tokens.

Key Optimizations and Performance

A core differentiator of this model is its 2:4 sparsity pattern. This means that within each group of four weights in the transformer blocks' linear operators, two are retained while two are pruned. This optimization is inherited from its parent model, Sparse-Llama-3.1-8B-2of4.
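To make the 2:4 pattern concrete, here is a minimal NumPy sketch (not Neural Magic's actual pruning code) that zeroes the two smallest-magnitude weights in every group of four, which is the structural constraint 2:4 sparsity imposes:

```python
import numpy as np

def prune_2of4(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude weights in each group of four."""
    w = weights.reshape(-1, 4).copy()
    # indices of the two smallest-|w| entries per group of four
    idx = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, idx, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.arange(1.0, 9.0)        # [1, 2, 3, 4, 5, 6, 7, 8]
sparse = prune_2of4(w)         # keeps the two largest of each four
```

In practice this structured pattern matters because NVIDIA Ampere-class GPUs can accelerate 2:4-sparse matrix multiplications in hardware, unlike unstructured sparsity.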

Despite this significant sparsity, the model shows strong performance recovery. On the AlpacaEval benchmark (version 1), it achieved a win rate of 61.1. This nearly matches the fine-tuned dense model Llama-3.1-8B-ultrachat_200k, which scored 62.0, corresponding to a 98.5% accuracy recovery under sparsity.
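The 98.5% figure is simply the ratio of the sparse score to the dense baseline, as this quick check shows:

```python
sparse_score = 61.1  # AlpacaEval win rate, 2:4-sparse model
dense_score = 62.0   # AlpacaEval win rate, dense Llama-3.1-8B-ultrachat_200k
recovery = sparse_score / dense_score * 100
print(f"{recovery:.1f}%")  # → 98.5%
```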

Use Cases and Deployment

This model is primarily designed for multi-turn conversational AI applications. Its sparsity optimization makes it suitable for efficient deployment, particularly with backends like vLLM, which supports OpenAI-compatible serving. The model's ability to maintain high accuracy while being sparse offers advantages for resource-constrained environments or applications requiring faster inference.
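As a sketch of such a deployment, the commands below use vLLM's standard OpenAI-compatible server; the flags and port shown are illustrative defaults, not settings specified by the model card:

```shell
# Serve the model behind an OpenAI-compatible API (requires `pip install vllm`)
vllm serve neuralmagic/Sparse-Llama-3.1-8B-ultrachat_200k-2of4 \
  --max-model-len 32768

# Query it with the standard chat completions endpoint (default port 8000)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "neuralmagic/Sparse-Llama-3.1-8B-ultrachat_200k-2of4",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```

Because the endpoint is OpenAI-compatible, existing OpenAI client libraries can be pointed at the local base URL without code changes.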