neuralmagic/Sparse-Llama-3.1-8B-ultrachat_200k-2of4
Hugging Face
Text Generation · Model Size: 8B · Quantization: FP8 · Context Length: 32k · License: llama3.1 · Architecture: Transformer

The Neural Magic Sparse-Llama-3.1-8B-ultrachat_200k-2of4 is an 8 billion parameter Llama-3.1-based conversational AI model, optimized with 2:4 sparsity. Fine-tuned on the ultrachat_200k dataset, it achieves a 61.1 AlpacaEval score, demonstrating 98.5% accuracy recovery compared to its dense counterpart. This model is designed for efficient multi-turn conversational applications, leveraging sparsity for optimized deployment.


Model Overview

The Sparse-Llama-3.1-8B-ultrachat_200k-2of4 is a multi-turn conversational AI model developed by Neural Magic. It is based on the Llama-3.1-8B architecture and has been fine-tuned on the ultrachat_200k dataset, featuring a context length of 32768 tokens.

Key Optimizations and Performance

A core differentiator of this model is its 2:4 sparsity pattern. This means that within each group of four weights in the transformer blocks' linear operators, two are retained while two are pruned. This optimization is inherited from its parent model, Sparse-Llama-3.1-8B-2of4.
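To make the 2:4 pattern concrete, here is a minimal NumPy sketch (not Neural Magic's actual pruning code) that zeroes the two smallest-magnitude weights in every group of four, which is the structural constraint 2:4 sparsity imposes:

```python
import numpy as np

def prune_2of4(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude weights in each group of four."""
    w = weights.reshape(-1, 4).copy()
    # indices of the two smallest-|w| entries per group of four
    idx = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, idx, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.arange(1.0, 9.0)        # [1, 2, 3, 4, 5, 6, 7, 8]
sparse = prune_2of4(w)         # keeps the two largest of each four
```

In practice this structured pattern matters because NVIDIA Ampere-class GPUs can accelerate 2:4-sparse matrix multiplications in hardware, unlike unstructured sparsity.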

Despite this significant sparsity, the model shows strong performance recovery. On the AlpacaEval benchmark (version 1), it achieved a win rate of 61.1. This nearly matches the fine-tuned dense model Llama-3.1-8B-ultrachat_200k, which scored 62.0, corresponding to a 98.5% accuracy recovery under sparsity.
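The 98.5% figure is simply the ratio of the sparse score to the dense baseline, as this quick check shows:

```python
sparse_score = 61.1  # AlpacaEval win rate, 2:4-sparse model
dense_score = 62.0   # AlpacaEval win rate, dense Llama-3.1-8B-ultrachat_200k
recovery = sparse_score / dense_score * 100
print(f"{recovery:.1f}%")  # → 98.5%
```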

Use Cases and Deployment

This model is primarily designed for multi-turn conversational AI applications. Its sparsity optimization makes it suitable for efficient deployment, particularly with backends like vLLM, which supports OpenAI-compatible serving. The model's ability to maintain high accuracy while being sparse offers advantages for resource-constrained environments or applications requiring faster inference.
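As a sketch of such a deployment, the commands below use vLLM's standard OpenAI-compatible server; the flags and port shown are illustrative defaults, not settings specified by the model card:

```shell
# Serve the model behind an OpenAI-compatible API (requires `pip install vllm`)
vllm serve neuralmagic/Sparse-Llama-3.1-8B-ultrachat_200k-2of4 \
  --max-model-len 32768

# Query it with the standard chat completions endpoint (default port 8000)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "neuralmagic/Sparse-Llama-3.1-8B-ultrachat_200k-2of4",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```

Because the endpoint is OpenAI-compatible, existing OpenAI client libraries can be pointed at the local base URL without code changes.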