RedHatAI/Llama-2-7b-pruned70-retrained

Text generation · Concurrency cost: 1 · Model size: 7B · Quantization: FP8 · Context length: 4K · Published: Mar 15, 2024 · Architecture: Transformer

The RedHatAI/Llama-2-7b-pruned70-retrained model is a 7 billion parameter Llama 2 variant developed by Neural Magic and Cerebras. This model has undergone significant pruning, achieving 70% sparsity, and was subsequently retrained on 150 billion tokens from SlimPajama. It is optimized for efficient deployment and fine-tuning through sparse transfer, offering a balance between performance and computational cost.


Overview

RedHatAI/Llama-2-7b-pruned70-retrained is a 7 billion parameter model based on the Llama 2 architecture, developed by Neural Magic and Cerebras. It distinguishes itself through its high sparsity, achieved by pruning parameters in one-shot passes with SparseGPT followed by extensive retraining. First, 50% of the parameters were pruned and the model was retrained on 50 billion tokens from SlimPajama while the sparsity mask was maintained. It was then pruned further to 70% sparsity and trained on an additional 100 billion tokens.
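The two-stage prune-and-retrain schedule can be sketched in plain Python. The snippet below uses simple magnitude pruning as a stand-in for SparseGPT (which selects weights using second-order information), so the pruning criterion, the toy weight vector, and the helper names are illustrative assumptions rather than the actual training code:

```python
import random

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    Simplified stand-in for SparseGPT, which uses second-order
    information instead of raw magnitudes.
    """
    k = int(len(weights) * sparsity)          # number of weights to zero
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

def sparsity_of(weights):
    return sum(1 for w in weights if w == 0.0) / len(weights)

random.seed(0)
weights = [random.gauss(0, 1) for _ in range(1000)]  # toy "model"

# Stage 1: one-shot prune to 50% sparsity, then retrain (~50B tokens)
weights = magnitude_prune(weights, 0.50)
# ... retraining would happen here, preserving the zero mask ...

# Stage 2: prune further to 70%, train on another ~100B tokens
weights = magnitude_prune(weights, 0.70)
print(f"final sparsity: {sparsity_of(weights):.0%}")  # → final sparsity: 70%
```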

Key Capabilities

  • High Sparsity: Achieves 70% parameter sparsity, enabling more efficient inference and deployment.
  • Retrained Performance: Despite significant pruning, the model was retrained on 150 billion tokens (50B + 100B) from SlimPajama to maintain and recover performance.
  • Sparse Transfer: Designed to leverage its pre-sparsified structure for efficient fine-tuning on new data, reducing training times and computational costs.
  • Accelerated Inference: Compatible with specialized inference engines like nm-vllm and deepsparse for optimized performance.
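To give a rough sense of why 70% unstructured sparsity matters for deployment, the back-of-the-envelope sketch below compares dense FP16 storage against a hypothetical nonzeros-plus-bitmask layout. The storage formats and byte counts are illustrative assumptions; engines such as deepsparse and nm-vllm use their own optimized sparse formats and kernels:

```python
def dense_gb(n_params, bytes_per_param=2.0):
    """Dense FP16 storage in GB (2 bytes per weight)."""
    return n_params * bytes_per_param / 1e9

def bitmask_sparse_gb(n_params, sparsity, bytes_per_param=2.0):
    """Store only nonzero values plus a 1-bit presence mask per weight."""
    nnz = n_params * (1 - sparsity)
    return (nnz * bytes_per_param + n_params / 8) / 1e9

n = 7_000_000_000  # 7B parameters
print(f"dense FP16:           {dense_gb(n):.1f} GB")            # 14.0 GB
print(f"70% sparse (bitmask): {bitmask_sparse_gb(n, 0.70):.1f} GB")  # 5.1 GB
```

The naive dense baseline needs roughly 14 GB at FP16, while keeping only the ~2.1B nonzero weights plus a presence bitmask lands around 5 GB, which is the kind of headroom that makes resource-constrained deployment practical.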

Benchmarks

While pruning introduces some performance trade-offs compared to the original Llama-2-7b, the model shows competitive results, particularly in code generation:

  • HumanEval (pass@1): 14.4% (vs. 13.4% for Llama-2-7b)
  • MMLU (5-shot): 36.5% (vs. 46.9% for Llama-2-7b)
  • HellaSwag (0-shot): 74.1% (vs. 78.6% for Llama-2-7b)
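For a quick read on the trade-offs, the figures above can be expressed as percentage-point deltas against the dense Llama-2-7b baseline:

```python
# (pruned model score, dense Llama-2-7b baseline score), in percent
scores = {
    "HumanEval pass@1":  (14.4, 13.4),
    "MMLU 5-shot":       (36.5, 46.9),
    "HellaSwag 0-shot":  (74.1, 78.6),
}

for name, (pruned, base) in scores.items():
    print(f"{name}: {pruned - base:+.1f} pp vs Llama-2-7b")
```

This prints a +1.0 pp gain on HumanEval against drops of 10.4 pp on MMLU and 4.5 pp on HellaSwag.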

Good for

  • Resource-constrained environments: Its high sparsity makes it suitable for deployment where computational resources are limited.
  • Efficient fine-tuning: Ideal for users looking to fine-tune a Llama 2 base model with reduced computational overhead and faster training times through sparse transfer.
  • Applications requiring code generation: Shows a slight improvement over the base Llama-2-7b on the HumanEval benchmark.