wang7776/Mistral-7B-Instruct-v0.2-attention-sparsity-20

7B · FP8 · 8192 · License: apache-2.0

Overview

This model, wang7776/Mistral-7B-Instruct-v0.2-attention-sparsity-20, is a 7-billion-parameter instruction-tuned language model derived from Mistral-7B-Instruct-v0.2. Its key differentiator is the application of the Wanda pruning method to its attention layers, inducing 20% sparsity. The pruning requires no retraining or weight updates, yet aims to maintain competitive performance while reducing the model's computational footprint.
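
The sketch below illustrates the Wanda scoring rule the description refers to. It is a minimal, illustrative example only, not the script used to produce this checkpoint: the layer, the 20% sparsity argument, and the `act_norm` calibration statistics are placeholders, and a real run would loop over the model's attention projection matrices.

```python
import torch

def wanda_prune_(weight: torch.Tensor, act_norm: torch.Tensor, sparsity: float = 0.2) -> None:
    """Illustrative Wanda-style pruning of one linear layer's weight matrix.

    weight:   (out_features, in_features), modified in place.
    act_norm: (in_features,) L2 norm of each input feature, gathered from a
              small calibration set (assumed to be available).
    """
    # Wanda scores each weight by |W_ij| * ||X_j||: weight magnitude times
    # the norm of the input activation that weight multiplies.
    scores = weight.abs() * act_norm.unsqueeze(0)

    # Zero the lowest-scoring fraction within each output row. No retraining
    # or weight update follows; the surviving weights are left untouched.
    k = int(sparsity * weight.shape[1])
    if k > 0:
        _, prune_idx = torch.topk(scores, k, dim=1, largest=False)
        weight.scatter_(1, prune_idx, 0.0)
```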

Key Characteristics

  • Pruned Architecture: Features 20% attention sparsity, making it more efficient for inference.
  • Base Model: Built upon Mistral-7B-Instruct-v0.2, an improved instruction-tuned version of Mistral-7B-Instruct-v0.1.
  • Core Technologies: Uses Grouped-Query Attention for faster, more memory-efficient inference; note that the v0.2 base, unlike v0.1, drops Sliding-Window Attention in favor of a longer native context window.
  • Instruction Following: Designed to respond to instructions, using the [INST] and [/INST] token format for prompts (see the prompt sketch after this list).
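
The snippet below shows one common way to produce that prompt format with the Hugging Face transformers chat-template API. It is a usage sketch under the assumption that this repository ships the standard Mistral Instruct chat template; the example messages are placeholders.

```python
from transformers import AutoTokenizer

# Assumes the repo ships the standard Mistral Instruct chat template,
# which wraps each user turn in [INST] ... [/INST].
tokenizer = AutoTokenizer.from_pretrained(
    "wang7776/Mistral-7B-Instruct-v0.2-attention-sparsity-20"
)

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "I'm partial to a good mayonnaise."},
    {"role": "user", "content": "Do you have a recipe for it?"},
]

# Render the conversation into the raw prompt string the model expects,
# roughly "<s>[INST] ... [/INST] ... </s>[INST] ... [/INST]".
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)
```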

Use Cases

This model is suitable for applications where reduced model size and faster inference are critical, without significantly compromising the instruction-following capabilities of the original Mistral-7B-Instruct-v0.2. It's particularly useful for:

  • Resource-constrained environments: Deployments on devices with limited memory or processing power (a loading sketch follows this list).
  • Instruction-based tasks: Generating responses to user prompts and following specific instructions.
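
As a concrete starting point, the following sketch loads the checkpoint in half precision and generates a response to an instruction. It assumes a transformers/accelerate environment with a GPU; since the 20% sparsity lives in the weights themselves, the checkpoint loads through the standard Mistral classes.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "wang7776/Mistral-7B-Instruct-v0.2-attention-sparsity-20"

# Half precision keeps the memory footprint down; device_map="auto"
# (via accelerate) places the weights on the available GPU(s).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain weight pruning in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=200, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```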

Limitations

Like the base Mistral-7B-Instruct-v0.2, this model has no built-in moderation mechanisms. Users should implement their own guardrails for deployments that require moderated outputs.