wang7776/vicuna-7b-v1.3-attention-sparsity-20
Text Generation | Concurrency cost: 1 | Model size: 7B | Quantization: FP8 | Context length: 4k | Published: Jan 25, 2024 | License: apache-2.0 | Architecture: Transformer | Open weights

wang7776/vicuna-7b-v1.3-attention-sparsity-20 is a 7 billion parameter auto-regressive language model based on Vicuna v1.3, a chat model developed by LMSYS. This variant has been pruned to 20% sparsity in its attention layers using the Wanda method, which aims to maintain competitive performance without any retraining. It is intended primarily for research and hobbyist use in natural language processing and chatbots, offering a more efficient alternative to the base Vicuna model.


Overview

This model, wang7776/vicuna-7b-v1.3-attention-sparsity-20, is a 7 billion parameter variant of the Vicuna v1.3 chat assistant, originally developed by LMSYS. Vicuna v1.3 was fine-tuned from LLaMA on approximately 125K user-shared conversations collected from ShareGPT.com. The key differentiator of this specific model is its 20% attention-layer sparsity, achieved with the Wanda pruning method: each weight is scored by the product of its magnitude and the norm of the corresponding input activations, and the lowest-scoring weights are zeroed out. Because Wanda requires no retraining or weight updates, the model can be compressed significantly while aiming to preserve performance and improve efficiency.
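The Wanda scoring rule is simple enough to sketch. The snippet below is a minimal, illustrative NumPy version and not the authors' actual implementation: the function name, the per-output-row pruning granularity, and the interfaces are assumptions. It scores each weight by its magnitude times the L2 norm of the matching input-activation feature, then zeroes the lowest-scoring fraction of weights in each row, with no weight update.

```python
import numpy as np

def wanda_prune(weight: np.ndarray, act_norms: np.ndarray,
                sparsity: float = 0.2) -> np.ndarray:
    """Illustrative Wanda-style pruning (assumed sketch, not the reference code).

    weight:    (out_features, in_features) weight matrix of a linear layer.
    act_norms: (in_features,) L2 norms of each input feature, measured on
               a small calibration set.
    sparsity:  fraction of weights to zero in each output row.
    """
    # Wanda score: |W_ij| * ||X_j||_2 (broadcasts the norms across rows).
    scores = np.abs(weight) * act_norms
    k = int(weight.shape[1] * sparsity)  # weights to drop per row
    if k == 0:
        return weight.copy()
    # Indices of the k lowest-scoring weights in each row.
    drop = np.argpartition(scores, k, axis=1)[:, :k]
    pruned = weight.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)  # zero them, no retraining
    return pruned
```

At 20% sparsity, a 10-column row keeps its 8 highest-scoring weights untouched; the surviving weights are never modified, which is what lets Wanda skip retraining entirely.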

Key Capabilities

  • Efficient Inference: Reduced computational load due to 20% sparsity in attention layers.
  • Chat Assistant: Designed for conversational AI tasks, inheriting Vicuna's chat capabilities.
  • Research & Development: Suitable for exploring sparse model architectures and their practical applications.
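Since the pruned checkpoint keeps the standard Vicuna v1.3 architecture, it should load through the usual Hugging Face `transformers` auto classes. The sketch below assumes that API and the plain USER/ASSISTANT prompt format Vicuna v1.3 expects; the system-prompt wording and generation settings are illustrative, not prescribed by the model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "wang7776/vicuna-7b-v1.3-attention-sparsity-20"

# Vicuna v1.3 uses a plain USER/ASSISTANT prompt format rather than a
# special chat template; this system preamble is an illustrative example.
PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed answers to the user's questions. "
    "USER: Explain attention sparsity in one sentence. ASSISTANT:"
)

def generate(prompt: str = PROMPT, max_new_tokens: int = 128) -> str:
    """Load the pruned checkpoint and complete a single Vicuna-style prompt."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Note that Wanda produces unstructured zeros in the weight matrices; without sparse-aware kernels, a dense runtime like the one above stores and multiplies those zeros as usual, so wall-clock speedups depend on the inference stack.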

Good For

  • Researchers and hobbyists experimenting with pruned language models.
  • Applications where computational efficiency matters and a modest performance trade-off is acceptable.
  • Studying the impact of attention sparsity on large language models.