wang7776/Llama-2-7b-chat-hf-30-attention-sparsity
Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Ctx Length: 4k · Published: Jan 26, 2024 · License: other · Architecture: Transformer

This is a 7-billion-parameter variant of Meta's Llama 2 Chat model that has been pruned to 30% attention sparsity using the Wanda method. This pruning technique reduces the number of active weights without requiring retraining or weight updates, aiming to maintain competitive performance. The model is fine-tuned for dialogue use cases and supports a 4096-token context length, making it suitable for assistant-like chat applications.


Overview

This model is a 7 billion parameter variant of Meta's Llama 2 Chat, specifically modified with 30% attention sparsity. The sparsity was achieved using the Wanda pruning method, which is notable for not requiring any retraining or weight updates while aiming to preserve performance. The base Llama 2 architecture is an optimized transformer, pretrained on 2 trillion tokens of publicly available data with a cutoff of September 2022, and fine-tuned with over one million human-annotated examples up to July 2023.
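To illustrate the idea behind Wanda, here is a minimal sketch of its pruning metric: each weight is scored by its magnitude multiplied by the norm of the corresponding input activation (measured on a small calibration set), and the lowest-scoring weights in each output row are zeroed. This is an illustrative reimplementation in NumPy, not the authors' code; the function name and interface are assumptions.

```python
import numpy as np

def wanda_prune(W: np.ndarray, X: np.ndarray, sparsity: float) -> np.ndarray:
    """Sketch of Wanda-style pruning (hypothetical helper, not official code).

    W: weight matrix, shape (out_features, in_features)
    X: calibration activations, shape (n_samples, in_features)
    sparsity: fraction of weights to zero out per output row (e.g. 0.3)
    """
    # Per-input-feature activation norm over the calibration samples.
    act_norm = np.linalg.norm(X, axis=0)          # (in_features,)
    # Wanda score: |W_ij| * ||X_j||_2, broadcast across output rows.
    score = np.abs(W) * act_norm
    k = int(W.shape[1] * sparsity)                # weights to drop per row
    if k == 0:
        return W.copy()
    # Indices of the k lowest-score weights in each row.
    drop = np.argpartition(score, k, axis=1)[:, :k]
    W_pruned = W.copy()
    np.put_along_axis(W_pruned, drop, 0.0, axis=1)
    return W_pruned
```

Because the low-score weights are simply zeroed, no gradient updates or retraining passes are needed, which is what makes the method cheap to apply to an already-trained 7B model.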

Key Capabilities

  • Dialogue Optimization: Fine-tuned for assistant-like chat applications, building on Llama 2 Chat, which Meta reports outperforms many open-source chat models on benchmarks.
  • Efficient Inference: The 30% attention sparsity can lead to more efficient inference compared to the unpruned base model.
  • Robust Performance: Despite pruning, it is designed to maintain competitive performance, leveraging the strong foundation of the Llama 2 Chat model.
  • Context Length: Supports a 4096-token context window, suitable for extended conversations.
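Since the model inherits Llama 2 Chat's fine-tuning, prompts should follow the Llama 2 chat template (`[INST]`/`<<SYS>>` markers); the pruning does not change the expected prompt format. Below is a minimal single-turn prompt builder; the function name is illustrative.

```python
def build_llama2_prompt(system: str, user: str) -> str:
    """Format a single-turn prompt in the Llama 2 Chat template
    (illustrative helper; real deployments can instead rely on the
    tokenizer's built-in chat template)."""
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"
```

The resulting string can be passed to whatever inference endpoint serves the model, keeping the total prompt plus generation within the 4096-token context window.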

Good For

  • Chatbots and Virtual Assistants: Its fine-tuning for dialogue makes it well-suited for conversational AI.
  • Resource-Constrained Deployments: The attention sparsity reduces the compute needed at inference, which can help in environments where computational resources or memory are limited.
  • English Language Tasks: Intended for commercial and research use primarily in English.

Limitations

  • License Restrictions: Governed by a custom commercial license from Meta, requiring acceptance before use.
  • Potential for Inaccuracies: Like all LLMs, it may produce inaccurate, biased, or objectionable responses, necessitating safety testing for specific applications.
  • English Only: Not intended for use in languages other than English.