wang7776/Llama-2-7b-chat-hf-30-attention-sparsity
Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Ctx Length: 4k · Published: Jan 26, 2024 · License: other · Architecture: Transformer

This is a 7-billion-parameter variant of Meta's Llama 2 Chat model that has been pruned to 30% attention sparsity using the Wanda method. This pruning technique reduces the number of active weights without requiring retraining or weight updates, aiming to maintain competitive performance. The model is fine-tuned for dialogue use cases and supports a 4096-token context length, making it suitable for assistant-like chat applications.


Overview

This model is a 7 billion parameter variant of Meta's Llama 2 Chat, specifically modified with 30% attention sparsity. The sparsity was achieved using the Wanda pruning method, which is notable for not requiring any retraining or weight updates while aiming to preserve performance. The base Llama 2 architecture is an optimized transformer, pretrained on 2 trillion tokens of publicly available data with a cutoff of September 2022, and fine-tuned with over one million human-annotated examples up to July 2023.
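To illustrate the idea behind Wanda, here is a minimal sketch of its pruning metric: each weight is scored by its magnitude multiplied by the norm of the corresponding input activation (measured on a small calibration set), and the lowest-scoring weights in each output row are zeroed. This is an illustrative reimplementation in NumPy, not the authors' code; the function name and interface are assumptions.

```python
import numpy as np

def wanda_prune(W: np.ndarray, X: np.ndarray, sparsity: float) -> np.ndarray:
    """Sketch of Wanda-style pruning (hypothetical helper, not official code).

    W: weight matrix, shape (out_features, in_features)
    X: calibration activations, shape (n_samples, in_features)
    sparsity: fraction of weights to zero out per output row (e.g. 0.3)
    """
    # Per-input-feature activation norm over the calibration samples.
    act_norm = np.linalg.norm(X, axis=0)          # (in_features,)
    # Wanda score: |W_ij| * ||X_j||_2, broadcast across output rows.
    score = np.abs(W) * act_norm
    k = int(W.shape[1] * sparsity)                # weights to drop per row
    if k == 0:
        return W.copy()
    # Indices of the k lowest-score weights in each row.
    drop = np.argpartition(score, k, axis=1)[:, :k]
    W_pruned = W.copy()
    np.put_along_axis(W_pruned, drop, 0.0, axis=1)
    return W_pruned
```

Because the low-score weights are simply zeroed, no gradient updates or retraining passes are needed, which is what makes the method cheap to apply to an already-trained 7B model.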

Key Capabilities

  • Dialogue Optimization: Fine-tuned for assistant-like chat applications, building on Llama 2 Chat, which Meta reports outperforms many open-source chat models on benchmarks.
  • Efficient Inference: The 30% attention sparsity can lead to more efficient inference compared to the unpruned base model.
  • Robust Performance: Despite pruning, it is designed to maintain competitive performance, leveraging the strong foundation of the Llama 2 Chat model.
  • Context Length: Supports a 4096-token context window, suitable for extended conversations.
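Since the model inherits Llama 2 Chat's fine-tuning, prompts should follow the Llama 2 chat template (`[INST]`/`<<SYS>>` markers); the pruning does not change the expected prompt format. Below is a minimal single-turn prompt builder; the function name is illustrative.

```python
def build_llama2_prompt(system: str, user: str) -> str:
    """Format a single-turn prompt in the Llama 2 Chat template
    (illustrative helper; real deployments can instead rely on the
    tokenizer's built-in chat template)."""
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"
```

The resulting string can be passed to whatever inference endpoint serves the model, keeping the total prompt plus generation within the 4096-token context window.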

Good For

  • Chatbots and Virtual Assistants: Its fine-tuning for dialogue makes it well-suited for conversational AI.
  • Resource-Constrained Deployments: The attention sparsity reduces the compute needed at inference, which can help in environments where computational resources or memory are limited.
  • English Language Tasks: Intended for commercial and research use primarily in English.

Limitations

  • License Restrictions: Governed by a custom commercial license from Meta, requiring acceptance before use.
  • Potential for Inaccuracies: Like all LLMs, it may produce inaccurate, biased, or objectionable responses, necessitating safety testing for specific applications.
  • English Only: Not intended for use in languages other than English.