wang7776/Llama-2-7b-chat-hf-10-attention-sparsity
Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Ctx Length: 4k · Published: Jan 26, 2024 · License: other · Architecture: Transformer

The wang7776/Llama-2-7b-chat-hf-10-attention-sparsity model is a 7 billion parameter generative text model based on Meta's Llama 2-Chat, which is fine-tuned for dialogue use cases. This variant has been pruned to 10% attention sparsity using the Wanda method, which aims to maintain competitive performance without retraining. It uses an optimized transformer architecture and supports a 4096-token context length, making it suitable for efficient chat applications.


Overview

This model is a 7 billion parameter variant of Meta's Llama 2-Chat, optimized for dialogue. Its key differentiator is the application of 10% attention sparsity using the Wanda pruning method, which removes weights in a single pass without any additional retraining or weight updates. The result is potentially faster, lighter inference while still aiming for performance competitive with the dense counterpart.
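To illustrate the idea behind Wanda, the sketch below scores each weight by the product of its magnitude and the L2 norm of its input activation over a calibration batch, then zeroes the lowest-scoring fraction per output row. This is a minimal illustration, not the repository's pruning code; the function name and NumPy implementation are assumptions for demonstration.

```python
import numpy as np

def wanda_prune(weights, activations, sparsity=0.10):
    """Sketch of Wanda-style pruning: zero the lowest-scoring weights
    per output row, scored by |W_ij| * ||X_j||_2, with no retraining."""
    # L2 norm of each input feature across the calibration samples
    feature_norms = np.linalg.norm(activations, axis=0)   # (in_features,)
    scores = np.abs(weights) * feature_norms              # (out, in)

    pruned = weights.copy()
    k = int(weights.shape[1] * sparsity)  # weights to drop per row
    if k > 0:
        # Indices of the k lowest scores in each row
        idx = np.argpartition(scores, k, axis=1)[:, :k]
        np.put_along_axis(pruned, idx, 0.0, axis=1)
    return pruned
```

Because the comparison is local to each output row and needs only one forward pass of calibration data for the activation norms, the method avoids the gradient computation or iterative retraining that magnitude-pruning-plus-finetuning pipelines require.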

Key Capabilities

  • Dialogue Optimization: Fine-tuned for assistant-like chat applications.
  • Efficient Inference: Benefits from 10% attention sparsity, potentially leading to reduced computational requirements.
  • Llama 2 Foundation: Built upon the robust Llama 2 architecture, which uses an optimized transformer and incorporates supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) for helpfulness and safety.
  • 4096-token Context: Supports a standard context window for conversational tasks.
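Since the variant retains Llama 2-Chat's dialogue fine-tuning, it expects the standard Llama 2 chat prompt template, with `[INST]`/`[/INST]` tags and an optional `<<SYS>>` block embedded in the first user turn. A minimal single-turn formatter (the function name is illustrative, not part of the model's tooling):

```python
def build_llama2_prompt(user_msg, system_msg=None):
    """Format a single-turn prompt in the Llama 2 chat template."""
    if system_msg:
        # The system prompt is wrapped in <<SYS>> tags inside the first user turn
        user_msg = f"<<SYS>>\n{system_msg}\n<</SYS>>\n\n{user_msg}"
    return f"[INST] {user_msg} [/INST]"
```

For example, `build_llama2_prompt("Hello!")` yields `[INST] Hello! [/INST]`; multi-turn conversations repeat the tag pair per exchange.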

Good For

  • Developers seeking a more efficient version of Llama 2-7b-chat for dialogue generation.
  • Applications where computational resources are constrained and a balance between performance and efficiency is desired.
  • Research into the effects and benefits of pruning techniques like Wanda on large language models.