wang7776/vicuna-7b-v1.3-attention-sparsity-30
Text Generation · Model Size: 7B · Quant: FP8 · Ctx Length: 4k · Published: Jan 26, 2024 · License: apache-2.0 · Architecture: Transformer · Open Weights

The wang7776/vicuna-7b-v1.3-attention-sparsity-30 is a 7 billion parameter Vicuna v1.3 model, developed by wang7776, that has been pruned to 30% sparsity in its attention layers using the Wanda method. This pruning technique reduces model size and computational requirements without requiring retraining or weight updates, while maintaining competitive performance. It is based on the LLaMA architecture and is primarily intended for research and hobbyist use in natural language processing and chatbot development.


Overview

This model, wang7776/vicuna-7b-v1.3-attention-sparsity-30, is a specialized version of the 7 billion parameter Vicuna v1.3 model. It has undergone 30% sparsity pruning specifically within its attention layers using the Wanda pruning method. A key advantage of Wanda is that it achieves significant model compression without any retraining or weight updates, while preserving competitive performance.
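To make the pruning idea concrete, here is a minimal sketch of the Wanda scoring rule: each weight is scored by its magnitude times the L2 norm of the corresponding input activation (gathered from a small calibration set), and the lowest-scoring fraction of weights in each output row is zeroed. This is an illustrative numpy sketch, not the authors' implementation; the function name and signature are assumptions.

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.3):
    """Illustrative Wanda-style pruning sketch (not the reference code).

    W: weight matrix, shape (out_features, in_features)
    X: calibration activations, shape (n_samples, in_features)
    sparsity: fraction of weights to zero per output row (e.g. 0.3 = 30%)
    """
    # Wanda score: |weight| times the L2 norm of its input feature,
    # broadcast across output rows
    metric = np.abs(W) * np.linalg.norm(X, axis=0)
    # Zero the k lowest-scoring weights in each output row
    k = int(W.shape[1] * sparsity)
    pruned = W.copy()
    idx = np.argsort(metric, axis=1)[:, :k]  # smallest scores per row
    np.put_along_axis(pruned, idx, 0.0, axis=1)
    return pruned
```

Because the score depends only on weights and a forward pass over calibration data, no gradients, retraining, or weight updates are needed, which is the property the model card highlights.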

Key Capabilities

  • Efficient Inference: Reduced computational requirements due to 30% sparsity in attention layers.
  • Chat Assistant: Based on the Vicuna v1.3 architecture, fine-tuned from LLaMA on user-shared conversations from ShareGPT.
  • Research & Development: Primarily intended for research and hobbyist exploration in large language models and chatbots.
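For research use, one practical sanity check is measuring how much of the attention weights are actually zero after pruning. The helper below is a hypothetical sketch: it assumes LLaMA-style parameter names (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and takes a state dict mapping names to arrays.

```python
import numpy as np

def attention_sparsity(state_dict):
    """Fraction of exactly-zero entries across attention projection weights.

    Assumes LLaMA-style parameter naming (q_proj, k_proj, v_proj, o_proj);
    state_dict maps parameter names to weight arrays.
    """
    zeros = total = 0
    for name, w in state_dict.items():
        if any(p in name for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
            zeros += int((w == 0).sum())
            total += w.size
    return zeros / total if total else 0.0
```

Run against this checkpoint's attention layers, the returned fraction should be close to the advertised 30% sparsity.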

Training Details

The base Vicuna v1.3 model was fine-tuned from LLaMA using supervised instruction fine-tuning on approximately 125K user-shared conversations collected from ShareGPT.com. The 30% attention sparsity was then applied post-training with the Wanda pruning method, which scores and removes weights without any retraining or weight updates.