RedHatAI/TinyLlama-1.1B-Chat-v1.0-pruned2.4 Overview
This model is a pruned version of TinyLlama-1.1B-Chat-v1.0, developed by RedHatAI. It has 1.1 billion parameters and a context length of 2048 tokens. What sets it apart is its optimization with SparseGPT and SparseML, which apply semi-structured (2:4) sparsity to the model weights: in every contiguous group of four weights, at least two are zero.
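To make the 2:4 mask structure concrete, here is a minimal NumPy sketch that zeroes the two smallest-magnitude weights in each group of four. Note this is an illustration of the mask pattern only; SparseGPT selects which weights to prune using a second-order, calibration-based criterion, not raw magnitude as done here.

```python
import numpy as np

def apply_2_4_mask(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude entries in each group of four.

    Illustrates the 2:4 semi-structured pattern; the real SparseGPT
    algorithm uses a Hessian-based pruning criterion instead of magnitude.
    """
    flat = weights.reshape(-1, 4)                    # groups of 4 weights
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]   # 2 smallest per group
    mask = np.ones_like(flat, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)     # zero those positions
    return (flat * mask).reshape(weights.shape)

w = np.array([[0.9, -0.1, 0.05, -0.7],
              [0.2,  0.3, -0.01, 0.4]])
sparse_w = apply_2_4_mask(w)
# Each group of four now contains exactly two nonzero weights.
```

This fixed pattern is what lets 2:4 sparsity map onto hardware-accelerated sparse kernels, unlike unstructured pruning.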
Key Capabilities & Features
- Sparsified Architecture: Pruned with SparseGPT to a 2:4 semi-structured sparsity pattern for improved efficiency.
- Optimized for NM-vLLM: Specifically designed to leverage the high-throughput serving and low memory usage capabilities of the NM-vLLM engine.
- Chat Fine-tuned: Based on a chat-tuned TinyLlama model, suitable for conversational AI tasks.
- Efficient Inference: Aims to provide faster inference and reduced memory footprint compared to its dense counterpart, especially when used with NM-vLLM.
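As a rough deployment sketch, nm-vllm exposes a vLLM-compatible Python API that accepts a sparsity argument. The package name, `sparsity` keyword, and the `semi_structured_sparse_w16a16` value below are assumptions based on typical nm-vllm usage; check the nm-vllm documentation for the exact names before relying on them.

```shell
# Hedged sketch: install nm-vllm and run offline generation.
pip install nm-vllm

python - <<'EOF'
from vllm import LLM, SamplingParams

# "semi_structured_sparse_w16a16" is an assumed kernel name for 2:4 sparsity.
model = LLM(
    "RedHatAI/TinyLlama-1.1B-Chat-v1.0-pruned2.4",
    sparsity="semi_structured_sparse_w16a16",
)
outputs = model.generate("Hello, my name is", SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
EOF
```

Omitting the sparsity argument would typically fall back to dense execution, losing the memory and throughput benefits of the pruned weights.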
When to Use This Model
This model is particularly well-suited for use cases where:
- Resource Efficiency is Critical: Its pruned nature makes it ideal for deployment on hardware with limited memory or computational resources.
- High-Throughput Inference is Required: When paired with NM-vLLM, it can achieve faster serving speeds for chat applications.
- Small, Capable Chat Models are Needed: Provides a compact yet effective solution for conversational AI tasks without the overhead of larger models.
For details on the sparsification process, see the `recipe.yaml` file in the repository, which outlines the methodology.
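For orientation, a SparseML pruning recipe for this kind of model typically centers on a `SparseGPTModifier` entry. The fragment below is an illustrative sketch of that recipe style, not the contents of this repository's `recipe.yaml`; field names and values are assumptions drawn from common SparseML recipes.

```yaml
# Illustrative SparseML-style recipe fragment (not the actual recipe.yaml).
sparsity_stage:
  obcq_modifiers:
    SparseGPTModifier:
      sparsity: 0.5            # 2 of every 4 weights pruned
      mask_structure: "2:4"    # semi-structured pattern
      sequential_update: true  # prune layers one at a time
```

The `mask_structure: "2:4"` line is what distinguishes this semi-structured recipe from an unstructured pruning run at the same overall sparsity.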