RedHatAI/TinyLlama-1.1B-Chat-v1.0-pruned2.4 Overview
This model is a pruned version of TinyLlama-1.1B-Chat-v1.0, developed by RedHatAI. It has 1.1 billion parameters and a context length of 2048 tokens. What sets it apart is its optimization with SparseGPT and SparseML, which apply semi-structured (2:4) sparsity to the model weights: in every contiguous group of four weights, at least two are zero.
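To make the 2:4 mask structure concrete, here is a minimal NumPy sketch that zeroes the two smallest-magnitude weights in each group of four. Note this is an illustration of the mask pattern only; SparseGPT selects which weights to prune using a second-order, calibration-based criterion, not raw magnitude as done here.

```python
import numpy as np

def apply_2_4_mask(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude entries in each group of four.

    Illustrates the 2:4 semi-structured pattern; the real SparseGPT
    algorithm uses a Hessian-based pruning criterion instead of magnitude.
    """
    flat = weights.reshape(-1, 4)                    # groups of 4 weights
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]   # 2 smallest per group
    mask = np.ones_like(flat, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)     # zero those positions
    return (flat * mask).reshape(weights.shape)

w = np.array([[0.9, -0.1, 0.05, -0.7],
              [0.2,  0.3, -0.01, 0.4]])
sparse_w = apply_2_4_mask(w)
# Each group of four now contains exactly two nonzero weights.
```

This fixed pattern is what lets 2:4 sparsity map onto hardware-accelerated sparse kernels, unlike unstructured pruning.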
Key Capabilities & Features
- Sparsified Architecture: Pruned with SparseGPT to a 2:4 semi-structured sparsity pattern for improved efficiency.
- Optimized for NM-vLLM: Specifically designed to leverage the high-throughput serving and low memory usage capabilities of the NM-vLLM engine.
- Chat Fine-tuned: Based on a chat-tuned TinyLlama model, suitable for conversational AI tasks.
- Efficient Inference: Aims to provide faster inference and reduced memory footprint compared to its dense counterpart, especially when used with NM-vLLM.
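As a rough deployment sketch, nm-vllm exposes a vLLM-compatible Python API that accepts a sparsity argument. The package name, `sparsity` keyword, and the `semi_structured_sparse_w16a16` value below are assumptions based on typical nm-vllm usage; check the nm-vllm documentation for the exact names before relying on them.

```shell
# Hedged sketch: install nm-vllm and run offline generation.
pip install nm-vllm

python - <<'EOF'
from vllm import LLM, SamplingParams

# "semi_structured_sparse_w16a16" is an assumed kernel name for 2:4 sparsity.
model = LLM(
    "RedHatAI/TinyLlama-1.1B-Chat-v1.0-pruned2.4",
    sparsity="semi_structured_sparse_w16a16",
)
outputs = model.generate("Hello, my name is", SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
EOF
```

Omitting the sparsity argument would typically fall back to dense execution, losing the memory and throughput benefits of the pruned weights.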
When to Use This Model
This model is particularly well-suited for use cases where:
- Resource Efficiency is Critical: Its pruned nature makes it ideal for deployment on hardware with limited memory or computational resources.
- High-Throughput Inference is Required: When paired with NM-vLLM, it can achieve faster serving speeds for chat applications.
- Small, Capable Chat Models are Needed: Provides a compact yet effective solution for conversational AI tasks without the overhead of larger models.
For details on the sparsification process, see the `recipe.yaml` file in the repository, which outlines the methodology.
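For orientation, a SparseML pruning recipe for this kind of model typically centers on a `SparseGPTModifier` entry. The fragment below is an illustrative sketch of that recipe style, not the contents of this repository's `recipe.yaml`; field names and values are assumptions drawn from common SparseML recipes.

```yaml
# Illustrative SparseML-style recipe fragment (not the actual recipe.yaml).
sparsity_stage:
  obcq_modifiers:
    SparseGPTModifier:
      sparsity: 0.5            # 2 of every 4 weights pruned
      mask_structure: "2:4"    # semi-structured pattern
      sequential_update: true  # prune layers one at a time
```

The `mask_structure: "2:4"` line is what distinguishes this semi-structured recipe from an unstructured pruning run at the same overall sparsity.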