fraQtl/TinyLlama-1.1B-optimized

Task: text generation · Concurrency cost: 1 · Model size: 1.1B · Quantization: BF16 · Context length: 2k · Published: Apr 10, 2026 · License: other · Architecture: Transformer

fraQtl/TinyLlama-1.1B-optimized is a 1.1-billion-parameter causal language model based on TinyLlama that fraQtl has optimized for reduced KV cache memory usage during inference. The optimization cuts KV cache memory consumption by 3.5x, making the model suitable for resource-constrained environments. It retains the 2048-token context length of the base model and is designed for efficient deployment where memory footprint is critical.


Overview

fraQtl/TinyLlama-1.1B-optimized is fraQtl's specialized version of the 1.1-billion-parameter TinyLlama model. Its primary distinction is a KV cache optimization that significantly reduces memory consumption during inference, offering a more efficient runtime, particularly where GPU memory is the limiting factor.
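
For reference, here is a minimal loading sketch using Hugging Face transformers. It assumes the repository loads through the standard AutoModelForCausalLM path; if the KV cache optimization ships as custom modeling code, passing trust_remote_code=True may additionally be required.

```python
# Minimal loading sketch (assumes the standard transformers loading path).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "fraQtl/TinyLlama-1.1B-optimized"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the card lists BF16 weights
    # trust_remote_code=True,    # uncomment if the repo ships custom modeling code
).to(device)

prompt = "Explain the KV cache in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```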

Key Optimizations & Characteristics

  • KV Cache Optimization: The model's V projection weights have been modified to achieve a 3.5x reduction in KV cache memory during inference, without changing the model's file size (a sizing sketch follows this list).
  • Performance Trade-off: The memory savings come at a slight cost in language modeling quality: perplexity rises from 15.5249 (before) to 15.8782 (after), a minor degradation in exchange for substantial memory savings (a measurement sketch also follows this list).
  • Base Model: Derived from TinyLlama/TinyLlama-1.1B-Chat-v1.0, retaining its core architecture and capabilities.
  • Configuration: The optimization uses k=16 with INT3 (3-bit) precision.
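
To put the 3.5x figure in context, the back-of-the-envelope sizing below uses the public architecture values of the base TinyLlama-1.1B-Chat-v1.0 checkpoint (22 layers, 4 KV heads via grouped-query attention, head dimension 64) together with BF16 cache entries; only the 3.5x reduction factor comes from this card.

```python
# Rough KV cache sizing, assuming the base TinyLlama-1.1B-Chat-v1.0
# architecture (22 layers, 4 KV heads, head dim 64) and BF16 (2-byte)
# cache entries. The 3.5x factor is the reduction reported on this card.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 22, 4, 64
BYTES_PER_ELEM = 2   # BF16
CTX_LEN = 2048       # maximum context length

# Both K and V tensors are cached per layer, per KV head, per token.
per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
baseline = per_token * CTX_LEN
optimized = baseline / 3.5

print(f"baseline KV cache @ {CTX_LEN} tokens: {baseline / 2**20:.1f} MiB")
print(f"with a 3.5x reduction:                {optimized / 2**20:.1f} MiB")
```

At the full 2k context this works out to roughly 44 MiB of KV cache for the base model versus about 13 MiB after the reduction, a saving that compounds across concurrent sequences.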
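
The card does not state which dataset or protocol produced the perplexity figures above, so the sketch below (the perplexity helper is illustrative, not part of this repo) is only a rough sanity check on a toy text sample, not a reproduction of those numbers.

```python
# Quick perplexity sanity check; not the card's evaluation protocol.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing input_ids as labels makes transformers compute the
        # shifted mean cross-entropy loss; exp(loss) is perplexity.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

sample = "The quick brown fox jumps over the lazy dog. " * 50
for mid in ("TinyLlama/TinyLlama-1.1B-Chat-v1.0",
            "fraQtl/TinyLlama-1.1B-optimized"):
    print(mid, f"PPL ~ {perplexity(mid, sample):.2f}")
```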

When to Use This Model

This model is particularly well-suited for use cases where:

  • Memory Efficiency is Critical: Deployment on devices or environments with limited GPU memory, such as edge devices or cost-sensitive cloud instances (a footprint check follows this list).
  • Resource-Constrained Inference: Running multiple models concurrently, or processing long sequences where KV cache growth becomes the bottleneck.
  • Balancing Performance and Resources: You need a small, capable language model and are willing to trade a marginal perplexity increase for a much smaller memory footprint.
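
To verify the footprint claim on your own hardware, one rough approach is to compare peak CUDA memory while generating with this model and with the base checkpoint. The helper below (peak_generation_memory is an illustrative name, not part of this repo) requires a CUDA GPU; generating a longer continuation makes the KV cache difference easier to see.

```python
# Compare peak CUDA memory during generation for two checkpoints.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def peak_generation_memory(model_id: str, prompt: str,
                           max_new_tokens: int = 1024) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16
    ).to("cuda")
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    torch.cuda.reset_peak_memory_stats()
    # Force the full continuation so the KV cache actually grows.
    model.generate(**inputs, max_new_tokens=max_new_tokens,
                   min_new_tokens=max_new_tokens)
    peak_mib = torch.cuda.max_memory_allocated() / 2**20

    del model
    torch.cuda.empty_cache()
    return peak_mib

for mid in ("TinyLlama/TinyLlama-1.1B-Chat-v1.0",
            "fraQtl/TinyLlama-1.1B-optimized"):
    print(mid, f"{peak_generation_memory(mid, 'Hello'):.0f} MiB peak")
```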