fraQtl/Llama-3.2-3B-optimized
fraQtl/Llama-3.2-3B-optimized is a 3.2-billion-parameter causal language model developed by fraQtl, based on the Llama 3.2 architecture. The model is optimized for inference efficiency, using up to 3.5x less KV cache memory at runtime. It retains the 32768-token context length and is designed for applications that require a reduced GPU memory footprint without significant quality degradation.
fraQtl/Llama-3.2-3B-optimized: KV Cache Efficiency
This model, developed by fraQtl, is an optimized version of the Llama 3.2 3B architecture, engineered to reduce KV cache memory consumption during inference. It uses up to 3.5x less KV cache memory at runtime, a significant advantage when deploying language models on resource-constrained hardware or in high-throughput scenarios. The optimization is implemented by modifying the V projection weights; realizing the full savings requires fraQtl's runtime fork.
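For reference, a minimal loading sketch with stock Hugging Face transformers follows. It assumes the checkpoint is loadable through the standard transformers API; per the note above, the full 3.5x KV cache reduction requires fraQtl's runtime fork, so under the vanilla library the model should behave like an ordinary Llama 3.2 3B checkpoint.

```python
# Minimal loading sketch using stock Hugging Face transformers (an assumption;
# the full KV cache savings require fraQtl's runtime fork, per this card).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "fraQtl/Llama-3.2-3B-optimized"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves weight memory vs. FP32
    device_map="auto",           # place layers across available devices
)

prompt = "Explain KV cache memory in one sentence:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```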
Key Characteristics
- Base Model: Derived from meta-llama/Llama-3.2-3B.
- Parameter Count: 3.2 billion parameters.
- Context Length: Supports a 32768 token context window.
- Memory Optimization: Reduces KV cache memory by up to 3.5x during inference (see the sizing sketch after this list).
- Quality Impact: The optimization introduces a perplexity delta of +0.467 (weight-level) relative to the original model, a small trade-off for the substantial memory savings.
- Configuration: Uses k=32, INT3 for its internal configuration.
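To make the memory figures concrete, the back-of-the-envelope sketch below estimates KV cache size from the public meta-llama/Llama-3.2-3B configuration (28 layers, 8 KV heads via grouped-query attention, head dimension 128). The 3.5x factor is the figure quoted on this card; since the exact mechanism behind the k=32, INT3 configuration is not documented here, the optimized number is applied as a simple scaling.

```python
# Back-of-the-envelope KV cache sizing for Llama 3.2 3B. Dimensions are from
# the public meta-llama/Llama-3.2-3B config; the 3.5x factor is this card's
# quoted reduction, applied as a simple scaling (assumption).
LAYERS = 28        # num_hidden_layers
KV_HEADS = 8       # num_key_value_heads (grouped-query attention)
HEAD_DIM = 128     # per-head dimension
DTYPE_BYTES = 2    # FP16/BF16 baseline cache
CONTEXT = 32768    # full context window

# One K and one V vector per KV head, per layer, per token.
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES
baseline_gib = bytes_per_token * CONTEXT / 2**30
optimized_gib = baseline_gib / 3.5

print(f"per-token cache : {bytes_per_token / 1024:.0f} KiB")  # ~112 KiB
print(f"baseline @ 32K  : {baseline_gib:.2f} GiB")            # ~3.50 GiB
print(f"optimized @ 32K : {optimized_gib:.2f} GiB")           # ~1.00 GiB
```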
Ideal Use Cases
- Edge Devices: Deploying LLMs on devices with limited GPU memory.
- Cost-Sensitive Inference: Reducing operational costs by requiring less powerful or fewer GPUs.
- High-Throughput Applications: Serving more concurrent requests within the same memory budget (see the capacity sketch after this list).
- Long Context Processing: Efficiently handling long sequences within memory constraints.
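As a rough illustration of the throughput and long-context points, the sketch below estimates how many full-context sequences fit into a fixed KV cache budget. The 20 GiB budget is an arbitrary example; the per-sequence sizes carry over from the sizing sketch above.

```python
# Hypothetical capacity estimate: concurrent 32768-token sequences that fit
# in a fixed KV cache budget. The 20 GiB budget is an arbitrary example.
BUDGET_GIB = 20.0
BASELINE_SEQ_GIB = 3.5                      # from the sizing sketch above
OPTIMIZED_SEQ_GIB = BASELINE_SEQ_GIB / 3.5  # card's quoted 3.5x reduction

print(f"baseline  : {int(BUDGET_GIB // BASELINE_SEQ_GIB)} sequences")   # 5
print(f"optimized : {int(BUDGET_GIB // OPTIMIZED_SEQ_GIB)} sequences")  # 20
```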