fraQtl/Llama-3.2-3B-optimized
fraQtl/Llama-3.2-3B-optimized is a 3.2-billion-parameter causal language model developed by fraQtl, based on the Llama 3.2 architecture. The model is optimized for inference efficiency, using up to 3.5x less KV cache memory at runtime. It retains the 32768-token context length and is designed for applications that require a reduced GPU memory footprint without significant quality degradation.
fraQtl/Llama-3.2-3B-optimized: KV Cache Efficiency
This model, developed by fraQtl, is an optimized version of the Llama 3.2 3B architecture, engineered to reduce KV cache memory consumption during inference. It uses up to 3.5x less KV cache memory at runtime, a significant advantage when deploying language models on resource-constrained hardware or in high-throughput scenarios. The optimization is implemented by modifying the V projection weights; realizing the full savings requires fraQtl's runtime fork.
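For reference, a minimal loading sketch with stock Hugging Face transformers follows. It assumes the checkpoint is loadable through the standard transformers API; per the note above, the full 3.5x KV cache reduction requires fraQtl's runtime fork, so under the vanilla library the model should behave like an ordinary Llama 3.2 3B checkpoint.

```python
# Minimal loading sketch using stock Hugging Face transformers (an assumption;
# the full KV cache savings require fraQtl's runtime fork, per this card).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "fraQtl/Llama-3.2-3B-optimized"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves weight memory vs. FP32
    device_map="auto",           # place layers across available devices
)

prompt = "Explain KV cache memory in one sentence:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```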
Key Characteristics
- Base Model: Derived from meta-llama/Llama-3.2-3B.
- Parameter Count: 3.2 billion parameters.
- Context Length: Supports a 32768 token context window.
- Memory Optimization: Reduces KV cache memory by up to 3.5x during inference (see the sizing sketch after this list).
- Quality Impact: The optimization introduces a perplexity delta of +0.467 (weight-level) relative to the original model, a small trade-off for the substantial memory savings.
- Configuration: Uses k=32, INT3 for its internal configuration.
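To make the memory figures concrete, the back-of-the-envelope sketch below estimates KV cache size from the public meta-llama/Llama-3.2-3B configuration (28 layers, 8 KV heads via grouped-query attention, head dimension 128). The 3.5x factor is the figure quoted on this card; since the exact mechanism behind the k=32, INT3 configuration is not documented here, the optimized number is applied as a simple scaling.

```python
# Back-of-the-envelope KV cache sizing for Llama 3.2 3B. Dimensions are from
# the public meta-llama/Llama-3.2-3B config; the 3.5x factor is this card's
# quoted reduction, applied as a simple scaling (assumption).
LAYERS = 28        # num_hidden_layers
KV_HEADS = 8       # num_key_value_heads (grouped-query attention)
HEAD_DIM = 128     # per-head dimension
DTYPE_BYTES = 2    # FP16/BF16 baseline cache
CONTEXT = 32768    # full context window

# One K and one V vector per KV head, per layer, per token.
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES
baseline_gib = bytes_per_token * CONTEXT / 2**30
optimized_gib = baseline_gib / 3.5

print(f"per-token cache : {bytes_per_token / 1024:.0f} KiB")  # ~112 KiB
print(f"baseline @ 32K  : {baseline_gib:.2f} GiB")            # ~3.50 GiB
print(f"optimized @ 32K : {optimized_gib:.2f} GiB")           # ~1.00 GiB
```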
Ideal Use Cases
- Edge Devices: Deploying LLMs on devices with limited GPU memory.
- Cost-Sensitive Inference: Reducing operational costs by requiring less powerful or fewer GPUs.
- High-Throughput Applications: Serving more concurrent requests within the same memory budget (see the capacity sketch after this list).
- Long Context Processing: Efficiently handling long sequences within memory constraints.
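As a rough illustration of the throughput and long-context points, the sketch below estimates how many full-context sequences fit into a fixed KV cache budget. The 20 GiB budget is an arbitrary example; the per-sequence sizes carry over from the sizing sketch above.

```python
# Hypothetical capacity estimate: concurrent 32768-token sequences that fit
# in a fixed KV cache budget. The 20 GiB budget is an arbitrary example.
BUDGET_GIB = 20.0
BASELINE_SEQ_GIB = 3.5                      # from the sizing sketch above
OPTIMIZED_SEQ_GIB = BASELINE_SEQ_GIB / 3.5  # card's quoted 3.5x reduction

print(f"baseline  : {int(BUDGET_GIB // BASELINE_SEQ_GIB)} sequences")   # 5
print(f"optimized : {int(BUDGET_GIB // OPTIMIZED_SEQ_GIB)} sequences")  # 20
```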