Name: fraQtl/Llama-3.2-3B-fraQtl-kv API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: fraQtl

Overview

fraQtl/Llama-3.2-3B-fraQtl-kv is a variant of the Llama 3.2 3B model, developed by fraQtl, that optimizes the KV cache for more memory-efficient inference. It is engineered to reduce KV cache memory on the GPU at runtime by up to 3.5x compared to the original meta-llama/Llama-3.2-3B.

Key Features & Optimizations

Reduced KV Cache Memory: Achieves up to 3.5x less KV cache memory usage during inference, which is crucial for deploying large language models on hardware with limited memory.
Weight-Level Modification: The optimization is implemented by modifying V projection weights, ensuring memory savings occur at runtime rather than affecting the model's file size.
Comparable Perplexity: The model reaches a perplexity (PPL) of 14.8613 versus 14.3943 for the original model, an increase of about 0.47 (roughly 3.2%), indicating a small impact on language generation quality.
INT3 Configuration: Uses a k=32, INT3 (3-bit integer) configuration.
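
To put the "up to 3.5x" figure in concrete terms, the following is a rough back-of-the-envelope sketch of KV cache size, not fraQtl's own accounting. The attention shape (28 layers, 8 KV heads, head dimension 128), the FP16 baseline, and the 8192-token context are assumptions for illustration and are not stated on this page; the 3.5x factor is treated as a best case applied to the whole cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Shape parameters that determine KV cache size (assumed values, see below)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


def kv_cache_bytes(shape: AttentionShape, num_tokens: int, bytes_per_value: float = 2.0) -> float:
    """Estimate KV cache size for a single sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2.0 corresponds to an FP16/BF16 cache.
    """
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * num_tokens * bytes_per_value


# Assumption: attention shape of Llama 3.2 3B (not given in this listing).
LLAMA_3_2_3B = AttentionShape(num_layers=28, num_kv_heads=8, head_dim=128)

# "Up to 3.5x" from the listing, treated here as a best-case divisor.
CLAIMED_REDUCTION = 3.5

baseline = kv_cache_bytes(LLAMA_3_2_3B, num_tokens=8192)
reduced = baseline / CLAIMED_REDUCTION

MIB = 2 ** 20
print(f"baseline FP16 KV cache: {baseline / MIB:.0f} MiB")
print(f"with 3.5x reduction:    {reduced / MIB:.0f} MiB")
```

Under these assumptions, one 8192-token sequence needs about 896 MiB of KV cache at FP16 and about 256 MiB at the claimed best-case reduction. The saving scales linearly with context length and batch size, which is why it matters most for long contexts and concurrent requests.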

Usage Considerations

Runtime Fork Required: To leverage the KV cache memory savings, a specific fraQtl runtime fork is necessary, which is expected to be available soon.
Model File Size: The model's file size remains the same as the original Llama-3.2-3B (~6.4GB), because the memory savings are realized in the KV cache at inference time rather than in the stored weights.
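
As a quick sanity check on the ~6.4GB figure, the arithmetic below estimates checkpoint size from parameter count. It is only a sketch: the parameter count of about 3.21 billion and 16-bit storage are assumptions not stated on this page.

```python
# Assumption: Llama 3.2 3B has roughly 3.21 billion parameters.
NUM_PARAMS = 3.21e9

# Assumption: weights are stored in a 16-bit format (2 bytes per parameter).
BYTES_PER_PARAM = 2

weights_gb = NUM_PARAMS * BYTES_PER_PARAM / 1e9  # decimal gigabytes
print(f"estimated checkpoint size: {weights_gb:.2f} GB")
```

The estimate lands at about 6.4 GB, consistent with weights that are modified in place rather than compressed on disk.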

Ideal Use Cases

This model is particularly well-suited for scenarios where:

Memory Efficiency is Critical: Deploying LLMs on edge devices, mobile platforms, or cloud environments with strict memory constraints.
Cost-Effective Inference: Reducing GPU memory usage can lead to lower operational costs for inference.
Maintaining Performance: Users need Llama 3.2 3B-level output quality with a significantly smaller KV cache memory footprint.
