fraQtl/Llama-3.2-3B-fraQtl-kv

Text Generation | Concurrency Cost: 1 | Model Size: 3.2B | Quant: BF16 | Ctx Length: 32k | Tool Calling: Supported | Published: Apr 10, 2026 | License: other | Architecture: Transformer

fraQtl/Llama-3.2-3B-fraQtl-kv is a 3.2 billion parameter Llama 3.2 variant developed by fraQtl and optimized to reduce KV cache memory usage during inference. It uses up to 3.5x less KV cache memory than the original Llama-3.2-3B, which makes it a fit for memory-constrained deployments. Its perplexity stays close to the base model's (14.8613 vs. 14.3943), and the savings apply to runtime memory rather than to file size.


Overview

fraQtl/Llama-3.2-3B-fraQtl-kv is a specialized version of meta-llama/Llama-3.2-3B, developed by fraQtl, that focuses on KV cache optimization for efficient inference. It is engineered to reduce the GPU memory taken by the KV cache at runtime by up to 3.5x compared to the original model.
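
To put the "up to 3.5x" figure in concrete terms, the sketch below estimates the size of an uncompressed BF16 KV cache at the listed 32k context length and then applies the best-case reduction. The layer count, KV head count, and head dimension are assumptions about the base Llama-3.2-3B configuration and should be checked against the model's config.json. The helper names are illustrative, not part of any fraQtl API, and 32k is taken to mean 32 * 1024 tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVCacheShape:
    # Assumed values for Llama-3.2-3B; verify against the model's config.json.
    num_layers: int = 28
    num_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_element: int = 2  # BF16 stores each value in 2 bytes


def kv_cache_bytes(shape: KVCacheShape, seq_len: int, batch_size: int = 1) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    # Factor of 2: one tensor for keys and one for values per layer.
    per_token = (
        2 * shape.num_layers * shape.num_kv_heads
        * shape.head_dim * shape.bytes_per_element
    )
    return per_token * seq_len * batch_size


GIB = 1024 ** 3
baseline = kv_cache_bytes(KVCacheShape(), seq_len=32 * 1024)
reduced = baseline / 3.5  # best-case "up to 3.5x" figure from the model card

print(f"baseline KV cache:   {baseline / GIB:.2f} GiB")  # 3.50 GiB
print(f"with 3.5x reduction: {reduced / GIB:.2f} GiB")   # 1.00 GiB
```

Under these assumptions, a single full-length sequence drops from roughly 3.5 GiB of KV cache to roughly 1 GiB. The estimate scales linearly with batch size and sequence length, so the absolute saving grows with concurrency.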

Key Features & Optimizations

  • Reduced KV Cache Memory: Achieves up to 3.5x less KV cache memory usage during inference, which is crucial for deploying large language models on hardware with limited memory.
  • Weight-Level Modification: The optimization is implemented by modifying the V projection weights. The memory savings appear at runtime, and the model's file size is unchanged.
  • Comparable Perplexity: The model reaches a perplexity (PPL) of 14.8613 against the original model's 14.3943, an increase of about 3.2%, indicating a small impact on language modeling quality.
  • INT3 Configuration: Uses a k=32, INT3 configuration.
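
The perplexity gap quoted above can be checked with a few lines of arithmetic. This is only a sanity check on the two published numbers. The evaluation dataset and settings are not stated in the source, so the variable names below are illustrative.

```python
# Perplexity values as quoted in the feature list above.
base_ppl = 14.3943    # original Llama-3.2-3B
fraqtl_ppl = 14.8613  # fraQtl KV-optimized variant

abs_increase = fraqtl_ppl - base_ppl
rel_increase = abs_increase / base_ppl

print(f"absolute increase: {abs_increase:.4f}")  # 0.4670
print(f"relative increase: {rel_increase:.2%}")  # 3.24%
```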

Usage Considerations

  • Runtime Fork Required: The KV cache memory savings require a specific fraQtl runtime fork, which is expected to be available soon.
  • Model File Size: The model's file size remains the same as the original Llama-3.2-3B (~6.4GB), because the savings apply to the KV cache at inference time, not to the stored weights.
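
The ~6.4GB file size is consistent with the listed model size and BF16 precision, as the rough estimate below shows. It assumes the nominal 3.2B parameter count from the listing and 2 bytes per parameter, and it ignores file metadata. The function name is illustrative.

```python
def weight_file_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate on-disk size of the weights (BF16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param


size = weight_file_bytes(3.2e9)  # nominal parameter count from the listing
print(f"approx. weight size: {size / 1e9:.1f} GB")  # 6.4 GB
```

Because this figure depends only on parameter count and precision, it does not change with the KV cache optimization. The memory that shrinks is allocated at inference time and grows with context length.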

Ideal Use Cases

This model is particularly well-suited for scenarios where:

  • Memory Efficiency is Critical: Deploying LLMs on edge devices, mobile platforms, or cloud environments with strict memory constraints.
  • Cost-Effective Inference: Reducing GPU memory usage can lead to lower operational costs for inference.
  • Maintaining Performance: Users need Llama 3.2 3B-level quality with a much smaller KV cache memory footprint.