Name: fraQtl/Qwen-2.5-3B-optimized API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: fraQtl

fraQtl/Qwen-2.5-3B-optimized: KV Cache Efficiency

The fraQtl/Qwen-2.5-3B-optimized model is a specialized version of the Qwen 2.5 3B language model, engineered by fraQtl to reduce KV cache memory consumption during inference. The optimization cuts the GPU memory used by the KV cache by up to 3.5x, addressing a common bottleneck in deploying large language models, especially in resource-limited settings.
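To give a sense of scale for the KV cache, the following is a rough sizing sketch rather than fraQtl's own accounting. The formula is the standard one for a decoder-only transformer (keys plus values, per layer, per KV head, per token). The shape values (36 layers, 2 KV heads, head dimension 128), the 32,768-token context, and the 2-byte FP16/BF16 element size are assumptions that should be checked against the model's config.json. The 3.5x figure is applied as the stated best case ("up to").

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to store keys and values for every layer and token."""
    # Leading factor of 2: one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Assumed shape values for illustration; verify against config.json.
ASSUMED_SHAPE = dict(num_layers=36, num_kv_heads=2, head_dim=128)

baseline = kv_cache_bytes(seq_len=32_768, **ASSUMED_SHAPE)
best_case = baseline / 3.5  # "up to 3.5x" treated as an upper bound

print(f"baseline KV cache: {baseline / 2**30:.3f} GiB")   # 1.125 GiB
print(f"best-case KV cache: {best_case / 2**30:.3f} GiB")  # 0.321 GiB
```

Under these assumptions a single 32,768-token sequence drops from about 1.125 GiB of KV cache to about 0.321 GiB in the best case. Actual savings depend on the runtime and on sequence length.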

Key Optimizations & Features

KV Cache Reduction: Uses up to 3.5x less KV cache memory during inference. This reduction requires a fraQtl runtime fork (shipping soon).
Weight-Level Modification: The optimization is implemented by modifying the V projection weights, so the model file stays at approximately 6.2 GB, the same size as the original.
Minimal Performance Impact: Perplexity (PPL) rises by +0.308 relative to the original Qwen/Qwen2.5-3B (from 14.4222 to 14.7302, about 2.1%), indicating that most of the original model's quality is retained.
Configuration: Uses k=32 and INT3 as its internal settings.
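The perplexity figures above can be checked with a few lines of arithmetic. This sketch uses only the two PPL values quoted in the list; the variable names are illustrative.

```python
baseline_ppl = 14.4222   # original Qwen/Qwen2.5-3B
optimized_ppl = 14.7302  # fraQtl/Qwen-2.5-3B-optimized

delta = optimized_ppl - baseline_ppl
relative = delta / baseline_ppl

print(f"absolute delta: {delta:+.4f}")   # +0.3080
print(f"relative delta: {relative:.2%}")  # 2.14%
```

Expressing the delta as a relative change (about 2.1%) makes it easier to compare against other optimization methods that report results on different baselines.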

Ideal Use Cases

Memory-Constrained Deployments: Well suited to environments where GPU memory is at a premium, such as edge devices or cost-sensitive cloud instances.
High-Throughput Inference: A smaller KV cache footprint can allow larger batch sizes or more concurrent requests on the same hardware.
Efficient LLM Integration: Suitable for developers looking to integrate a capable 3B-parameter model with improved operational efficiency.
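The throughput point can be illustrated with a back-of-the-envelope capacity estimate. This is a sketch under stated assumptions, not a benchmark: the 8 GiB KV cache budget and the 1 GiB per-sequence cache size are hypothetical round numbers, the function name is illustrative, and 3.5x is the stated upper bound on the reduction.

```python
def max_concurrent_sequences(budget_bytes, per_sequence_bytes,
                             reduction_factor=1.0):
    """How many sequences fit in a fixed KV cache memory budget."""
    # Scaling the budget is equivalent to shrinking each sequence's cache,
    # and avoids rounding error from dividing the per-sequence size.
    return int(budget_bytes * reduction_factor // per_sequence_bytes)


GIB = 2**30
budget = 8 * GIB        # hypothetical memory left over for the KV cache
per_sequence = 1 * GIB  # hypothetical unoptimized cache per sequence

baseline_slots = max_concurrent_sequences(budget, per_sequence)
best_case_slots = max_concurrent_sequences(budget, per_sequence,
                                           reduction_factor=3.5)

print(baseline_slots)   # 8
print(best_case_slots)  # 28
```

In practice the gain is bounded by other factors such as compute throughput and activation memory, so the best-case slot count should be read as a ceiling.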
