neroai14/Nero-Qwen2.5-1.5B-Surgical
Nero-Qwen2.5-1.5B-Surgical by neroai14 is a 1.5 billion parameter Qwen2.5-based causal language model with a 32768 token context length. It is surgically optimized using the Nero Hybrid Engine, reducing its VRAM footprint by 39.32% while preserving core logic and reasoning. This model is specifically designed for efficient deployment in resource-constrained environments, offering significant VRAM savings through a hybrid low-rank SVD decomposition and INT8 quantization method.
Loading preview...
Overview
This model, Nero-Qwen2.5-1.5B-Surgical, is an optimized version of the Qwen2.5-1.5B-Instruct model, developed by neroai14. It leverages the Nero Hybrid Engine to significantly reduce its VRAM footprint while maintaining its core reasoning capabilities.
Key Optimizations
- VRAM Savings: Achieves a 39.32% reduction in VRAM usage, shrinking from ~3.09 GB to ~1.74 GB.
- Surgical Compression: Employs a unique "surgical" approach rather than blind quantization.
- Hybrid Low-Rank SVD Decomposition: Filters out redundant parameters (noise) using an "Elbow Method" for optimal rank selection.
- Dynamic Protection: Critical layers, such as
self_attnandlm_head, are preserved at higher precision to prevent loss of essential model intelligence. - Hybrid INT8 Quantization: Applies INT8 quantization to the remaining MLP weights for substantial storage gains.
Use Cases
This model is particularly well-suited for scenarios where:
- Resource Efficiency is Critical: Ideal for deployment on devices or platforms with limited VRAM.
- Cost-Effective Inference: Reduces operational costs associated with memory usage.
- Maintaining Core Performance: Designed to retain the original model's logical and reasoning abilities despite compression.
Usage
Integration is straightforward using the transformers library, with specific instructions for loading the model with dtype="auto" to leverage its optimized format.