Name: neroai14/Nero-Qwen2.5-1.5B-Surgical API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: neroai14

Overview

This model, Nero-Qwen2.5-1.5B-Surgical, is an optimized version of the Qwen2.5-1.5B-Instruct model, developed by neroai14. It leverages the Nero Hybrid Engine to significantly reduce its VRAM footprint while maintaining its core reasoning capabilities.

Key Optimizations

VRAM Savings: Achieves a 39.32% reduction in VRAM usage, shrinking from ~3.09 GB to ~1.74 GB.
Surgical Compression: Employs a unique "surgical" approach rather than blind quantization.
Hybrid Low-Rank SVD Decomposition: Filters out redundant parameters (noise) using an "Elbow Method" for optimal rank selection.
Dynamic Protection: Critical layers, such as self_attn and lm_head, are preserved at higher precision to prevent loss of essential model intelligence.
Hybrid INT8 Quantization: Applies INT8 quantization to the remaining MLP weights for substantial storage gains.

Use Cases

This model is particularly well-suited for scenarios where:

Resource Efficiency is Critical: Ideal for deployment on devices or platforms with limited VRAM.
Cost-Effective Inference: Reduces operational costs associated with memory usage.
Maintaining Core Performance: Designed to retain the original model's logical and reasoning abilities despite compression.

Usage

Integration is straightforward using the transformers library, with specific instructions for loading the model with dtype="auto" to leverage its optimized format.

Overview

Overview

Key Optimizations

Use Cases

Usage

Full Model Card (README)