neroai14/Nero-Qwen2.5-1.5B-Surgical

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:1.5BQuant:BF16Ctx Length:32kPublished:Apr 17, 2026License:apache-2.0Architecture:Transformer0.0K Open Weights Warm

Nero-Qwen2.5-1.5B-Surgical by neroai14 is a 1.5 billion parameter Qwen2.5-based causal language model with a 32768 token context length. It is surgically optimized using the Nero Hybrid Engine, reducing its VRAM footprint by 39.32% while preserving core logic and reasoning. This model is specifically designed for efficient deployment in resource-constrained environments, offering significant VRAM savings through a hybrid low-rank SVD decomposition and INT8 quantization method.

Loading preview...

Overview

This model, Nero-Qwen2.5-1.5B-Surgical, is an optimized version of the Qwen2.5-1.5B-Instruct model, developed by neroai14. It leverages the Nero Hybrid Engine to significantly reduce its VRAM footprint while maintaining its core reasoning capabilities.

Key Optimizations

  • VRAM Savings: Achieves a 39.32% reduction in VRAM usage, shrinking from ~3.09 GB to ~1.74 GB.
  • Surgical Compression: Employs a unique "surgical" approach rather than blind quantization.
  • Hybrid Low-Rank SVD Decomposition: Filters out redundant parameters (noise) using an "Elbow Method" for optimal rank selection.
  • Dynamic Protection: Critical layers, such as self_attn and lm_head, are preserved at higher precision to prevent loss of essential model intelligence.
  • Hybrid INT8 Quantization: Applies INT8 quantization to the remaining MLP weights for substantial storage gains.

Use Cases

This model is particularly well-suited for scenarios where:

  • Resource Efficiency is Critical: Ideal for deployment on devices or platforms with limited VRAM.
  • Cost-Effective Inference: Reduces operational costs associated with memory usage.
  • Maintaining Core Performance: Designed to retain the original model's logical and reasoning abilities despite compression.

Usage

Integration is straightforward using the transformers library, with specific instructions for loading the model with dtype="auto" to leverage its optimized format.