zhangsq-nju/Qwen3-1.7B-EdgeRazor-1.88bit

Text Generation | Concurrency Cost: 1 | Model Size: 2B | Quant: BF16 | Ctx Length: 32k | Published: Apr 13, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights

The zhangsq-nju/Qwen3-1.7B-EdgeRazor-1.88bit model is a highly quantized version of the Qwen3-1.7B base model, developed by zhangsq-nju using the EdgeRazor framework. It uses 1.88-bit quantization for all decoder layers and 4-bit quantization for the embedding and lm_head, significantly reducing model size and computational requirements. The model is optimized for efficient deployment on edge devices, giving up some accuracy in exchange for the ability to run under extreme resource constraints. It is particularly suited to applications where a minimal memory footprint and fast inference are critical.
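To give a rough sense of the size reduction, the sketch below estimates theoretical bit-packed weight storage against BF16. It is a back-of-the-envelope calculation under stated assumptions, not a measurement of the released files: the parameter split (about 1.4B non-embedding parameters plus one 151,936 × 2,048 embedding matrix tied with lm_head) is an approximation of the public Qwen3-1.7B configuration, and the helper names `packed_weight_bytes` and `bf16_weight_bytes` are hypothetical.

```python
# Illustrative helpers (hypothetical names, not part of EdgeRazor).
def packed_weight_bytes(decoder_params: int, embed_params: int,
                        decoder_bits: float = 1.88, embed_bits: float = 4.0) -> float:
    """Theoretical bit-packed weight storage in bytes.

    Treats every decoder parameter as 1.88-bit (including small tensors such
    as norms, which are usually kept in higher precision) and ignores
    quantization scales and packing/alignment overhead.
    """
    total_bits = decoder_params * decoder_bits + embed_params * embed_bits
    return total_bits / 8


def bf16_weight_bytes(total_params: int) -> float:
    """BF16 stores every weight in 16 bits, i.e. 2 bytes."""
    return total_params * 2.0


# ASSUMPTION: approximate Qwen3-1.7B split, ~1.4B non-embedding parameters
# plus one 151,936 x 2,048 embedding matrix tied with lm_head.
DECODER_PARAMS = 1_400_000_000
EMBED_PARAMS = 151_936 * 2_048

bf16 = bf16_weight_bytes(DECODER_PARAMS + EMBED_PARAMS)
packed = packed_weight_bytes(DECODER_PARAMS, EMBED_PARAMS)
print(f"BF16 weights:      {bf16 / 2**30:.2f} GiB")
print(f"Packed 1.88/4-bit: {packed / 2**30:.2f} GiB")
print(f"Reduction factor:  {bf16 / packed:.1f}x")
```

Under these assumptions the estimate is roughly 0.45 GiB versus about 3.2 GiB in BF16, around a 7x reduction. The listing's "Quant: BF16" suggests the hosted checkpoint may not be stored bit-packed, so treat this as a best-case figure; if embedding and lm_head are stored as separate matrices, pass twice the embedding count.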


Model Overview

This model, zhangsq-nju/Qwen3-1.7B-EdgeRazor-1.88bit, is a highly optimized, quantized version of the Qwen/Qwen3-1.7B base model. Developed by zhangsq-nju using the EdgeRazor framework, its primary innovation lies in its aggressive mixed-precision quantization strategy.
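The card does not describe EdgeRazor's actual quantizers, so the following is only a hedged illustration of what a mixed-precision scheme can look like. It swaps in two well-known techniques: absmean ternary quantization (as popularized by BitNet b1.58) for the 1.58-bit weights, and symmetric absmax 4-bit quantization for the higher-precision weights. The routing function `quantize_layer` and its name-based rule are hypothetical; in particular, the sketch does not model how EdgeRazor decides which decoder weights receive 4 bits.

```python
def quantize_ternary(weights: list[float], eps: float = 1e-8) -> tuple[list[int], float]:
    """Absmean ternary quantization (BitNet b1.58 style): codes in {-1, 0, 1}."""
    scale = sum(abs(w) for w in weights) / len(weights)  # mean absolute value
    # Round to nearest, then clip into the ternary range.
    codes = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return codes, scale


def quantize_int4(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric absmax 4-bit quantization: integer codes in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # fall back to 1.0 for all-zero input
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Map integer codes back to approximate real-valued weights."""
    return [c * scale for c in codes]


# HYPOTHETICAL routing rule: 4-bit for embedding/lm_head, ternary elsewhere.
def quantize_layer(name: str, weights: list[float]) -> tuple[list[int], float]:
    if name in ("embed_tokens", "lm_head"):
        return quantize_int4(weights)
    return quantize_ternary(weights)


w = [0.9, -0.05, 0.4, -1.2]
print(quantize_layer("lm_head", w))        # 4-bit codes and scale
print(quantize_layer("layers.0.mlp", w))   # ternary codes and scale
```

Ternary codes need only about log2(3) ≈ 1.58 bits each, which is where the "1.58-bit" label comes from; the per-tensor scale here is a simplification, and real schemes often use per-channel or per-group scales.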

Key Quantization Details

  • Decoder Layers: Quantized to an ultra-low 1.88-bit precision.
  • Embedding and LM Head: Quantized to 4-bit precision.
  • The 1.88-bit figure is an average over a mixed configuration of 12.5% 4-bit and 87.5% 1.58-bit precision (0.125 × 4 + 0.875 × 1.58 ≈ 1.88), as detailed in the EdgeRazor paper.
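The average bit-width above is a weighted mean of the per-precision fractions. The small helper below (a hypothetical name, assuming the percentages are fractions of weights) reproduces that arithmetic:

```python
def average_bits(mix: dict[float, float]) -> float:
    """Weighted average bit-width; `mix` maps bit-width -> fraction of weights."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(bits * frac for bits, frac in mix.items())


# 12.5% of weights at 4 bits, 87.5% at 1.58 bits.
decoder_mix = {4.0: 0.125, 1.58: 0.875}
print(f"{average_bits(decoder_mix):.4f}")  # 0.5 + 1.3825 = 1.8825, reported as 1.88
```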

Performance Characteristics

While achieving significant compression, the 1.88-bit EdgeRazor model scores lower on average than the full-precision Qwen3-1.7B. On the aggregated benchmark, Qwen3-1.7B scores 58.64, whereas the 1.88-bit EdgeRazor variant scores 47.14 with 16-16-16 W-A-KV (weight/activation/KV-cache bit-widths) or 47.03 with 8-8-8 W-A-KV, roughly 11.5 points lower. Dropping W-A-KV from 16 to 8 bits costs only a further 0.11 points. This trade-off prioritizes extreme efficiency over peak accuracy, making the model suitable for resource-constrained environments.
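The size of the accuracy trade-off can be recomputed directly from the scores quoted in the paragraph above; the sketch below derives no new measurements, it only converts those scores into absolute and relative drops (the helper name `score_drop` is illustrative).

```python
def score_drop(baseline: float, score: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop as a fraction of baseline)."""
    drop = baseline - score
    return drop, drop / baseline


BASELINE = 58.64  # Qwen3-1.7B, full precision, aggregated benchmark
VARIANTS = {
    "1.88-bit, 16-16-16 W-A-KV": 47.14,
    "1.88-bit, 8-8-8 W-A-KV": 47.03,
}

for name, score in VARIANTS.items():
    points, rel = score_drop(BASELINE, score)
    print(f"{name}: -{points:.2f} points ({rel:.1%} relative, {score / BASELINE:.1%} retained)")
```

Both variants retain roughly 80% of the full-precision aggregate score.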

Ideal Use Cases

  • Edge Device Deployment: Designed for scenarios where computational resources and memory are severely limited.
  • Low-Latency Inference: Benefits applications requiring very fast response times due to its compact size.
  • Resource-Constrained AI: Suitable for integrating LLM capabilities into embedded systems or mobile applications where larger models are impractical.