zhangsq-nju/Qwen3-0.6B-EdgeRazor-1.88bit

TEXT GENERATIONConcurrency Cost:1Model Size:0.8BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:Apr 13, 2026License:apache-2.0Architecture:Transformer Open Weights Cold

The zhangsq-nju/Qwen3-0.6B-EdgeRazor-1.88bit model is a 0.6 billion parameter language model based on the Qwen3 architecture, developed by zhangsq-nju. It features a highly optimized 1.88-bit mixed-precision quantization for all decoder layers and 4-bit for embedding and lm_head, achieved through the EdgeRazor framework. This model is specifically designed for efficient deployment on edge devices, offering a balance between performance and extreme resource constraint.

Loading preview...

Model Overview: Qwen3-0.6B-EdgeRazor-1.88bit

This model is a highly quantized version of the Qwen3-0.6B base model, developed by zhangsq-nju using their EdgeRazor framework. Its primary differentiator is the aggressive mixed-precision quantization, specifically utilizing a 1.88-bit bit-width for all decoder layers and 4-bit for embedding and lm_head. This makes it exceptionally lightweight and suitable for environments with severe memory and computational constraints.

Key Capabilities & Features

  • Extreme Quantization: Achieves a 1.88-bit average bit-width for core components, significantly reducing model size and inference cost.
  • Edge Deployment: Optimized for deployment on resource-limited edge devices where traditional LLMs are impractical.
  • Performance Trade-off: While highly compressed, it maintains a competitive average performance of 41.76 across various benchmarks (ARC-e, HellaS., MMLU, GSM8K, etc.) compared to its less quantized counterparts, demonstrating effective quantization-aware distillation.
  • Easy Integration: Provides a straightforward transformers library quickstart for inference, including support for activation and KV cache quantization via trust_remote_code=True.

Should You Use This Model?

This model is ideal for use cases where:

  • Resource Constraints are Paramount: You need an LLM to run on devices with very limited memory or processing power.
  • Efficiency is Critical: Minimizing inference latency and energy consumption is a top priority.
  • Small Model Footprint: A compact model size is essential for deployment or distribution.

It's important to note that while performance is optimized for its size, there is an inherent trade-off compared to full-precision or less quantized models. Evaluate its benchmark scores against your specific task requirements.