zhangsq-nju/Qwen3-0.6B-EdgeRazor-1.58bit

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:0.8BQuant:BF16Ctx Length:32kPublished:Apr 13, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

The zhangsq-nju/Qwen3-0.6B-EdgeRazor-1.58bit model is a highly quantized version of the Qwen3-0.6B base model, developed by zhangsq-nju using the EdgeRazor framework. It features a mixed-precision quantization scheme, with all decoder layers quantized to 1.58-bit and embedding/lm_head layers at 4-bit, significantly reducing its memory footprint. This model is specifically optimized for deployment on edge devices and resource-constrained environments, offering a balance between performance and extreme efficiency for lightweight LLM applications.

Loading preview...

EdgeRazor for Lightweight LLMs

This model, zhangsq-nju/Qwen3-0.6B-EdgeRazor-1.58bit, is a highly optimized, quantized version of the Qwen3-0.6B base model. Developed by zhangsq-nju using the EdgeRazor framework, it implements a mixed-precision quantization strategy to achieve extreme efficiency. Specifically, all decoder layers are quantized to a very low 1.58-bit, while the embedding and lm_head layers maintain 4-bit precision. This aggressive quantization significantly reduces the model's size and computational requirements, making it ideal for deployment on edge devices and in environments with limited resources.

Key Capabilities

  • Extreme Quantization: Achieves a 1.58-bit average bit-width across its core layers, drastically minimizing memory footprint.
  • Edge Deployment: Designed for efficient inference on resource-constrained hardware.
  • Instruction-Tuned: Optimized for instruct mode, as indicated by the enable_thinking=False setting in the quickstart example.
  • Performance-Efficiency Trade-off: Provides a balance between maintaining reasonable performance on various benchmarks (e.g., ARC-e, HellaS., MMLU) and achieving ultra-low bit-width.

Good for

  • Deploying LLMs on edge devices or embedded systems.
  • Applications requiring minimal memory and computational overhead.
  • Scenarios where a slight trade-off in raw performance is acceptable for significant efficiency gains.
  • Research and development in ultra-low-bit quantization for large language models.