zhangsq-nju/Qwen3-0.6B-EdgeRazor-4bit

Task: Text Generation | Concurrency Cost: 1 | Model Size: 0.8B | Quant: BF16 | Ctx Length: 32k | Published: Apr 13, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights

The zhangsq-nju/Qwen3-0.6B-EdgeRazor-4bit is a 0.6 billion parameter language model based on the Qwen3 architecture, developed by zhangsq-nju. This model is specifically optimized for efficient deployment on edge devices through 4-bit mixed-precision quantization across all embedding, decoder, and lm_head layers. It aims to provide competitive performance for lightweight LLM applications while significantly reducing memory footprint and computational requirements.


Model Overview

This model, zhangsq-nju/Qwen3-0.6B-EdgeRazor-4bit, is a 0.6 billion parameter language model derived from the Qwen/Qwen3-0.6B base model. It has been fine-tuned and quantized using the EdgeRazor framework, developed by zhangsq-nju, to achieve high efficiency for edge deployments.

Key Features & Quantization

  • Base Model: Qwen/Qwen3-0.6B.
  • Quantization: Utilizes a 4-bit mixed-precision quantization scheme, applying 4-bit quantization to all embedding, decoder, and lm_head layers. This is the most aggressive 4-bit configuration offered by EdgeRazor, aiming for maximum compression.
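  • Performance: Despite aggressive 4-bit quantization, the model is reported to remain competitive with the original 16-bit Qwen3-0.6B across the provided benchmarks. For instance, the EdgeRazor configuration labeled 4-16-16 achieves an average score of 47.83, slightly above the base Qwen3-0.6B's 47.35 (a difference of 0.48 points). The 4-16-16 label suggests that this figure was measured with some components kept at 16-bit, so it may not reflect the all-4-bit configuration of this checkpoint.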
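
To make the idea of 4-bit weight quantization concrete, here is a minimal sketch of generic symmetric round-to-nearest quantization with one scale per group of weights. This is an illustration only and is not EdgeRazor's actual method, which the text above does not describe; the function names, the group size, and the sample weights are all hypothetical.

```python
# Generic symmetric 4-bit quantization sketch (NOT the EdgeRazor algorithm).
# Signed 4-bit codes cover the integer range [-8, 7].

def quantize_4bit(weights, group_size=4):
    """Map floats to 4-bit integer codes, with one float scale per group."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Scale so that the largest magnitude in the group maps to code 7.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)
            codes.append(max(-8, min(7, q)))  # clamp to the 4-bit range
    return codes, scales


def dequantize_4bit(codes, scales, group_size=4):
    """Reconstruct approximate floats from the codes and per-group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]


# Hypothetical sample weights, two groups of four.
weights = [0.7, -0.35, 0.1, 0.0, 1.4, -1.4, 0.2, 0.6]
codes, scales = quantize_4bit(weights)
restored = dequantize_4bit(codes, scales)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scales, max_error)
```

With round-to-nearest, the reconstruction error of each weight is bounded by half of its group's scale, which is why smaller groups (more scales) trade extra storage for accuracy.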

Use Cases

This model is particularly well-suited for scenarios requiring:

  • Resource-constrained environments: Ideal for deployment on edge devices, mobile applications, or embedded systems where memory and computational power are limited.
  • Efficient inference: 4-bit quantization substantially reduces the model's memory footprint and can speed up inference, particularly where memory bandwidth is the bottleneck.
  • Lightweight LLM applications: Suitable for tasks where a smaller, faster model is preferred over larger, more computationally intensive alternatives, while maintaining reasonable performance.
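
The footprint reduction mentioned above can be estimated with simple arithmetic. The sketch below assumes a nominal 0.6 billion parameters and counts weight storage only; it ignores quantization scales, activations, the KV cache, and runtime overhead, and it assumes the weights are actually stored in packed 4-bit form. The function name is hypothetical.

```python
# Back-of-the-envelope weight-storage estimate (weights only; an idealized
# lower bound, not a measurement of this checkpoint).

def weight_memory_mib(num_params, bits_per_weight):
    """Return the storage needed for the weights in MiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 2)


NUM_PARAMS = 0.6e9  # nominal parameter count taken from the model name

bf16_mib = weight_memory_mib(NUM_PARAMS, 16)
int4_mib = weight_memory_mib(NUM_PARAMS, 4)

print(f"16-bit weights: {bf16_mib:.0f} MiB")
print(f"4-bit weights:  {int4_mib:.0f} MiB")
print(f"reduction:      {bf16_mib / int4_mib:.1f}x")
```

Under these assumptions the weights shrink from roughly 1144 MiB to roughly 286 MiB, a 4x reduction. Real savings are somewhat smaller once per-group scales and any components kept at higher precision are included.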