zhangsq-nju/Qwen3-1.7B-EdgeRazor-2.79bit

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:2BQuant:BF16Ctx Length:32kPublished:Apr 13, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

The zhangsq-nju/Qwen3-1.7B-EdgeRazor-2.79bit model is a 1.7 billion parameter language model based on Qwen3-1.7B, developed by zhangsq-nju. It features a mixed-precision quantization scheme, specifically 2.79-bit for all decoder layers and 4-bit for embedding and lm_head, achieved through the EdgeRazor framework. This quantization significantly reduces model size while maintaining competitive performance, making it suitable for deployment on edge devices or environments with limited computational resources. Its primary use case is efficient inference for general language tasks where resource constraints are a concern.

Loading preview...

Overview

zhangsq-nju/Qwen3-1.7B-EdgeRazor-2.79bit is a highly optimized, lightweight large language model derived from the Qwen3-1.7B base model. Developed by zhangsq-nju using their EdgeRazor framework, this model implements a mixed-precision quantization strategy to significantly reduce its memory footprint and computational requirements. Specifically, it utilizes a 2.79-bit quantization for all decoder layers, combined with 4-bit quantization for the embedding and lm_head components.

This model is designed for efficient deployment in resource-constrained environments, such as edge devices, where minimizing model size and maximizing inference speed are critical. The EdgeRazor framework employs quantization-aware distillation to achieve this balance, aiming to preserve performance while drastically reducing bit-width.

Key Capabilities

  • Highly Quantized: Achieves a 2.79-bit average precision across decoder layers, leading to a compact model size.
  • Efficient Inference: Optimized for faster execution and lower memory consumption on hardware with limited resources.
  • Performance Retention: Benchmarks show competitive performance compared to the full-precision Qwen3-1.7B, with an average score of 53.33 across various tasks for the 2.79-bit configuration.
  • Instruction-Tuned: Supports instruction-following for chat-based applications.

Good for

  • Edge Device Deployment: Ideal for running LLM applications directly on mobile phones, IoT devices, or other embedded systems.
  • Resource-Constrained Environments: Suitable for scenarios where GPU memory or computational power is limited.
  • Cost-Effective Inference: Reduces the operational costs associated with running large language models.
  • General Language Tasks: Capable of handling a wide range of natural language understanding and generation tasks efficiently.