MilyaShams/Qwen3-1.7B-SparseGPT_unstruct_0.7_bs128_damp0.1

Text generation · Model size: 2B · Quant: BF16 · Context length: 32k · Published: Apr 8, 2026 · Architecture: Transformer

MilyaShams/Qwen3-1.7B-SparseGPT_unstruct_0.7_bs128_damp0.1 is a compressed variant of the roughly 1.7-billion-parameter Qwen3-1.7B language model, produced with the llmcompressor framework. It retains the base model's 32,768-token context length and applies SparseGPT unstructured sparsity at a 70% ratio. It is intended for deployments where reduced memory footprint and faster inference are critical.


Overview

This model, MilyaShams/Qwen3-1.7B-SparseGPT_unstruct_0.7_bs128_damp0.1, is a compressed version of the Qwen/Qwen3-1.7B base model. It has roughly 1.7 billion parameters and retains the base model's 32,768-token context length. Compression was performed with the llmcompressor framework using the SparseGPT method.
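As a rough sketch of the potential savings, assuming ~1.7B parameters, BF16 weights at 2 bytes each, and ignoring the index overhead that a real sparse storage format would add:

```python
# Back-of-envelope memory math for a ~1.7B-parameter model in BF16.
params = 1.7e9
bytes_per_param = 2  # BF16 uses 2 bytes per weight
dense_gb = params * bytes_per_param / 1e9

sparsity = 0.7
# With 70% of weights zeroed, a compressed format only needs values
# for the remaining ~30% (index overhead ignored in this sketch).
sparse_gb = dense_gb * (1 - sparsity)
print(f"dense ≈ {dense_gb:.1f} GB, sparse values ≈ {sparse_gb:.2f} GB")
```

Note that a checkpoint stored densely (zeros written out as BF16 values) sees none of this saving; it materializes only with a sparsity-aware storage format or runtime.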

Compression Details

  • Base Model: Qwen/Qwen3-1.7B
  • Compression Method: SparseGPT with unstructured sparsity.
  • Sparsity Ratio: 70% (meaning 70% of the model's weights have been pruned).
  • Block Size: 128
  • Dampening Fraction: 0.1
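The parameters above map onto an llmcompressor-style pruning recipe. The following is a hypothetical sketch only: the stage and modifier layout follows llmcompressor's `SparseGPTModifier`, but exact recipe keys vary by version, so consult the framework's documentation before use.

```yaml
# Hypothetical llmcompressor recipe sketch (keys may differ by version)
sparsity_stage:
  sparsity_modifiers:
    SparseGPTModifier:
      sparsity: 0.7          # 70% of weights pruned
      block_size: 128        # solver block size (the "bs128" in the model name)
      dampening_frac: 0.1    # Hessian dampening (the "damp0.1" in the model name)
      mask_structure: "0:0"  # unstructured (no N:M pattern)
```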

Key Characteristics

  • Reduced Size: With 70% of weights pruned to zero, the stored weight footprint drops substantially when a sparsity-aware storage format is used.
  • Efficient Inference: Can run faster than the dense base model on runtimes whose kernels exploit unstructured sparsity, making it suitable for resource-constrained environments.
  • High Context Length: Retains the original 32,768-token context window, allowing long inputs to be processed.
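To make the 70% figure concrete, here is a minimal toy sketch of unstructured pruning with NumPy. It uses simple magnitude ranking for illustration only; SparseGPT itself selects and compensates weights using second-order (Hessian-based) information, which this sketch does not implement.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128))  # toy weight matrix

sparsity = 0.7
k = int(sparsity * W.size)
# Zero out the k smallest-magnitude weights anywhere in the matrix
# ("unstructured": no row/column/block pattern is enforced).
idx = np.argsort(np.abs(W), axis=None)[:k]
W_pruned = W.copy()
W_pruned.flat[idx] = 0.0

zero_frac = float((W_pruned == 0.0).mean())
print(f"fraction of zeroed weights: {zero_frac:.3f}")
```

Because any individual weight may be zeroed, unstructured sparsity preserves accuracy better than structured patterns at the same ratio, but it is also harder for standard dense kernels to accelerate.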

Potential Use Cases

This model is particularly well-suited for applications requiring:

  • Edge device deployment: Where computational resources and memory are limited.
  • High-throughput inference: For scenarios demanding quick responses from the model.
  • Cost-effective solutions: Reducing operational costs associated with larger, dense models.

It offers a balance between output quality and efficiency, making it a strong candidate for NLP tasks where the full capacity of a dense model is unnecessary.