TT0518/qwen25-3b-1.58bit-qat

Text Generation · Concurrency Cost: 1 · Model Size: 3.1B · Quant: BF16 · Ctx Length: 32k · Published: Apr 27, 2026 · License: other · Architecture: Transformer

The TT0518/qwen25-3b-1.58bit-qat model is a 3.1-billion-parameter language model based on Qwen/Qwen2.5-3B, optimized with 1.58-bit Quantization-Aware Training (QAT). It applies a ternary quantization scheme ({-1, 0, +1}) to all linear layers except lm_head and embed_tokens, and was fine-tuned on a dataset combining WikiText-103 and Wikipedia JA (Japanese Wikipedia), giving it a reduced memory footprint for efficient language processing tasks.


Model Overview

This model, TT0518/qwen25-3b-1.58bit-qat, is a specialized version of the Qwen2.5-3B base model, featuring 1.58-bit Quantization-Aware Training (QAT). This advanced quantization technique significantly reduces the model's memory footprint while aiming to preserve performance.

Key Technical Details

  • Base Model: Qwen/Qwen2.5-3B
  • Quantization Method: 1.58-bit QAT using a ternary scheme ({-1, 0, +1}); see the sketch after this list
  • Quantization Scope: Applied to all Linear layers, excluding lm_head and embed_tokens
  • Training Data: Fine-tuned on a mix of WikiText-103 (70%) and Wikipedia JA (30%)
  • Training Process: Involved 50,000 chunks of 512 tokens each, with a two-stage fine-tuning approach.
  • Final Perplexity (PPL): Achieved 43.92, indicating its language modeling capability post-quantization.
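
As a rough illustration of the quantization scheme described above, the sketch below shows ternary weight quantization applied to linear layers during training. The model card does not specify the exact recipe, so the details here (per-tensor absmean scaling, a straight-through estimator, and skipping lm_head by name) are assumptions based on common 1.58-bit QAT practice, not the author's confirmed implementation.

```python
# Minimal sketch of 1.58-bit (ternary) quantization-aware training for Linear layers.
# Assumptions: per-tensor absmean scaling and a straight-through estimator (STE);
# the model card does not document the actual scaling or training recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F


def ternary_quantize(w: torch.Tensor) -> torch.Tensor:
    """Map weights to {-1, 0, +1} * scale, keeping gradients via a straight-through estimator."""
    scale = w.abs().mean().clamp(min=1e-5)          # per-tensor absmean scale (assumed)
    w_q = (w / scale).round().clamp(-1, 1) * scale  # ternary values, rescaled back
    return w + (w_q - w).detach()                   # forward uses w_q, backward flows through w


class TernaryLinear(nn.Linear):
    """Linear layer whose weights are ternarized on the fly during QAT."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, ternary_quantize(self.weight), self.bias)


def ternarize_linear_layers(model: nn.Module) -> None:
    """Swap nn.Linear modules for TernaryLinear, skipping lm_head.

    embed_tokens is an nn.Embedding, so it is excluded automatically.
    """
    targets = [
        (name, module) for name, module in model.named_modules()
        if isinstance(module, nn.Linear)
        and not isinstance(module, TernaryLinear)
        and "lm_head" not in name
    ]
    for name, module in targets:
        q = TernaryLinear(module.in_features, module.out_features,
                          bias=module.bias is not None)
        q.load_state_dict(module.state_dict())
        q.to(device=module.weight.device, dtype=module.weight.dtype)
        parent = model.get_submodule(name.rsplit(".", 1)[0]) if "." in name else model
        setattr(parent, name.rsplit(".", 1)[-1], q)
```

After swapping the layers, fine-tuning proceeds with the usual language-modeling loss; the ternarization happens inside each forward pass, so the full-precision master weights are still the ones being updated.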

File Formats Available

  • Standard Hugging Face *.safetensors (float16), approximately 6 GB; see the loading sketch below.
  • Optimized qwen25_3b_qat_q4km.gguf in GGUF Q4_K_M quantization, around 1.9 GB, suitable for local inference with tools like Ollama.
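
For the safetensors checkpoint, a minimal loading sketch with Hugging Face transformers is shown below; the repo id matches this model card, but the generation settings are illustrative defaults, not values specified by the author. The GGUF file is instead meant for llama.cpp-based runtimes such as Ollama and is not loaded this way.

```python
# Minimal sketch: loading the float16 safetensors checkpoint with transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TT0518/qwen25-3b-1.58bit-qat"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # weights are shipped as float16 safetensors
    device_map="auto",          # requires accelerate; place layers automatically
)

prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```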

Licensing

The model adheres to the Qwen Research License of its base model. Commercial use requires an application to Alibaba Cloud.