featherless-ai/QRWKV-QwQ-32B

Public · 32B parameters · FP8 · 32,768-token context length · License: apache-2.0
Overview

QRWKV-QwQ-32B: Efficient Linear Attention Language Model

QRWKV-QwQ-32B is a 32-billion-parameter language model developed by featherless-ai, built on the Qwen 2.5 QwQ 32B architecture. It replaces standard attention with an RWKV-based linear attention mechanism, which sharply reduces computational cost and improves inference efficiency, especially at extended context lengths of up to 32,768 tokens. The conversion from Qwen 2.5 QwQ 32B to an RWKV variant was performed without full pre-training or retraining from scratch, demonstrating an efficient method for integrating linear attention into an existing model.
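
A minimal sketch of loading the checkpoint with the Hugging Face transformers library is shown below. It assumes the repository ships custom RWKV-style modeling code (hence `trust_remote_code=True`) and that `accelerate` is installed for `device_map="auto"`; the exact loading options for this model may differ.

```python
# Hedged example: loading and prompting the model with transformers.
# Assumes custom modeling code in the repo and accelerate for device_map.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "featherless-ai/QRWKV-QwQ-32B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the precision the checkpoint was published in
    device_map="auto",    # shard across available GPUs
    trust_remote_code=True,
)

prompt = "Explain linear attention in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```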

Key Capabilities & Performance

The model inherits its core knowledge and training data from its Qwen parent, including support for roughly 30 languages. Benchmarks indicate competitive performance against the base model, Qwen/QwQ-32B, and larger models such as Qwen2.5-72B-Instruct across a range of tasks (a reproduction sketch follows the list):

  • arc_challenge: Achieves 0.5640 acc_norm, outperforming Qwen/QwQ-32B.
  • winogrande: Scores 0.7324 acc, surpassing Qwen/QwQ-32B.
  • sciq: Matches Qwen/QwQ-32B with 0.9630 acc.
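
One way to reproduce numbers like these is EleutherAI's lm-evaluation-harness. The sketch below uses its Python API; task names, harness versions, and model arguments may vary, so treat it as illustrative rather than the exact evaluation setup used for the figures above.

```python
# Hedged sketch: evaluating the model on the listed tasks with lm-eval.
# Exact harness version and arguments used for the published scores are unknown.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=featherless-ai/QRWKV-QwQ-32B,"
        "trust_remote_code=True,dtype=auto"
    ),
    tasks=["arc_challenge", "winogrande", "sciq"],
    batch_size=8,
)

# Print per-task metrics (e.g. acc, acc_norm).
for task, metrics in results["results"].items():
    print(task, metrics)
```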

Unique Approach

The model's core innovation is its linear attention mechanism, which enables more than a 1000x reduction in inference cost and allows efficient RWKV linear attention to be tested and validated on a comparatively small budget, making advanced language models practical for a wider range of applications. Further details on the underlying research can be found in the paper RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale.
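
To make the cost argument concrete, the sketch below shows a generic linear-attention recurrence: the model keeps a fixed-size state instead of a key-value cache that grows with context length, so per-token decoding cost stays constant. This is a conceptual illustration with a simple ReLU feature map, not the actual RWKV kernel used in QRWKV-QwQ-32B.

```python
# Conceptual sketch (not this model's RWKV kernel): linear attention as a
# recurrence over a fixed-size state, so decoding cost does not grow with
# the number of tokens already processed.
import numpy as np

d = 64  # head dimension

state = np.zeros((d, d))   # accumulated outer products phi(k) v^T
norm = np.zeros(d)         # accumulated phi(k) for normalization

def phi(x):
    """Simple positive feature map (ReLU), standing in for the real kernel."""
    return np.maximum(x, 0.0)

def step(q, k, v):
    """One decoding step: O(d^2) work, independent of context length."""
    global state, norm
    state += np.outer(phi(k), v)
    norm += phi(k)
    return phi(q) @ state / (phi(q) @ norm + 1e-8)

rng = np.random.default_rng(0)
for _ in range(1000):              # stream a long sequence token by token
    q, k, v = rng.standard_normal((3, d))
    y = step(q, k, v)
print(y.shape)                     # (64,) -- the state never grew
```

In standard softmax attention the same loop would have to attend over all previously cached keys and values, so the per-token cost and memory grow with sequence length; the fixed-size state above is what makes long-context inference so much cheaper.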