QRWKV6-32B-Instruct-Preview-v0.1: A Large-Scale RWKV Model
Overview
recursal/QRWKV6-32B-Instruct-Preview-v0.1 is a 32 billion parameter instruction-tuned model developed by Recursal, showcasing a significant advancement in the RWKV (Receptance Weighted Key Value) architecture. This model is notable for its unique approach: it's derived from a QKV Attention-based model (specifically Qwen2.5-32B-Instruct) through a conversion process, rather than being trained from scratch. This method allows for rapid validation of the efficient RWKV linear attention mechanism at a larger scale.
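The linear-attention idea behind RWKV can be pictured as a recurrence: instead of attending over a key/value cache that grows with the sequence, each token reads from and updates a fixed-size state. Below is a minimal, heavily simplified sketch in NumPy. It is illustrative only and is not the actual RWKV-6 formulation, which adds token-shift, data-dependent decay, receptance gating, and multi-head structure; the fixed `decay` scalar and dimensions here are assumptions for demonstration.

```python
import numpy as np

def linear_attention(qs, ks, vs, decay=0.99):
    """Toy RWKV-style linear attention: a fixed-size state matrix
    replaces the growing key/value cache of softmax attention.
    Illustrative only -- not the real RWKV-6 recurrence."""
    d = qs.shape[1]
    state = np.zeros((d, d))  # constant O(d^2) memory, independent of sequence length
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        outputs.append(q @ state)               # read from the recurrent state
        state = decay * state + np.outer(k, v)  # fold the new token into the state
    return np.stack(outputs)

T, d = 1024, 64
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
out = linear_attention(q, k, v)
print(out.shape)  # (1024, 64)
```

Because the state has a fixed size, per-token compute and memory stay flat as the context grows, which is the property the conversion from Qwen2.5-32B-Instruct is meant to validate at scale.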
Key Capabilities and Differentiators
- Computational Efficiency: RWKV linear models are designed to drastically reduce computational costs, particularly at long context lengths, offering over 1000x improvement in inference cost efficiency compared to traditional transformer architectures.
- Performance: Evaluation benchmarks indicate that QRWKV6-32B-Instruct performs on par with or surpasses its base model, Qwen2.5-32B-Instruct, on several tasks such as arc_challenge, piqa, and sciq, while trailing slightly on MMLU and HellaSwag.
- Architectural Innovation: It demonstrates that the RWKV architecture scales, showing that QKV attention is not a prerequisite for high-performing LLMs.
- Inherited Knowledge: The model inherits the knowledge and dataset training of its "parent" Qwen model, supporting approximately 30 languages.
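The efficiency claim above can be made concrete with back-of-envelope arithmetic: in softmax attention, token t attends to all t preceding tokens, so per-token cost grows with position, while a linear recurrence does a fixed amount of state work per token. This is a rough scaling sketch, not the benchmark methodology behind the 1000x figure; the head dimension used here is an illustrative assumption, not the model's actual configuration.

```python
def per_token_attention_flops(t, d):
    # Softmax attention: token t scores against all t prior tokens -> O(t*d)
    return 2 * t * d

def per_token_linear_flops(d):
    # Linear-attention recurrence: fixed state read/update -> O(d^2), independent of t
    return 2 * d * d

d = 128  # toy head dimension (assumption, not the real model config)
for t in (1_000, 100_000, 1_000_000):
    ratio = per_token_attention_flops(t, d) / per_token_linear_flops(d)
    print(f"t={t:>9,}: attention/linear per-token cost ratio = {ratio:,.0f}x")
```

Under this crude model the ratio is simply t/d, so the advantage keeps widening with context length, which is why the gap becomes dramatic at very long contexts.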
Limitations and Future Directions
- Context Length: Due to compute constraints, the current model was trained up to a 16K token context length, though it is stable beyond this limit.
- Inference Code: Because the model lacks RWKV-style channel-mix and feed-forward layers, standard RWKV inference code does not apply, and this model requires separate inference code.
Recursal plans to release Q-RWKV-7 32B and LLaMA-RWKV-7 70B, together with the detailed conversion methodology and a paper, once RWKV-7 is finalized.