open-machine/Qwen3-1.7B-FlashNorm
open-machine/Qwen3-1.7B-FlashNorm is a 2-billion-parameter causal language model: a FlashNorm-prepared checkpoint of Qwen/Qwen3-1.7B, developed by OpenMachine. FlashNorm folds the normalization weights into the linear layers that follow them, which yields a mathematically equivalent model with a simpler normalization step and potentially faster inference. It is intended for developers who want a Qwen3-1.7B variant with this optimization already applied.
Overview
This model, open-machine/Qwen3-1.7B-FlashNorm, is a 2 billion parameter variant of the Qwen3-1.7B model, specifically prepared with the FlashNorm optimization. Developed by OpenMachine, this checkpoint implements the techniques described in the paper "FlashNorm: Fast Normalization for Transformers" by Graef, Clapp, and Wasielewski.
What is FlashNorm?
FlashNorm is an exact reformulation of the RMSNorm -> Linear operation. It works by:
- Folding the per-channel normalization weight g into the subsequent linear layer's weight matrix (W_star = W @ diag(g)). This is a one-time computation during checkpoint conversion.
- After folding, the RMSNorm layer no longer has learnable per-channel scales; it simply divides by the root mean square of the input (rms(x)).
This optimization results in a model that computes the same output as the original but can offer improved inference performance due to the simplified normalization step. The model is mathematically equivalent to the source Qwen/Qwen3-1.7B.
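The folding step described above can be checked numerically. The following is a minimal NumPy sketch, not the conversion script used for this checkpoint. It assumes the linear weight is stored in the (out_features, in_features) layout, so that W @ diag(g) scales each input column of W by the matching entry of g. The helper names `rms_norm` and `fold_norm_weight`, the `eps` value, and the toy shapes are all illustrative choices.

```python
import numpy as np


def rms_norm(x, g, eps=1e-6):
    """Standard RMSNorm: divide by the root mean square, then scale per channel."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * g


def fold_norm_weight(W, g):
    """Fold g into W, i.e. W_star = W @ diag(g).

    W has shape (out_features, in_features) (an assumed layout), so
    broadcasting g over the last axis scales column j of W by g[j].
    """
    return W * g


rng = np.random.default_rng(0)
d_in, d_out = 8, 4
x = rng.standard_normal((3, d_in))      # a small batch of activations
g = rng.standard_normal(d_in)           # per-channel norm weight
W = rng.standard_normal((d_out, d_in))  # weight of the following linear layer

# Original path: RMSNorm with learnable scale, then the linear layer.
y_ref = rms_norm(x, g) @ W.T

# FlashNorm path: scale-free RMSNorm (weights of ones), then the folded layer.
W_star = fold_norm_weight(W, g)
y_fold = rms_norm(x, np.ones(d_in)) @ W_star.T

max_abs_diff = np.max(np.abs(y_ref - y_fold))
print("max abs difference:", max_abs_diff)
```

The two paths are algebraically identical. In floating point they agree only up to rounding error, so the printed difference is expected to be tiny but not necessarily exactly zero.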
Usage and Compatibility
- HuggingFace Transformers: The model loads correctly with HuggingFace Transformers. An expected warning about missing norm weights will appear, as Transformers defaults these to ones, which is the correct behavior for a FlashNorm checkpoint.
- vLLM: Currently, vLLM does not support loading this checkpoint due to the absence of norm weight tensors. Support is being tracked upstream.
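For the Transformers path, loading should look like loading any other causal language model. The sketch below is a hedged example rather than an officially documented recipe. It uses only the standard `AutoTokenizer` and `AutoModelForCausalLM` entry points and assumes a Transformers release recent enough to include the Qwen3 architecture. The function name `generate_text`, the prompt, and the generation length are illustrative.

```python
MODEL_ID = "open-machine/Qwen3-1.7B-FlashNorm"  # repository name taken from this page


def generate_text(prompt: str, max_new_tokens: int = 32) -> str:
    """Load the checkpoint with Transformers and generate a short continuation."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # A warning about missing norm weights is expected at this point; the norm
    # weights are initialized to ones, which is what a FlashNorm checkpoint needs.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("FlashNorm is")` downloads the weights on first use, so it needs network access and enough memory for the model.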
Key Differentiator
- Optimized Normalization: The primary advantage of this model is its FlashNorm integration, which aims to speed up inference by simplifying the normalization layers without changing the model's mathematical output, and therefore without affecting its accuracy.