Name: open-machine/Qwen3-1.7B-FlashNorm API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: open-machine

Overview

This model, open-machine/Qwen3-1.7B-FlashNorm, is a variant of Qwen/Qwen3-1.7B (listed at roughly 2 billion parameters) prepared with the FlashNorm optimization. Developed by OpenMachine, this checkpoint implements the technique described in the paper "FlashNorm: Fast Normalization for Transformers" by Graef, Clapp, and Wasielewski.

What is FlashNorm?

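FlashNorm is an exact reformulation of the RMSNorm -> Linear operation. It works in two steps:

Fold the per-channel normalization weight g into the subsequent linear layer's weight matrix: W_star = W @ diag(g), i.e. column j of W is multiplied by g[j]. This is a one-time computation during checkpoint conversion.
Drop the scale from the norm: after folding, the RMSNorm layer has no learnable per-channel weights and simply divides its input by its root mean square, rms(x).

The two forms agree because W @ (g * x / rms(x)) = (W @ diag(g)) @ x / rms(x) = W_star @ x / rms(x), where * denotes element-wise multiplication.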
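
A minimal sketch of the one-time folding step, written in plain Python so it runs without extra libraries. The function name fold_norm_weight is hypothetical and not taken from any released conversion script; W is stored as a list of rows (out_features rows, each with in_features entries), matching the W @ diag(g) form.

```python
def fold_norm_weight(W, g):
    """Return W_star = W @ diag(g): column j of W is scaled by g[j]."""
    if any(len(row) != len(g) for row in W):
        raise ValueError("each row of W must have len(g) entries")
    return [[w * g_j for w, g_j in zip(row, g)] for row in W]


# Toy example: W has out_features=2 and in_features=3; g is the norm weight of length 3.
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
g = [0.5, 2.0, -1.0]

W_star = fold_norm_weight(W, g)
print(W_star)  # [[0.5, 4.0, -3.0], [2.0, 10.0, -6.0]]
```

After this step the checkpoint stores W_star in place of W and no longer needs g.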

The converted model is mathematically equivalent to the source Qwen/Qwen3-1.7B: it computes the same outputs, up to floating-point rounding, while each normalization skips the per-channel multiplication, which can make inference somewhat faster.
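
The sketch below checks that equivalence numerically on random toy data. It compares the original RMSNorm -> Linear path, the FlashNorm path with folded weights, and a variant that divides by rms(x) after the matmul, which is valid because rms(x) is a single scalar per input vector. All function names are illustrative, and EPS = 1e-6 is an assumed epsilon; the real value comes from rms_norm_eps in the model config.

```python
import math
import random

EPS = 1e-6  # assumed epsilon; read rms_norm_eps from the model config for the real value


def rms(x):
    # Root mean square of one token vector: a single scalar
    return math.sqrt(sum(v * v for v in x) / len(x) + EPS)


def matvec(W, x):
    # W is a list of rows (out_features x in_features)
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def original_block(x, g, W):
    # RMSNorm with learnable per-channel scale g, then Linear with weight W
    r = rms(x)
    return matvec(W, [g_j * v / r for g_j, v in zip(g, x)])


def flashnorm_block(x, W_star):
    # Weightless normalization, then Linear with the folded weight W_star
    r = rms(x)
    return matvec(W_star, [v / r for v in x])


def flashnorm_block_deferred(x, W_star):
    # Same result: the scalar division is moved after the matmul
    r = rms(x)
    return [y / r for y in matvec(W_star, x)]


random.seed(0)
d_in, d_out = 8, 4
x = [random.gauss(0, 1) for _ in range(d_in)]
g = [random.gauss(1, 0.1) for _ in range(d_in)]
W = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
W_star = [[w * g_j for w, g_j in zip(row, g)] for row in W]  # W @ diag(g)

ref = original_block(x, g, W)
for out in (flashnorm_block(x, W_star), flashnorm_block_deferred(x, W_star)):
    max_diff = max(abs(a - b) for a, b in zip(ref, out))
    print(f"max |diff| = {max_diff:.2e}")
```

In float64 the differences stay at rounding level. In reduced precision such as bf16, the stored folded weights round differently from the original weight-times-scale product, so close agreement rather than a bit-exact match is the realistic expectation.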

Usage and Compatibility

HuggingFace Transformers: The model loads correctly. Transformers will warn that the norm weights are missing from the checkpoint; this is expected. It initializes those weights to ones, which is exactly the weightless normalization a FlashNorm checkpoint requires.
vLLM: vLLM currently cannot load this checkpoint because the norm weight tensors are absent. Support is being tracked upstream.

Key Differentiator

Optimized Normalization: The main advantage of this model is its FlashNorm conversion, which aims to speed up inference by removing the per-channel scales from the normalization layers without changing the model's mathematical output or accuracy.
