open-machine/Qwen3-1.7B-FlashNorm

Text Generation | Concurrency Cost: 1 | Model Size: 2B | Quant: BF16 | Ctx Length: 32k | Tool Calling: Supported | Published: Apr 22, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights

open-machine/Qwen3-1.7B-FlashNorm is a 2-billion-parameter causal language model: a FlashNorm-prepared checkpoint of Qwen/Qwen3-1.7B, developed by OpenMachine. FlashNorm folds the normalization weights into the linear layers that follow them, which yields a mathematically equivalent model with a simpler normalization step at inference time. It is intended for developers who want Qwen3-1.7B behavior with that optimization already applied to the weights.


Overview

open-machine/Qwen3-1.7B-FlashNorm is the Qwen/Qwen3-1.7B checkpoint (listed as 2 billion parameters) converted with the FlashNorm optimization. Developed by OpenMachine, it implements the technique described in the paper "FlashNorm: Fast Normalization for Transformers" by Graef, Clapp, and Wasielewski.

What is FlashNorm?

FlashNorm is an exact reformulation of the RMSNorm -> Linear operation. It works by:

  • Folding the per-channel normalization weight g into the subsequent linear layer's weight matrix (W_star = W @ diag(g)). This is a one-time computation during checkpoint conversion.
  • After folding, the RMSNorm layer no longer has learnable per-channel scales; it simply divides by the root mean square of the input (rms(x)).

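Because the per-channel scale is a linear operation, absorbing it into the next weight matrix does not change the function the network computes. The checkpoint is therefore mathematically equivalent to the source Qwen/Qwen3-1.7B, with outputs expected to match up to floating-point rounding, while the simplified normalization step can improve inference performance.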
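The folding step above can be checked numerically on a toy example. The following is a minimal sketch in plain Python, not the conversion script used for this checkpoint: the function names, the tiny dimensions, and the omission of the usual RMSNorm epsilon are all assumptions made for illustration. It assumes the convention y = W @ x with W of shape (out, in), so that W @ diag(g) scales column j of W by g[j].

```python
import math
import random


def rms(x):
    # Root mean square of the input vector (epsilon omitted for this sketch).
    return math.sqrt(sum(v * v for v in x) / len(x))


def matvec(W, x):
    # Plain matrix-vector product; W is a list of rows.
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rmsnorm_linear(x, g, W):
    # Original path: normalize, apply per-channel scale g, then the linear layer.
    r = rms(x)
    return matvec(W, [gi * xi / r for gi, xi in zip(g, x)])


def fold_norm_weight(W, g):
    # One-time conversion: W_star = W @ diag(g), i.e. column j scaled by g[j].
    return [[w * gj for w, gj in zip(row, g)] for row in W]


def flashnorm_linear(x, W_star):
    # Folded path: the norm only divides by rms(x); no per-channel scale remains.
    r = rms(x)
    return matvec(W_star, [xi / r for xi in x])


random.seed(0)
n_in, n_out = 8, 4  # toy sizes, chosen arbitrarily
x = [random.gauss(0, 1) for _ in range(n_in)]
g = [random.uniform(0.5, 1.5) for _ in range(n_in)]
W = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]

W_star = fold_norm_weight(W, g)
y_ref = rmsnorm_linear(x, g, W)
y_fold = flashnorm_linear(x, W_star)

# The two paths should agree up to floating-point rounding.
max_abs_diff = max(abs(a - b) for a, b in zip(y_ref, y_fold))
print(max_abs_diff)
```

The two outputs differ only by rounding error, which is the sense in which the folded checkpoint is equivalent to the original.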

Usage and Compatibility

  • HuggingFace Transformers: The model loads correctly with HuggingFace Transformers. An expected warning about missing norm weights will appear, as Transformers defaults these to ones, which is the correct behavior for a FlashNorm checkpoint.
  • vLLM: Currently, vLLM does not support loading this checkpoint due to the absence of norm weight tensors. Support is being tracked upstream.
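Loading with HuggingFace Transformers, as described above, might look like the sketch below. It uses the standard `AutoTokenizer` and `AutoModelForCausalLM` `from_pretrained` calls. The helper `non_unit_norm_weights` is hypothetical and not part of any library: it assumes that norm parameters can be recognized by the substring "norm" in their names, and simply reports which of them are not all ones after loading.

```python
MODEL_ID = "open-machine/Qwen3-1.7B-FlashNorm"


def load_flashnorm_model(model_id=MODEL_ID):
    """Load tokenizer and model; a warning about missing norm weights is expected."""
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return tokenizer, model


def non_unit_norm_weights(model):
    """Hypothetical helper: names of norm parameters that are not all ones.

    Assumes norm parameters contain "norm" in their parameter name.
    """
    return [
        name
        for name, param in model.named_parameters()
        if "norm" in name and not bool((param == 1).all())
    ]


# Usage (downloads the checkpoint, so it is left commented out):
# tokenizer, model = load_flashnorm_model()
# print(non_unit_norm_weights(model))
```

Any names the helper returns are norm scales that were present in the checkpoint rather than defaulted to ones. The source does not say which norm layers were folded, so a non-empty result is not by itself a sign of a problem.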

Key Differentiator

  • Optimized Normalization: The primary advantage of this model is its FlashNorm integration, which aims to speed up inference by simplifying the normalization layers without changing the function the model computes, and therefore without changing its output quality.