Name: open-machine/Llama-3.1-8B-FlashNorm API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: open-machine

Overview

open-machine/Llama-3.1-8B-FlashNorm is a variant of Meta's Llama-3.1-8B model, developed by OpenMachine-ai. This 8 billion parameter model applies the FlashNorm technique, which reformulates each RMSNorm -> Linear pair by folding the per-channel normalization weights into the weights of the following linear layer. As detailed in the FlashNorm paper, the resulting model is mathematically equivalent to the original Llama-3.1-8B, while the RMSNorm layer no longer has to apply a per-channel scale at runtime, which may allow faster inference.
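
The folding step can be written out in a few lines. The following is a sketch rather than the paper's exact statement: the symbols x (input vector of length n), g (per-channel norm weights), W (linear layer weights), W* (folded weights), and the placement of the epsilon term are notation assumed here for illustration.

```latex
\[
\operatorname{RMS}(x) = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n} x_i^2 + \epsilon},
\qquad
\operatorname{RMSNorm}(x) = g \odot \frac{x}{\operatorname{RMS}(x)}
\]
\[
W \,\operatorname{RMSNorm}(x)
  = W \operatorname{diag}(g)\, \frac{x}{\operatorname{RMS}(x)}
  = W^{*}\, \frac{x}{\operatorname{RMS}(x)},
\qquad
W^{*} = W \operatorname{diag}(g), \quad W^{*}_{ij} = W_{ij}\, g_j
\]
```

Because W* can be computed once ahead of time, the runtime normalization reduces to dividing by RMS(x).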

Key Characteristics

FlashNorm Optimization: The primary differentiator is the integration of FlashNorm, which folds RMSNorm weights into linear layers, removing the per-channel scale from the RMSNorm layer during runtime.
Mathematical Equivalence: Despite the structural changes, the model is mathematically equivalent to the original Llama-3.1-8B, as proven by Proposition 1 of the FlashNorm paper, so its outputs match the original up to floating-point rounding.
HuggingFace Transformers Support: Fully compatible with HuggingFace Transformers, though it will issue a warning about missing norm weights, which is expected and handled correctly by defaulting them to ones.
vLLM Compatibility: Currently, vLLM does not natively support this checkpoint due to the absence of norm weight tensors, but a patch is being tracked.
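
The folding and the equivalence claim above can be checked numerically on a toy example. The sketch below is pure Python and is not taken from the FlashNorm code or from any framework: the function names, the tiny layer sizes, and the epsilon value are all assumptions made for illustration. It compares the original path (RMSNorm with weights g, then Linear) against the folded path (RMSNorm with weights of ones, then Linear with folded weights), which mirrors how a loader that defaults missing norm weights to ones would behave.

```python
import math
import random

EPS = 1e-5  # assumed epsilon; a real model reads this from its config


def rms_normalize(x):
    """Scale x to unit root-mean-square, with no per-channel weights."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + EPS)
    return [v / rms for v in x]


def rmsnorm(x, g):
    """Standard RMSNorm: normalize, then apply per-channel weights g."""
    return [gi * v for gi, v in zip(g, rms_normalize(x))]


def linear(W, x):
    """Compute y = W x, with W stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def fold_norm_weights(W, g):
    """Fold g into W so that W_folded[i][j] = W[i][j] * g[j]."""
    return [[w * gj for w, gj in zip(row, g)] for row in W]


random.seed(0)
n_in, n_out = 8, 4  # toy sizes, far smaller than a real layer
x = [random.gauss(0, 1) for _ in range(n_in)]
g = [random.uniform(0.5, 1.5) for _ in range(n_in)]
W = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]

# Original path: RMSNorm with learned weights, then Linear.
y_original = linear(W, rmsnorm(x, g))

# Folded path: norm weights are all ones, and g lives inside the linear layer.
W_folded = fold_norm_weights(W, g)
ones = [1.0] * n_in
y_folded = linear(W_folded, rmsnorm(x, ones))

max_abs_diff = max(abs(a - b) for a, b in zip(y_original, y_folded))
print(f"max abs difference: {max_abs_diff:.3e}")
```

The two paths agree only up to floating-point rounding, since the multiplications happen in a different order.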

Usage Considerations

This model suits developers who want the Llama-3.1-8B architecture together with the potential inference optimizations offered by FlashNorm. It is particularly useful for applications where maximizing throughput and minimizing latency are critical, provided the chosen inference framework can load a checkpoint without norm weight tensors.
