arnavgrg/mistral-7b-instruct-nf4-fp16-upscaled

Text generation · Concurrency cost: 1 · Model size: 7B · Quant: FP8 · Context length: 8k · License: apache-2.0 · Architecture: Transformer · Open weights · Cold

arnavgrg/mistral-7b-instruct-nf4-fp16-upscaled is an fp16 variant of the Mistral-7B-Instruct-v0.1 base model, created by loading the base model with nf4 4-bit quantization and then upscaling the quantized linear4bit layers back to fp16. The goal is to avoid the quantization/dequantization cost that a dynamically quantized nf4 model pays at inference time. It is designed for users who want a Mistral-7B-Instruct variant with potentially faster inference, at the cost of some loss in weight fidelity introduced by the lossy nf4 quantization.


Overview

This model, arnavgrg/mistral-7b-instruct-nf4-fp16-upscaled, is a specialized variant of the Mistral-7B-Instruct-v0.1 base model. It originates from a version of Mistral-7B-Instruct-v0.1 that was initially quantized using nf4 4-bit quantization via bitsandbytes.

Key Characteristics

The primary modification in this model involves upscaling the linear4bit layers to fp16. This process is intended to mitigate the computational overhead associated with quantization and dequantization during each forward pass at inference time. By converting these layers to fp16, the model aims to offer potentially faster inference speeds compared to a dynamically quantized nf4 model.

Important Considerations

It is crucial to note that the nf4 quantization operation is inherently lossy. Consequently, the model weights for the linear layers in this fp16 upscaled variant retain this lossy characteristic. This means that while it may offer speed advantages, its performance will not be equivalent to the official, unquantized base model. Users should be aware of this trade-off between inference speed and potential accuracy degradation.
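The toy sketch below illustrates this lossiness by round-tripping random weights through a simplified 4-bit blockwise quantizer and measuring the reconstruction error. Note that the uniform 16-level codebook here is an illustrative stand-in: real nf4 uses a codebook derived from the normal distribution, but the round-trip is lossy in the same way.

```python
import torch

def fake_4bit_roundtrip(w: torch.Tensor, blocksize: int = 64) -> torch.Tensor:
    # Illustrative stand-in for nf4: nearest-neighbour quantization to 16
    # uniform levels per absmax-scaled block.
    levels = torch.linspace(-1.0, 1.0, 16)
    flat = w.flatten()
    pad = (-flat.numel()) % blocksize
    flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.view(-1, blocksize)
    absmax = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    scaled = blocks / absmax                                     # each block in [-1, 1]
    idx = (scaled.unsqueeze(-1) - levels).abs().argmin(dim=-1)   # 4-bit codes
    deq = levels[idx] * absmax                                   # dequantize to float
    return deq.flatten()[: w.numel()].view_as(w)

torch.manual_seed(0)
w = torch.randn(256, 256)
w_hat = fake_4bit_roundtrip(w)
err = (w - w_hat).abs().mean().item()
print(f"mean |w - dequant(quant(w))| = {err:.4f}")  # nonzero: the round trip is lossy
```

Upscaling the dequantized weights to fp16 freezes this error into the stored weights: the speed of fp16 matmuls is gained, but the original fp16 values cannot be recovered.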

Usage

This model can be loaded and utilized directly with the transformers library, specifying torch.float16 as the torch_dtype for efficient loading and inference.
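A minimal loading sketch along those lines (untested here, since it downloads roughly 14 GB of fp16 weights; the `device_map="auto"` setting and the helper function names are assumptions for illustration):

```python
MODEL_ID = "arnavgrg/mistral-7b-instruct-nf4-fp16-upscaled"

def load_model(model_id: str = MODEL_ID):
    # Lazy imports keep this sketch importable without the libraries installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # weights are stored upscaled to fp16
        device_map="auto",          # assumes accelerate is installed
    )
    return tokenizer, model

def generate(tokenizer, model, prompt: str, max_new_tokens: int = 64) -> str:
    # Mistral-Instruct expects [INST] ... [/INST] chat markers.
    inputs = tokenizer(f"[INST] {prompt} [/INST]", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```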

Popular Sampler Settings

Featherless users most commonly tune the following sampler parameters for this model: temperature, top_p, top_k, frequency_penalty, presence_penalty, repetition_penalty, and min_p.