arnavgrg/mistral-7b-instruct-nf4-fp16-upscaled
arnavgrg/mistral-7b-instruct-nf4-fp16-upscaled is an fp16 variant of the Mistral-7B-Instruct-v0.1 base model, derived from a copy that was originally loaded with nf4 4-bit quantization. It reduces inference-time quantization/dequantization costs by upscaling the Linear4bit layers to fp16. It is intended for users who want a Mistral-7B-Instruct variant with potentially faster inference due to reduced quantization overhead, at the cost of some loss in weight fidelity.
Overview
This model, arnavgrg/mistral-7b-instruct-nf4-fp16-upscaled, is a specialized variant of the Mistral-7B-Instruct-v0.1 base model. It originates from a version of Mistral-7B-Instruct-v0.1 that was initially quantized using nf4 4-bit quantization via bitsandbytes.
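For context, loading the base model in nf4 with bitsandbytes typically looks like the sketch below. The exact configuration used to produce this checkpoint is not documented here, so treat the arguments (compute dtype, device map) as illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative nf4 load of the base model via bitsandbytes;
# the precise settings behind this checkpoint are an assumption.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
```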
Key Characteristics
The primary modification in this model is the upscaling of its Linear4bit layers to fp16. This removes the quantization/dequantization work that would otherwise run in every forward pass at inference time, so the model may offer faster inference than a dynamically quantized nf4 model.
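A minimal sketch of what such an upscaling step might look like, assuming bitsandbytes' Linear4bit modules and its dequantize_4bit helper (this is not the author's published conversion script, just one plausible way to perform the dequantization):

```python
import torch
import torch.nn as nn
import bitsandbytes as bnb
import bitsandbytes.functional as F

def upscale_linear4bit_to_fp16(model: nn.Module) -> nn.Module:
    """Replace every Linear4bit module with an fp16 nn.Linear
    holding the dequantized (lossy) weights. Hypothetical helper."""
    for name, module in model.named_children():
        if isinstance(module, bnb.nn.Linear4bit):
            # Dequantize the packed nf4 weights back to fp16.
            weight = F.dequantize_4bit(
                module.weight.data, module.weight.quant_state
            ).to(torch.float16)
            new_linear = nn.Linear(
                module.in_features, module.out_features,
                bias=module.bias is not None,
            )
            new_linear.weight = nn.Parameter(weight, requires_grad=False)
            if module.bias is not None:
                new_linear.bias = nn.Parameter(
                    module.bias.data.to(torch.float16), requires_grad=False
                )
            setattr(model, name, new_linear)
        else:
            # Recurse into child modules to reach nested layers.
            upscale_linear4bit_to_fp16(module)
    return model
```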
Important Considerations
It is crucial to note that nf4 quantization is inherently lossy. Consequently, the weights of the linear layers in this fp16-upscaled variant retain that loss: while the model may offer speed advantages, its output quality will not be equivalent to the official, unquantized base model. Users should weigh this trade-off between inference speed and potential accuracy degradation.
Usage
This model can be loaded and utilized directly with the transformers library, specifying torch.float16 as the torch_dtype for efficient loading and inference.
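For example, a typical load of this checkpoint might look like the following (the model id and torch_dtype come from this card; the chat-template prompt and generation settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "arnavgrg/mistral-7b-instruct-nf4-fp16-upscaled"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # load the upscaled fp16 weights directly
    device_map="auto",
)

# Mistral-Instruct models use a chat template; apply it when prompting.
messages = [{"role": "user", "content": "What is nf4 quantization?"}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```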