arnavgrg/mistral-7b-nf4-fp16-upscaled

Hugging Face
Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Context Length: 8K · License: apache-2.0 · Architecture: Transformer · Open Weights · Warm

The arnavgrg/mistral-7b-nf4-fp16-upscaled model is an FP16 variant of the Mistral-7B base model, upscaled after initial NF4 4-bit quantization. The approach reduces inference-time quantization/dequantization costs by converting the bitsandbytes Linear4bit layers to FP16. Because the initial quantization is lossy, the upscaled weights do not exactly match the originals; the model targets efficient deployment where FP16 throughput matters more than bit-exact fidelity. It suits users seeking a Mistral-7B derivative optimized for faster inference at the cost of a slight loss in fidelity from the initial quantization.


arnavgrg/mistral-7b-nf4-fp16-upscaled: Optimized Mistral-7B Variant

This model is an FP16 (16-bit floating-point) version of the original Mistral-7B base model, developed by arnavgrg. It was produced by first loading the base model with NF4 4-bit quantization via bitsandbytes, then upscaling (dequantizing) its Linear4bit layers to FP16.

Key Characteristics

  • Upscaled FP16 Variant: The defining feature is the upscaling of the bitsandbytes Linear4bit layers to FP16 after initial NF4 quantization (see the sketch after this list).
  • Inference Cost Reduction: The upscaling removes the quantization/dequantization work that 4-bit layers would otherwise perform on every forward pass at inference time.
  • Lossy Quantization: The initial NF4 quantization is lossy, so the linear-layer weights are not perfectly preserved. Consequently, this variant may not perform identically to the official, unquantized Mistral-7B base model.
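
Roughly, this upscaling can be reproduced by loading the base model in NF4 and dequantizing each Linear4bit layer back into a dense FP16 linear layer. The sketch below is illustrative, not the author's exact script; it assumes the bitsandbytes Linear4bit and dequantize_4bit APIs and the mistralai/Mistral-7B-v0.1 base checkpoint:

import torch
import torch.nn as nn
import bitsandbytes.functional as bnb_F
from bitsandbytes.nn import Linear4bit
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model with NF4 4-bit quantization (assumed to match the author's setup).
model = AutoModelForCausalLM.from_pretrained(
  "mistralai/Mistral-7B-v0.1",
  quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"),
)

def upscale_linear4bit(module: nn.Module) -> None:
    # Recursively replace every Linear4bit layer with a dense FP16 nn.Linear.
    for name, child in module.named_children():
        if isinstance(child, Linear4bit):
            # Dequantize the packed NF4 weight back to FP16 (a lossy round-trip).
            w = bnb_F.dequantize_4bit(child.weight.data, child.weight.quant_state).to(torch.float16)
            fp16 = nn.Linear(child.in_features, child.out_features, bias=child.bias is not None)
            fp16.weight = nn.Parameter(w)
            if child.bias is not None:
                fp16.bias = nn.Parameter(child.bias.data.to(torch.float16))
            setattr(module, name, fp16)
        else:
            upscale_linear4bit(child)

upscale_linear4bit(model)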

Usage

This model can be loaded directly using the transformers library in FP16, facilitating straightforward integration into existing workflows:

import torch
from transformers import AutoModelForCausalLM

# Load the upscaled checkpoint in half precision; device_map="auto"
# places the weights on the available GPU(s), falling back to CPU.
model = AutoModelForCausalLM.from_pretrained(
  "arnavgrg/mistral-7b-nf4-fp16-upscaled",
  device_map="auto",
  torch_dtype=torch.float16,
)
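
From there, generation follows the standard transformers pattern. A minimal example (the prompt is arbitrary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("arnavgrg/mistral-7b-nf4-fp16-upscaled")
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))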

When to Use This Model

This model is particularly suited for scenarios where:

  • Faster Inference Is Critical: The FP16 upscaling removes the per-layer dequantization work, reducing computational cost during inference.
  • Half-Precision-Friendly Hardware: FP16 weights run efficiently on GPUs with native half-precision support, though they occupy roughly four times the memory of the NF4-packed weights (see the estimate below).
  • Acceptable Fidelity Trade-off: Users are willing to accept a minor reduction in output quality relative to the full-precision base model in exchange for speed.
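
As a rough back-of-envelope estimate of the memory trade-off (parameter count approximated as 7 billion, quantization metadata ignored):

params = 7e9
fp16_gib = params * 2 / 2**30    # ~13 GiB of weights at 2 bytes per parameter
nf4_gib = params * 0.5 / 2**30   # ~3.3 GiB packed at 4 bits per parameter
print(f"FP16: {fp16_gib:.1f} GiB, NF4: {nf4_gib:.1f} GiB")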

Popular Sampler Settings

Top 3 parameter combinations used by Featherless users for this model, spanning the following sampler parameters: temperature, top_p, top_k, frequency_penalty, presence_penalty, repetition_penalty, and min_p.
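
Most of these map directly onto transformers generation arguments (frequency_penalty and presence_penalty are OpenAI-style API parameters rather than transformers keywords; min_p requires a recent transformers release). The values below are placeholders for illustration, not the actual Featherless user configurations:

outputs = model.generate(
  **inputs,
  do_sample=True,
  temperature=0.7,        # placeholder value, not a user statistic
  top_p=0.9,
  top_k=40,
  repetition_penalty=1.1,
  min_p=0.05,             # supported in recent transformers versions
  max_new_tokens=64,
)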