arnavgrg/llama-2-13b-chat-nf4-fp16-upscaled
Text generation · Concurrency cost: 1 · Model size: 13B · Quant: FP8 · Context length: 4K · License: apache-2.0 · Architecture: Transformer · Open weights

arnavgrg/llama-2-13b-chat-nf4-fp16-upscaled is an upscaled 13 billion parameter Llama-2-chat variant, developed by arnavgrg. This model features linear layers upscaled to FP16 after initial NF4 4-bit quantization, aiming to reduce inference-time quantization/dequantization overhead. It is designed for chat applications, offering a balance between performance and computational efficiency, though with some loss in fidelity compared to the original base model due to the quantization process.


Model Overview

This model, arnavgrg/llama-2-13b-chat-nf4-fp16-upscaled, is a specialized variant of Meta's Llama-2-13b-chat base model. It has undergone a two-step processing pipeline: NF4 4-bit quantization followed by upscaling of its linear layers to FP16.

Key Characteristics

  • Quantization Strategy: Initially quantized using NF4 (NormalFloat 4-bit) via bitsandbytes.
  • Upscaled Precision: The 4-bit linear layers are subsequently upscaled to FP16. This approach aims to mitigate the performance cost associated with on-the-fly quantization/dequantization during inference.
  • Lossy Conversion: It's important to note that the initial NF4 quantization is a lossy operation. Consequently, this model's performance will not be identical to the original, unquantized Llama-2-13b-chat base model.

Usage

This model can be loaded and utilized directly with the transformers library, specifying torch_dtype=torch.float16 for optimal use of its upscaled precision.
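A minimal loading sketch (the generation settings are illustrative, and the prompt helper follows the standard Llama-2-chat `[INST]` template):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "arnavgrg/llama-2-13b-chat-nf4-fp16-upscaled"

def format_prompt(user_message: str, system_prompt: str = "") -> str:
    """Wrap a message in the standard Llama-2-chat instruction template."""
    if system_prompt:
        return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"
    return f"[INST] {user_message} [/INST]"

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float16,  # matches the upscaled FP16 linear layers
        device_map="auto",
    )
    prompt = format_prompt("Explain NF4 quantization in one paragraph.")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Passing `torch_dtype=torch.float16` keeps the weights in the precision they were saved at, avoiding an unnecessary FP32 upcast on load.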

Popular Sampler Settings

The top three parameter combinations used by Featherless users for this model cover the following sampler settings:

  • temperature
  • top_p
  • top_k
  • frequency_penalty
  • presence_penalty
  • repetition_penalty
  • min_p