arnavgrg/llama-2-13b-chat-nf4-fp16-upscaled is an upscaled 13 billion parameter Llama-2-chat variant, developed by arnavgrg. This model features linear layers upscaled to FP16 after initial NF4 4-bit quantization, aiming to reduce inference-time quantization/dequantization overhead. It is designed for chat applications, offering a balance between performance and computational efficiency, though with some loss in fidelity compared to the original base model due to the quantization process.
Model Overview
This model, arnavgrg/llama-2-13b-chat-nf4-fp16-upscaled, is a specialized variant of Meta's Llama-2-13b-chat base model. It has undergone a unique processing pipeline involving NF4 4-bit quantization followed by upscaling its linear layers to FP16.
Key Characteristics
- Quantization Strategy: Initially quantized to NF4 (NormalFloat 4-bit) via bitsandbytes.
- Upscaled Precision: The 4-bit linear layers are subsequently upscaled to FP16. This approach aims to mitigate the performance cost of on-the-fly quantization/dequantization during inference.
- Lossy Conversion: NF4 quantization is a lossy operation, so this model's outputs will not exactly match those of the original, unquantized Llama-2-13b-chat base model.
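To see why the conversion is lossy, note that NF4 maps each weight (after per-block absmax scaling) to the nearest of only 16 fixed code values, so the round-trip error is nonzero even after upscaling back to FP16. The sketch below is an illustrative simulation, not the bitsandbytes implementation; the 16 code values are the NF4 quantiles from the QLoRA paper, and the block size of 64 matches the bitsandbytes default.

```python
import numpy as np

# The 16 NF4 code values (NormalFloat quantiles from the QLoRA paper).
NF4_CODES = np.array([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.4407098591327667, 0.5626170039176941,
    0.7229568362236023, 1.0,
])

def nf4_roundtrip(x: np.ndarray, block_size: int = 64) -> np.ndarray:
    """Quantize to NF4 (nearest code after per-block absmax scaling), then dequantize."""
    blocks = x.reshape(-1, block_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)   # per-block scale
    normed = blocks / absmax                             # values now in [-1, 1]
    idx = np.abs(normed[..., None] - NF4_CODES).argmin(axis=-1)  # nearest code
    return (NF4_CODES[idx] * absmax).reshape(-1)         # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)   # toy stand-in for a weight tensor
w_hat = nf4_roundtrip(w)
err = float(np.abs(w - w_hat).mean())
print(f"mean abs round-trip error: {err:.4f}")  # nonzero: the conversion is lossy
```

Upscaling the dequantized weights to FP16 stores them without further loss, but it cannot recover the information already discarded by the 4-bit mapping.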
Usage
This model can be loaded directly with the transformers library; specify torch_dtype=torch.float16 so the upscaled FP16 weights are used as-is.
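A minimal loading sketch is below. The `load_model` helper and the chat prompt are illustrative, not part of the model card; it assumes enough memory for a 13B model in FP16 (roughly 26 GB of weights) and uses the standard `AutoModelForCausalLM.from_pretrained` API with `device_map="auto"` for automatic placement.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "arnavgrg/llama-2-13b-chat-nf4-fp16-upscaled"

def load_model(model_id: str = MODEL_ID):
    """Load tokenizer and model in FP16, matching the upscaled precision."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # use the upscaled FP16 weights directly
        device_map="auto",          # spread layers across available devices
    )
    return tokenizer, model

if __name__ == "__main__":
    tokenizer, model = load_model()
    # Llama-2-chat instruction format
    prompt = "[INST] What is NF4 quantization? [/INST]"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the weights are already stored in FP16, no bitsandbytes quantization config is needed at load time.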