z-lab/Llama-2-7b-hf-PARO
Text generation · Concurrency cost: 1 · Model size: 7B · Quant: FP8 · Context length: 4k · Published: Oct 29, 2025 · License: llama2 · Architecture: Transformer

z-lab/Llama-2-7b-hf-PARO is a 4-bit quantized version of the Llama-2-7b-hf model, produced by z-lab with their ParoQuant method. ParoQuant is an INT4 quantization technique designed to minimize the accuracy gap with FP16 models while achieving near-AWQ inference speeds. The quantization is tuned to preserve reasoning accuracy, and the model can be deployed on NVIDIA GPUs (via vLLM or Transformers) and on Apple Silicon (via MLX). It is well suited to applications that need high-performance, resource-efficient LLM inference.
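As a minimal sketch of the Transformers deployment path mentioned above (assuming the repository's quantization config is picked up automatically by `from_pretrained` and that a suitable GPU is available; the prompt and generation parameters are illustrative, not from the model card):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "z-lab/Llama-2-7b-hf-PARO"

# Load the tokenizer and the quantized model; device_map="auto" places
# the weights on the available GPU(s).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Run a simple generation to verify the model works end to end.
prompt = "Explain quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For serving rather than one-off generation, the same checkpoint can be pointed at from vLLM or, on Apple Silicon, loaded through MLX, per the deployment targets listed above.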