NanoLLM Qwen v3.1 Overview
RthItalia/NanoLLM-Qwen2.5-3B-v3.1 is a 3B-parameter model produced with the NanoLLM quantization pipeline, which creates compact overlay artifacts for Qwen2.5 models. The pipeline starts from a base Qwen2.5 model loaded in bitsandbytes 8-bit mode, then selectively replaces modules with TrueQuantLinear modules. This process reduces the model's footprint and improves runtime efficiency while preserving performance.
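The 8-bit default and the experimental 4-bit override can be sketched as a small environment-variable check. `select_load_kwargs` is a hypothetical helper written for illustration, not part of the published pipeline:

```python
import os

def select_load_kwargs(env=None):
    """Hypothetical helper: pick bitsandbytes loading flags for the base model.

    8-bit loading is the default; setting NANO_LOAD_4BIT=1 opts into the
    experimental 4-bit path described in this card.
    """
    env = os.environ if env is None else env
    if env.get("NANO_LOAD_4BIT") == "1":
        return {"load_in_4bit": True}
    return {"load_in_8bit": True}
```

In practice, flags like these would be passed to transformers' `from_pretrained` (or expressed as a `BitsAndBytesConfig`); the exact wiring inside the NanoLLM pipeline is not published.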
Key Characteristics
- Quantization Method: Employs a proprietary NanoLLM cascade to replace specific modules with TrueQuantLinear for efficient inference.
- Base Model Loading: The base Qwen2.5 model is loaded in 8-bit mode, with experimental support for 4-bit loading via `NANO_LOAD_4BIT=1`.
- Performance Validation: Artifacts are validated against an 8-bit reference, ensuring an average next-token-logit cosine similarity of 0.99 or higher, indicating high fidelity to the original model's output.
- Compact Size: The artifacts are designed to be compact, facilitating easier deployment and reduced memory footprint.
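The validation criterion can be sketched as plain cosine similarity averaged over per-position next-token logits. This is an illustrative reimplementation of the stated metric, not the pipeline's actual validation code:

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|) between two logit vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def passes_validation(ref_logits, quant_logits, threshold=0.99):
    # Average the per-position cosine similarity between the 8-bit
    # reference logits and the quantized artifact's logits, then compare
    # against the 0.99 threshold stated in this card.
    sims = [cosine_similarity(r, q) for r, q in zip(ref_logits, quant_logits)]
    return sum(sims) / len(sims) >= threshold
```

In a real check, `ref_logits` and `quant_logits` would come from forward passes of the 8-bit reference and the quantized artifact over the same evaluation prompts.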
Use Cases
- Efficient Deployment: Ideal for environments where memory and computational resources are constrained, allowing for the use of Qwen2.5 models with reduced overhead.
- Research and Evaluation: Published artifacts are suitable for research and evaluation of the NanoLLM quantization methodology.
- Language Generation: Capable of general language generation tasks, as demonstrated by the quick start example for Python function generation.