RedHatAI/Mistral-Nemo-Instruct-2407-FP8

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 12B · Quant: FP8 · Ctx Length: 32k · Published: Jul 18, 2024 · License: apache-2.0 · Architecture: Transformer

RedHatAI/Mistral-Nemo-Instruct-2407-FP8 is a 12 billion parameter instruction-tuned causal language model developed by Neural Magic. This model is an FP8 quantized version of Mistral-Nemo-Instruct-2407, optimized for reduced disk size and GPU memory requirements while maintaining high performance. It achieves an average OpenLLM benchmark score of 71.28, making it suitable for assistant-like chat applications in English.


Model Overview

RedHatAI/Mistral-Nemo-Instruct-2407-FP8 is a 12 billion parameter instruction-tuned language model developed by Neural Magic. It is a quantized version of the Mistral-Nemo-Instruct-2407 model, specifically optimized using FP8 weight and activation quantization. This optimization significantly reduces the model's disk size and GPU memory footprint by approximately 50% compared to its 16-bit counterpart, making it highly efficient for deployment.
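The roughly 50% reduction follows directly from the bytes stored per parameter: FP8 uses one byte per weight where a 16-bit format uses two. A back-of-the-envelope sketch (parameter count rounded to 12B; real checkpoints also carry embeddings, norms, and quantization scales, so treat these as approximate lower bounds):

```python
def weight_memory_gb(num_params: int, bytes_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

params = 12_000_000_000  # ~12B parameters, rounded

bf16_gb = weight_memory_gb(params, 2.0)  # 16-bit baseline: 2 bytes/param
fp8_gb = weight_memory_gb(params, 1.0)   # FP8: 1 byte/param

print(f"BF16: ~{bf16_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB "
      f"({1 - fp8_gb / bf16_gb:.0%} smaller)")
```

The same halving applies to activations when they are quantized to FP8 as well, which is what makes the end-to-end GPU memory footprint shrink by about half.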

Key Capabilities & Optimizations

  • FP8 Quantization: Weights and activations are quantized to FP8 data types, enabling efficient inference with vLLM (version >= 0.5.0).
  • Performance Retention: Despite quantization, the model largely retains the performance of the unquantized version, achieving an average score of 71.28 on the OpenLLM benchmark (version 1), very close to the unquantized model's 71.61.
  • Architecture: Based on the Mistral-Nemo architecture, designed for text-to-text generation.
  • Intended Use: Primarily designed for commercial and research use in English, particularly for assistant-like chat applications.

Evaluation Highlights

Evaluation on the OpenLLM leaderboard shows strong performance across various tasks, with minimal degradation due to quantization:

  • MMLU (5-shot): 68.50 (100.2% recovery)
  • ARC Challenge (25-shot): 64.68 (98.70% recovery)
  • GSM-8K (5-shot): 73.01 (98.06% recovery)
  • Average Score: 71.28 (99.53% recovery)
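"Recovery" here is simply the quantized model's score expressed as a percentage of the unquantized baseline. A minimal sketch of the computation, using the average scores from this card (small differences from the table's figures can come from rounding of the per-task scores):

```python
def recovery(quantized_score: float, baseline_score: float) -> float:
    """Quantized score as a percentage of the unquantized baseline."""
    return 100.0 * quantized_score / baseline_score

# Average OpenLLM scores from this card: FP8 = 71.28, unquantized = 71.61.
print(f"{recovery(71.28, 71.61):.1f}%")
```

Values slightly above 100%, as with MMLU here, just mean the quantized model happened to score marginally higher than the baseline on that task.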

Deployment

This model is designed for efficient deployment using the vLLM backend, supporting both direct Python integration and OpenAI-compatible serving.
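As a sketch, offline inference through vLLM's Python API might look like the following. The model ID is taken from this card; the sampling parameters and prompt are illustrative choices, and running this requires a GPU plus vLLM >= 0.5.0 as noted above:

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Mistral-Nemo-Instruct-2407-FP8"

# Format the conversation with the model's own chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "user", "content": "Summarize FP8 quantization in one sentence."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# vLLM loads the FP8 checkpoint directly.
llm = LLM(model=model_id)
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```

For OpenAI-compatible serving, vLLM's built-in server (`vllm serve RedHatAI/Mistral-Nemo-Instruct-2407-FP8`) exposes the same model behind the standard chat-completions endpoint.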