RedHatAI/Mistral-Nemo-Instruct-2407-FP8

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 12B · Quant: FP8 · Ctx Length: 32k · Published: Jul 18, 2024 · License: apache-2.0 · Architecture: Transformer

RedHatAI/Mistral-Nemo-Instruct-2407-FP8 is a 12 billion parameter instruction-tuned causal language model developed by Neural Magic. This model is an FP8 quantized version of Mistral-Nemo-Instruct-2407, optimized for reduced disk size and GPU memory requirements while maintaining high performance. It achieves an average OpenLLM benchmark score of 71.28, making it suitable for assistant-like chat applications in English.


Model Overview

RedHatAI/Mistral-Nemo-Instruct-2407-FP8 is a 12 billion parameter instruction-tuned language model developed by Neural Magic. It is a quantized version of the Mistral-Nemo-Instruct-2407 model, specifically optimized using FP8 weight and activation quantization. This optimization significantly reduces the model's disk size and GPU memory footprint by approximately 50% compared to its 16-bit counterpart, making it highly efficient for deployment.
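The roughly 50% reduction follows directly from the bytes stored per parameter: FP8 uses one byte per weight where a 16-bit format uses two. A back-of-the-envelope sketch (parameter count rounded to 12B; real checkpoints also carry embeddings, norms, and quantization scales, so treat these as approximate lower bounds):

```python
def weight_memory_gb(num_params: int, bytes_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

params = 12_000_000_000  # ~12B parameters, rounded

bf16_gb = weight_memory_gb(params, 2.0)  # 16-bit baseline: 2 bytes/param
fp8_gb = weight_memory_gb(params, 1.0)   # FP8: 1 byte/param

print(f"BF16: ~{bf16_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB "
      f"({1 - fp8_gb / bf16_gb:.0%} smaller)")
```

The same halving applies to activations when they are quantized to FP8 as well, which is what makes the end-to-end GPU memory footprint shrink by about half.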

Key Capabilities & Optimizations

  • FP8 Quantization: Weights and activations are quantized to FP8 data types, enabling efficient inference with vLLM (version >= 0.5.0).
  • Performance Retention: Despite quantization, the model largely retains the performance of the unquantized version, achieving an average score of 71.28 on the OpenLLM benchmark (version 1), very close to the unquantized model's 71.61.
  • Architecture: Based on the Mistral-Nemo architecture, designed for text-to-text generation.
  • Intended Use: Primarily designed for commercial and research use in English, particularly for assistant-like chat applications.

Evaluation Highlights

Evaluation on the OpenLLM leaderboard shows strong performance across various tasks, with minimal degradation due to quantization:

  • MMLU (5-shot): 68.50 (100.2% recovery)
  • ARC Challenge (25-shot): 64.68 (98.70% recovery)
  • GSM-8K (5-shot): 73.01 (98.06% recovery)
  • Average Score: 71.28 (99.53% recovery)
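"Recovery" here is simply the quantized model's score expressed as a percentage of the unquantized baseline. A minimal sketch of the computation, using the average scores from this card (small differences from the table's figures can come from rounding of the per-task scores):

```python
def recovery(quantized_score: float, baseline_score: float) -> float:
    """Quantized score as a percentage of the unquantized baseline."""
    return 100.0 * quantized_score / baseline_score

# Average OpenLLM scores from this card: FP8 = 71.28, unquantized = 71.61.
print(f"{recovery(71.28, 71.61):.1f}%")
```

Values slightly above 100%, as with MMLU here, just mean the quantized model happened to score marginally higher than the baseline on that task.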

Deployment

This model is designed for efficient deployment using the vLLM backend, supporting both direct Python integration and OpenAI-compatible serving.
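As a sketch, offline inference through vLLM's Python API might look like the following. The model ID is taken from this card; the sampling parameters and prompt are illustrative choices, and running this requires a GPU plus vLLM >= 0.5.0 as noted above:

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Mistral-Nemo-Instruct-2407-FP8"

# Format the conversation with the model's own chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "user", "content": "Summarize FP8 quantization in one sentence."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# vLLM loads the FP8 checkpoint directly.
llm = LLM(model=model_id)
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```

For OpenAI-compatible serving, vLLM's built-in server (`vllm serve RedHatAI/Mistral-Nemo-Instruct-2407-FP8`) exposes the same model behind the standard chat-completions endpoint.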