mgoin/Mistral-Nemo-Instruct-2407-FP8-Dynamic

Text generation · Concurrency cost: 1 · Model size: 12B · Quant: FP8 · Ctx length: 32k · Published: Jul 18, 2024 · License: apache-2.0 · Architecture: Transformer · Open weights

mgoin/Mistral-Nemo-Instruct-2407-FP8-Dynamic is a 12-billion-parameter instruction-tuned causal language model, developed jointly by Mistral AI and NVIDIA and quantized to FP8 weights with dynamic activation quantization for efficient deployment in vLLM. It is a fine-tuned version of Mistral-Nemo-Base-2407, featuring a 128k-token native context window (listed here with a 32k serving limit) and strong performance on general, multilingual, and code benchmarks. The quantization targets high-throughput inference while keeping accuracy close to the original model, making it suitable for a wide range of generative AI applications.
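To give a feel for what "FP8 weights with dynamic activations" means, the sketch below simulates the dynamic-scaling step in plain Python: the scale is computed on the fly from the observed activation maximum so that values fill the FP8 e4m3 range. This is a coarse simulation (real FP8 rounds to e4m3's mantissa grid, not an integer grid), and all names here are illustrative, not from this model's actual kernels.

```python
# Coarse simulation of dynamic FP8 (e4m3) activation quantization.
# "Dynamic" means the scale is derived from each batch's observed max
# at runtime, rather than calibrated offline.

E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def dynamic_fp8_quantize(values):
    """Scale values so the observed max maps to E4M3_MAX, then round.

    Rounding to an integer grid stands in for e4m3's real mantissa
    rounding; the point is the per-batch (dynamic) scale computation.
    """
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / E4M3_MAX                      # computed on the fly
    q = [max(-E4M3_MAX, min(E4M3_MAX, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

acts = [0.03, -1.7, 2.5, 0.0004]
q, scale = dynamic_fp8_quantize(acts)
recon = dequantize(q, scale)
```

Because the scale adapts to each input, large activations survive intact while the quantization error stays within one grid step of the true value.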


Overview

This model, mgoin/Mistral-Nemo-Instruct-2407-FP8-Dynamic, is an FP8-quantized version of the Mistral-Nemo-Instruct-2407 LLM, developed by Mistral AI and NVIDIA. Its weights are compressed to FP8 and its activations are quantized dynamically at runtime, targeting efficient serving in vLLM with a good balance of performance and resource use. The base model is an instruction-tuned variant of Mistral-Nemo-Base-2407, designed to significantly outperform other models of similar or smaller scale.

Key Capabilities & Features

  • Quantized for Efficiency: Compressed to FP8 weights with dynamic activations, ideal for high-throughput inference in vLLM.
  • Robust Architecture: 40 layers, a 5,120 hidden dimension, and a ~128k-entry vocabulary, with SwiGLU activations and Grouped-Query Attention (GQA) using 8 KV heads.
  • Extensive Context Window: Supports a 128k context window, enabling processing of long inputs and complex tasks.
  • Multilingual & Code Proficiency: Trained on a substantial amount of multilingual and code data, enhancing its versatility.
  • Strong Benchmark Performance: Achieves notable scores on benchmarks such as MMLU (68.0%), HellaSwag (83.5%), and Winogrande (76.8%), alongside competitive multilingual MMLU scores (e.g., French 62.3%, German 62.7%).
  • Apache 2.0 License: Released under a permissive license, allowing broad usage and deployment.

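As a rough illustration of what the GQA configuration above buys, the sketch below estimates per-token KV-cache size from the listed 40 layers and 8 KV heads. The 128-dim head size and fp16 cache dtype are assumptions for the sake of the arithmetic, not figures from this card.

```python
# Back-of-envelope KV-cache sizing for a GQA model.
# From the card: 40 layers, 8 KV heads.
# Assumed (illustrative): head_dim = 128, fp16 cache (2 bytes/element).

layers, kv_heads = 40, 8
head_dim, bytes_per_elt = 128, 2

def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elt):
    # Factor of 2: both the key and the value are cached per head per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elt

per_token = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elt)
ctx_128k_gib = per_token * 128 * 1024 / 2**30  # full 128k context, in GiB

print(per_token)                # 163840 bytes (160 KiB) per token
print(round(ctx_128k_gib, 1))  # 20.0 GiB at full context
```

With only 8 KV heads instead of one KV head per query head, the cache is several times smaller than a full multi-head layout would require, which is what makes long-context, high-throughput serving practical.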
Usage & Deployment

This model is designed for easy integration with vLLM; the base, unquantized model can also be used with mistral_inference and Hugging Face transformers. It supports chat and function-calling capabilities, making it suitable for interactive AI applications. The developers recommend a temperature of 0.3 for best generation quality.
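A minimal sketch of calling the model through vLLM's OpenAI-compatible server, assuming it was started with `vllm serve mgoin/Mistral-Nemo-Instruct-2407-FP8-Dynamic` on the default port (the URL and port are assumptions, not from this card). The snippet only builds the chat-completions payload, using the recommended temperature of 0.3.

```python
import json

# Sketch: build a chat-completions request for a vLLM OpenAI-compatible
# server. Assumed setup (not from this card):
#   vllm serve mgoin/Mistral-Nemo-Instruct-2407-FP8-Dynamic
# listening at http://localhost:8000/v1/chat/completions.

def build_chat_request(prompt, temperature=0.3, max_tokens=256):
    """Return the JSON payload for a /v1/chat/completions request."""
    return {
        "model": "mgoin/Mistral-Nemo-Instruct-2407-FP8-Dynamic",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,  # 0.3, as recommended by the developers
        "max_tokens": max_tokens,
    }

payload = build_chat_request("Summarize grouped-query attention in one sentence.")
body = json.dumps(payload)
# Send `body` with Content-Type: application/json, e.g. via urllib.request.
```

The same payload shape works with any OpenAI-compatible client library pointed at the vLLM server's base URL.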

Limitations

The Mistral Nemo Instruct model is presented as a demonstration of the base model's fine-tuning potential. It ships without built-in moderation mechanisms, and the developers are engaging the community on guardrails to enable safe deployment in environments that require moderated outputs.