nm-testing/convert_ct_dequant-e2e

Text Generation
Concurrency Cost: 1 | Model Size: 1.1B | Quant: BF16 | Ctx Length: 2k | Published: May 27, 2026 | License: apache-2.0 | Architecture: Transformer
Open Weights | Warm

The nm-testing/convert_ct_dequant-e2e model is a 1.1-billion-parameter autoregressive language model, a quantized version of Alibaba's Qwen3-8B. It was developed by NVIDIA using TensorRT Model Optimizer and is intended for efficient deployment in AI agent systems, chatbots, and RAG systems. It targets inference on NVIDIA GPU-accelerated systems and uses FP4 quantization to speed up inference.


Model Overview

nm-testing/convert_ct_dequant-e2e is a 1.1-billion-parameter autoregressive language model. It is a quantized version of Alibaba's Qwen3-8B, developed by NVIDIA using TensorRT Model Optimizer. Both the weights and the activations are quantized to the FP4 data type, so the model is ready for inference with TensorRT-LLM.
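To give a sense of what the bit width means for storage, here is a minimal sketch of the arithmetic, using the 1.1B parameter count from the listing. The helper `weight_memory_gb` is a hypothetical name introduced for illustration, and the estimate deliberately ignores quantization scale factors, activations, and the KV cache, so real footprints will be somewhat larger.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Assumption: every weight is stored at the same bit width, with no
    per-block scale factors or other metadata counted.
    """
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 1.1e9  # parameter count as stated in the listing

bf16_gb = weight_memory_gb(PARAMS, 16)  # 16 bits per weight
fp4_gb = weight_memory_gb(PARAMS, 4)    # 4 bits per weight

print(f"BF16: {bf16_gb:.2f} GB, FP4: {fp4_gb:.2f} GB")
```

Under these assumptions, 4-bit weights take one quarter of the space of 16-bit weights: about 0.55 GB instead of about 2.2 GB for 1.1 billion parameters.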

Key Capabilities

  • Quantized Performance: Uses FP4 quantization for optimized inference on NVIDIA GPU-accelerated systems, including those based on the Blackwell microarchitecture.
  • Transformer Architecture: Based on Qwen3-8B's optimized transformer architecture.
  • Text-to-Text Generation: Processes text input (strings) to generate text output (strings).
  • Commercial Use: Licensed under Apache 2.0, suitable for both commercial and non-commercial applications.

Use Cases

This model is aimed at developers who want a pre-quantized model to deploy in:

  • AI Agent systems
  • Chatbots
  • RAG (Retrieval Augmented Generation) systems
  • Other AI-powered applications requiring efficient, GPU-accelerated inference.
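For the RAG use case, the 2k context length in the listing limits how much retrieved text fits in one prompt. The following is a minimal sketch of packing passages into a fixed budget. All names here (`build_rag_prompt`, `approx_tokens`, `CONTEXT_BUDGET`) are hypothetical, the value 2048 is an assumed reading of "2k", and the whitespace word count is only a rough stand-in for the model's real tokenizer.

```python
from typing import List

CONTEXT_BUDGET = 2048  # assumption: "Ctx Length: 2k" read as 2048 tokens


def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (not the real tokenizer)."""
    return len(text.split())


def build_rag_prompt(
    question: str,
    passages: List[str],
    budget: int = CONTEXT_BUDGET,
    reserve_for_answer: int = 256,
) -> str:
    """Pack passages, in ranked order, until the estimated budget is used up."""
    header = "Answer the question using only the context below.\n\nContext:\n"
    footer = f"\n\nQuestion: {question}\nAnswer:"
    # Budget left for passages after the fixed text and the answer reserve.
    remaining = budget - reserve_for_answer
    remaining -= approx_tokens(header) + approx_tokens(footer)

    kept = []
    for passage in passages:
        line = f"- {passage}"
        cost = approx_tokens(line)
        if cost > remaining:
            break  # stop at the first passage that no longer fits
        kept.append(line)
        remaining -= cost

    return header + "\n".join(kept) + footer


prompt = build_rag_prompt(
    "What data type are the weights quantized to?",
    ["The weights and activations are quantized to FP4."],
)
print(prompt)
```

Stopping at the first passage that does not fit keeps the retrieval ranking intact. A production system would count tokens with the model's own tokenizer rather than by words.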