nvidia/Qwen3-Nemotron-32B-RLBFF

Hugging Face
TEXT GENERATIONConcurrency Cost:2Model Size:32BQuant:FP8Ctx Length:32kPublished:Oct 12, 2025License:otherArchitecture:Transformer0.0K Warm

The nvidia/Qwen3-Nemotron-32B-RLBFF is a 32 billion parameter large language model developed by NVIDIA, built upon the Qwen/Qwen3-32B foundation. It is fine-tuned using Reinforcement Learning from Binary Flexible Feedback (RLBFF) to enhance the quality of LLM-generated responses in a default thinking mode. This research model excels at generating responses to multi-turn user queries, demonstrating improved performance on benchmarks like Arena Hard V2, WildBench, and MT Bench compared to its base model.

Loading preview...

Model Overview

The nvidia/Qwen3-Nemotron-32B-RLBFF is a 32 billion parameter large language model developed by NVIDIA, based on the Qwen/Qwen3-32B architecture. This research model is specifically fine-tuned using Reinforcement Learning from Binary Flexible Feedback (RLBFF) to significantly improve the quality of its responses, particularly in conversational contexts. It is designed to generate coherent and high-quality replies to the final user turn in a multi-turn conversation.

Key Capabilities & Performance

  • Enhanced Response Quality: Fine-tuned with RLBFF to produce superior LLM-generated responses.
  • Strong Benchmark Performance: Achieves 55.6% on Arena Hard V2, 70.33% on WildBench, and 9.50 on MT Bench, outperforming the base Qwen3-32B model and showing comparable performance to models like DeepSeek R1 and O3-mini at a fraction of the inference cost.
  • Context Length: Supports a maximum input of 128k tokens, though it was trained on conversations up to 4K tokens.
  • Research Focus: Released to support the research paper on RLBFF (arXiv:2509.21319).

Use Cases

  • Conversational AI: Ideal for generating responses in multi-turn dialogues.
  • Research & Development: Suitable for researchers exploring advanced fine-tuning techniques and model performance improvements.

This model is optimized for NVIDIA GPU-accelerated systems, leveraging hardware and software frameworks like CUDA for faster inference.