nvidia/Nemotron-Cascade-8B-Thinking

Hugging Face · Text Generation · Open Weights

  • Model size: 8B
  • Quantization: FP8
  • Context length: 32k
  • Concurrency cost: 1
  • Architecture: Transformer
  • Published: Dec 8, 2025
  • License: nvidia-open-model-license

Nemotron-Cascade-8B-Thinking is an 8 billion parameter general-purpose language model developed by NVIDIA, post-trained from Qwen3-8B-Base. It is designed specifically for "thinking" mode tasks, using sequential, domain-wise reinforcement learning to reach best-in-class performance across reasoning, alignment, mathematical, and coding benchmarks. Its strength in complex reasoning makes it well suited to applications that demand advanced problem-solving and analytical thought.

Overview

NVIDIA's Nemotron-Cascade-8B-Thinking is an 8 billion parameter general-purpose language model, built upon the Qwen3-8B-Base architecture. It distinguishes itself through a unique training pipeline involving multi-stage Supervised Fine-Tuning (SFT) followed by Cascade Reinforcement Learning (RL) across multiple domains. This model is exclusively optimized for a "thinking" mode, enhancing its ability to perform complex reasoning tasks.

Key Capabilities

  • Advanced Reasoning: Achieves best-in-class performance across a diverse set of benchmarks including general-knowledge reasoning, mathematical reasoning (e.g., AIME 2024/2025), and competitive programming (LiveCodeBench).
  • Reinforcement Learning Enhancement: Uses RLHF as a preparatory stage to substantially boost complex reasoning, followed by domain-wise RLVR (reinforcement learning with verifiable rewards) stages that refine performance in each domain without degrading earlier gains.
  • Code Performance: Demonstrates strong results on coding benchmarks such as LiveCodeBench (LCB v5, v6) and SWE-bench Verified, with scores comparable to much larger models such as DeepSeek-R1-0528 (671B).
  • Alignment and Instruction Following: Shows robust performance in alignment benchmarks such as ArenaHard and IFBench.
  • Optimized for "Thinking" Mode: Designed specifically for tasks requiring deep analytical thought; its chat template expects a " /think" tag appended to the user input (see the sketch after this list).
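
A minimal sketch of running the model in thinking mode with Hugging Face transformers, assuming a Qwen3-style chat template. The " /think" suffix on the user turn follows the card's note, and the sampling settings from the recommendations below are applied; the exact tag placement is an assumption to verify against the tokenizer's bundled chat template.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Nemotron-Cascade-8B-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Per the model card, the chat template expects a " /think" tag on the user turn.
messages = [
    {"role": "user", "content": "How many primes are there below 100? /think"}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Recommended sampling settings from the card: temperature 0.6, top_p 0.95.
output = model.generate(
    input_ids, max_new_tokens=1024, do_sample=True, temperature=0.6, top_p=0.95
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```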

Usage Recommendations

  • Sampling Parameters: For local deployment, temperature = 0.6 and top_p = 0.95 are recommended (both are applied in the sketches on this page).
  • Long Context Support: Extended context lengths are supported via RoPE scaling with the YaRN method; a scaling factor of 2.0 is recommended for this model across all benchmarks (see the loading sketch after this list).
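
For longer inputs, the YaRN factor can be supplied at load time. The following is a hedged sketch using transformers' rope_scaling override with a Qwen3-style schema; everything other than the factor of 2.0 (which comes from this card) is an assumption to check against the model's config.json.

```python
from transformers import AutoModelForCausalLM

# Hedged sketch: enable YaRN RoPE scaling when loading with transformers.
model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Nemotron-Cascade-8B-Thinking",
    torch_dtype="auto",
    device_map="auto",
    rope_scaling={
        "rope_type": "yarn",
        "factor": 2.0,  # recommended in the model card
        "original_max_position_embeddings": 32768,  # assumed; verify in config.json
    },
)
```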

Good For

  • Applications requiring strong general-purpose reasoning and problem-solving.
  • Tasks involving mathematical and logical deduction.
  • Code generation and software engineering challenges.
  • Scenarios where a model's "thought process" or intermediate reasoning steps are beneficial.