Nemotron-Cascade-8B is a general-purpose language model developed by NVIDIA, post-trained from Qwen3-8B-Base through a sequential, domain-wise reinforcement learning (Cascade RL) pipeline. It can operate in both 'thinking' and 'instruct' modes. Despite its 8-billion-parameter size, it achieves best-in-class results across diverse benchmarks covering knowledge reasoning, alignment, mathematics, and competitive programming, notably matching the LiveCodeBench scores of much larger models such as DeepSeek-R1-0528 (671B). Its primary use case is complex reasoning and instruction following, with support for long contexts of up to 64K tokens via YaRN scaling.
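Below is a minimal usage sketch with Hugging Face Transformers. It assumes the checkpoint is published under the repo id `nvidia/Nemotron-Cascade-8B` and that, like its Qwen3 base, the chat template accepts an `enable_thinking` flag to switch between 'thinking' and 'instruct' modes; the YaRN values in the commented-out override are likewise assumptions based on typical Qwen3 settings, not confirmed model defaults.

```python
# Minimal sketch: load the model and generate in 'thinking' mode.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Nemotron-Cascade-8B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    # Optional: extend context toward 64K with YaRN. Factor and base length are
    # assumptions based on common Qwen3 settings; check the model config first.
    # rope_scaling={"rope_type": "yarn", "factor": 2.0,
    #               "original_max_position_embeddings": 32768},
)

messages = [{"role": "user", "content": "Prove that the sum of two even integers is even."}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # set False for 'instruct' mode (assumed Qwen3-style flag)
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```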