nvidia/Llama-3.1-Nemotron-Nano-8B-v1

Text generation · Model size: 8B · Quant: FP8 · Context length: 32k · Concurrency cost: 1 · Published: Mar 16, 2025 · License: nvidia-open-model-license · Architecture: Transformer · Open weights

nvidia/Llama-3.1-Nemotron-Nano-8B-v1 is an 8-billion-parameter large language model developed by NVIDIA, derived from Meta's Llama-3.1-8B-Instruct. It is post-trained specifically for enhanced reasoning, human chat preferences, RAG, and tool calling, offering a balance of accuracy and efficiency. This deployment supports a 32,768-token context length, and the model is optimized to run on a single RTX GPU, making it suitable for local use in AI agent systems and chatbots.


Model Overview

NVIDIA's Llama-3.1-Nemotron-Nano-8B-v1 is an 8-billion-parameter large language model derived from Meta's Llama-3.1-8B-Instruct. It has undergone a multi-phase post-training process: supervised fine-tuning for math, code, reasoning, and tool calling, followed by reinforcement learning stages (REINFORCE and online Reward-aware Preference Optimization) for chat and instruction following. The model aims to provide a strong balance between accuracy and computational efficiency, and can run locally on a single RTX GPU.

Key Capabilities

  • Enhanced Reasoning: Significantly improved performance in reasoning tasks, as demonstrated by benchmarks like MATH500 (95.4% pass@1 with reasoning on) and AIME25 (47.1% pass@1 with reasoning on).
  • Instruction Following & Chat: Optimized for human chat preferences and general instruction following, with specific modes for "Reasoning On" and "Reasoning Off" controlled via system prompts.
  • Tool Calling: Features improved capabilities for tool calling, as indicated by BFCL v2 Live scores.
  • Code Generation: Strong performance in code generation, achieving 84.6% pass@1 on MBPP 0-shot with reasoning on.
  • Multilingual Support: Primarily intended for English and coding languages, with support for other non-English languages including German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
  • Extended Context: The base model supports a context length of up to 131,072 tokens (this deployment serves a 32,768-token context).
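
The "Reasoning On" / "Reasoning Off" modes mentioned above are selected entirely through the system prompt. A minimal sketch of how a chat request might be assembled for each mode (the "detailed thinking on" / "detailed thinking off" phrasing follows NVIDIA's model card; `build_messages` is an illustrative helper, not part of any library):

```python
def build_messages(user_prompt: str, reasoning: bool) -> list[dict]:
    """Build an OpenAI-style message list for Llama-3.1-Nemotron-Nano-8B-v1.

    Per NVIDIA's model card, the reasoning mode is toggled by the system
    prompt: "detailed thinking on" enables step-by-step reasoning traces,
    "detailed thinking off" produces direct answers.
    """
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

# Reasoning mode for a math problem; the model emits its working first.
messages = build_messages("Solve 2x + 3 = 7 for x.", reasoning=True)
```

The resulting list can be passed as the `messages` field of any OpenAI-compatible chat-completions request, or through a tokenizer's chat template for local inference.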

Good For

  • Developers building AI Agent systems, chatbots, and RAG systems.
  • Applications requiring a strong balance of model accuracy and compute efficiency.
  • Local deployment on single RTX GPUs.
  • Tasks involving complex reasoning, mathematical problem-solving, and code generation.

Popular Sampler Settings

Top 3 parameter combinations used by Featherless users for this model, covering: temperature, top_p, top_k, frequency_penalty, presence_penalty, repetition_penalty, and min_p.
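
These sampler parameters map directly onto the body of an OpenAI-compatible generation request. A sketch of how they might be attached to a chat-completions payload; the values below are illustrative placeholders, not the actual top Featherless configurations:

```python
# Hypothetical sampler values for illustration only; the popular
# combinations for this model are listed on the model page.
sampler = {
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
    "repetition_penalty": 1.1,
    "min_p": 0.05,
}

# Merge the sampler settings into a chat-completions request body.
payload = {
    "model": "nvidia/Llama-3.1-Nemotron-Nano-8B-v1",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 512,
    **sampler,
}
```

Note that `top_k`, `repetition_penalty`, and `min_p` are extensions beyond the core OpenAI API; support for them varies by serving backend, so confirm which fields your endpoint accepts.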