
Llama 3.1 Nemotron 70B Instruct

NVIDIA

Llama 3.1 Nemotron 70B Instruct is an instruction-tuned language model from NVIDIA, built on the Llama 3.1 70B architecture with Nemotron enhancements. It excels at instruction following, conversation, reasoning, and coding, and is optimized for production deployment across enterprise use cases.
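
The snippet below is a minimal inference sketch using the Hugging Face transformers library. The checkpoint id (nvidia/Llama-3.1-Nemotron-70B-Instruct-HF) and the generation settings are assumptions for illustration, not details documented on this card; verify the repo id on the Hugging Face Hub before use.

# Minimal chat-inference sketch with Hugging Face transformers.
# Assumes the repo id below; a 70B model needs multiple GPUs or offloading
# (device_map="auto" requires the accelerate package).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
messages = [{"role": "user", "content": "Explain instruction tuning in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))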

Key Specifications

Parameters
70.0B
Context
-
Release Date
October 1, 2024
Average Score
67.9%

Timeline

Key dates in the model's history
Announcement
October 1, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
70.0B
Training Tokens
-
Knowledge Cutoff
December 1, 2023
Family
-
Fine-tuned from
llama-3.1-70b-instruct
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
HellaSwag
Self-reported
85.6%
MMLU
Self-reported
80.2%
TruthfulQA
Self-reported
58.6%
Winogrande
Self-reported
84.5%

Mathematics

Mathematical problems and computations
GSM8k
Self-reported
91.4%

Other Tests

Specialized benchmarks
ARC-C
Self-reported
69.2%
GSM8K Chat
Self-reported
81.9%
Instruct HumanEval
Self-reported
73.8%
MMLU Chat
Self-reported
80.6%
MT-Bench
Self-reported
9.0 (out of 10)
XLSum English
Self-reported
31.6%
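
For readers who want to spot-check a self-reported number, the sketch below uses EleutherAI's lm-evaluation-harness (pip install lm-eval). The task name, few-shot count, and model arguments are assumptions; the card does not state the original evaluation settings, so exact reproduction is not guaranteed.

# Hypothetical sketch: re-running one benchmark with lm-evaluation-harness.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=nvidia/Llama-3.1-Nemotron-70B-Instruct-HF,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,  # assumed setting; self-reported scores may use a different prompt format
)
print(results["results"]["gsm8k"])

Because self-reported scores depend on prompt templates, few-shot counts, and scoring scripts, small deviations from the numbers above are expected.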

License & Metadata

License
Llama 3.1 Community License
Announcement Date
October 1, 2024
Last Updated
July 19, 2025

Similar Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.