
Llama 3.1 Nemotron 70B Instruct

NVIDIA

Llama 3.1 Nemotron 70B Instruct is an instruction-tuned language model from NVIDIA, built on the Llama 3.1 70B architecture with Nemotron enhancements. It excels at instruction following, conversation, reasoning, and coding, and is optimized for production deployment across enterprise use cases.
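
The snippet below is a minimal inference sketch using the Hugging Face transformers library. The checkpoint id (nvidia/Llama-3.1-Nemotron-70B-Instruct-HF) and the generation settings are assumptions for illustration, not details documented on this card; verify the repo id on the Hugging Face Hub before use.

# Minimal chat-inference sketch with Hugging Face transformers.
# Assumes the repo id below; a 70B model needs multiple GPUs or offloading
# (device_map="auto" requires the accelerate package).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
messages = [{"role": "user", "content": "Explain instruction tuning in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))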

Key Specifications

Parameters
70.0B
Context
-
Release Date
October 1, 2024
Average Score
67.9%

Timeline

Key dates in the model's history
Announcement
October 1, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
70.0B
Training Tokens
-
Knowledge Cutoff
December 1, 2023
Family
-
Fine-tuned from
llama-3.1-70b-instruct
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
HellaSwag
Self-reported
85.6%
MMLU
Self-reported
80.2%
TruthfulQA
Self-reported
58.6%
Winogrande
Self-reported
84.5%

Mathematics

Mathematical problems and computations
GSM8k
Self-reported
91.4%

Other Tests

Specialized benchmarks
ARC-C
Self-reported
69.2%
GSM8K Chat
Self-reported
81.9%
Instruct HumanEval
Self-reported
73.8%
MMLU Chat
Self-reported
80.6%
MT-Bench
Self-reported
9.0 (out of 10)
XLSum English
Self-reported
31.6%
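
For readers who want to spot-check a self-reported number, the sketch below uses EleutherAI's lm-evaluation-harness (pip install lm-eval). The task name, few-shot count, and model arguments are assumptions; the card does not state the original evaluation settings, so exact reproduction is not guaranteed.

# Hypothetical sketch: re-running one benchmark with lm-evaluation-harness.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=nvidia/Llama-3.1-Nemotron-70B-Instruct-HF,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,  # assumed setting; self-reported scores may use a different prompt format
)
print(results["results"]["gsm8k"])

Because self-reported scores depend on prompt templates, few-shot counts, and scoring scripts, small deviations from the numbers above are expected.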

License & Metadata

License
Llama 3.1 Community License
Announcement Date
October 1, 2024
Last Updated
July 19, 2025

Similar Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.