Llama 3.1 Nemotron 70B Instruct
Llama 3.1 Nemotron 70B Instruct is an instruction-tuned language model by NVIDIA, built on the Llama 3.1 70B architecture with Nemotron alignment enhancements. It excels at instruction following, conversation, reasoning, and coding, and is optimized for production deployment across enterprise use cases.
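For readers who want to try the model, a minimal chat-inference sketch with Hugging Face Transformers is shown below. The repository ID nvidia/Llama-3.1-Nemotron-70B-Instruct-HF, the bfloat16/device settings, and the example prompt are assumptions for illustration; at 70B parameters the model realistically needs multiple GPUs or quantization.

```python
# Minimal chat-style inference sketch (assumed HF repo ID and hardware settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # 70B parameters: plan for multi-GPU or quantization
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain the difference between a list and a tuple in Python."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```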
Key Specifications
Parameters
70.0B
Context
-
Release Date
October 1, 2024
Average Score
67.9%
Timeline
Key dates in the model's history
Announcement
October 1, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
70.0B
Training Tokens
-
Knowledge Cutoff
December 1, 2023
Family
-
Fine-tuned from
llama-3.1-70b-instruct
Capabilities
Multimodal, ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
HellaSwag
- • Self-reported
MMLU
- • Self-reported
TruthfulQA
- • Self-reported
Winogrande
- • Self-reported
Mathematics
Mathematical problems and computations
GSM8K
- • Self-reported
Other Tests
Specialized benchmarks
ARC-C
- • Self-reported
GSM8K Chat
- • Self-reported
Instruct HumanEval
- • Self-reported
MMLU Chat
- • Self-reported
MT-Bench
- • Self-reported
XLSum English
- • Self-reported
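All scores above are self-reported by the model provider. One way to re-check a subset locally is EleutherAI's lm-evaluation-harness; the sketch below is an assumption-laden example, since task names, few-shot settings, and harness versions all affect the numbers, and this is not necessarily how the self-reported figures were produced.

```python
# Sketch: re-running a few of the listed benchmarks with EleutherAI's
# lm-evaluation-harness. Repo ID, task names, and settings are assumptions
# and may differ across harness versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=nvidia/Llama-3.1-Nemotron-70B-Instruct-HF,dtype=bfloat16",
    tasks=["mmlu", "gsm8k", "hellaswag", "winogrande", "arc_challenge"],
    batch_size="auto",
)

# Print the per-task metric dictionaries (accuracy, exact match, etc.).
for task, metrics in results["results"].items():
    print(task, metrics)
```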
License & Metadata
License
Llama 3.1 Community License
Announcement Date
October 1, 2024
Last Updated
July 19, 2025
Similar Models
Nemotron 3 Nano (30B A3B)
NVIDIA
32.0B
Best score: 0.8 (GPQA)
Released: Dec 2025
Price: $0.06/1M tokens
Llama-3.3 Nemotron Super 49B v1
NVIDIA
49.9B
Best score: 0.7 (GPQA)
Released: Mar 2025
Nemotron 3 Super (120B A12B)
NVIDIA
120.0B
Best score: 0.8 (GPQA)
Released: Mar 2026
Llama 3.1 Nemotron Ultra 253B v1
NVIDIA
253.0B
Best score: 0.8 (GPQA)
Released: Apr 2025
GLM-4.7-Flash
Zhipu AI
30.0B
Best score: 0.8 (TAU)
Released: Jan 2026
Price: $0.07/1M tokens
ERNIE 4.5
Baidu
21.0B
Best score: 0.7 (GPQA)
Released: Jun 2025
Llama 3.3 70B Instruct
Meta
70.0B
Best score: 0.9 (HumanEval)
Released: Dec 2024
Price: $0.88/1M tokens
LongCat-Flash-Lite
Meituan
68.5B
Best score: 0.9 (MMLU)
Released: Feb 2026
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.
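As a rough illustration of that similarity idea (not the catalog's actual algorithm), the hypothetical sketch below scores candidates against this model using invented weights over developer, modality, parameter count, and best benchmark score; the field names, weights, and reference score are assumptions for illustration only.

```python
# Hypothetical similarity ranking over catalog entries.
# Field names and weights are invented; this is not the catalog's published logic.
from dataclasses import dataclass

@dataclass
class Entry:
    name: str
    developer: str
    multimodal: bool
    params_b: float
    best_score: float  # best benchmark score on a 0-1 scale

def similarity(a: Entry, b: Entry) -> float:
    dev = 1.0 if a.developer == b.developer else 0.0
    modal = 1.0 if a.multimodal == b.multimodal else 0.0
    # Compare parameter counts on a ratio scale so 49.9B vs 70B still counts as close.
    size = min(a.params_b, b.params_b) / max(a.params_b, b.params_b)
    score = 1.0 - abs(a.best_score - b.best_score)
    return 0.3 * dev + 0.1 * modal + 0.35 * size + 0.25 * score  # hypothetical weights

# Reference best_score is a placeholder; candidate values come from the list above.
reference = Entry("Llama 3.1 Nemotron 70B Instruct", "NVIDIA", False, 70.0, 0.8)
candidates = [
    Entry("Llama-3.3 Nemotron Super 49B v1", "NVIDIA", False, 49.9, 0.7),
    Entry("Llama 3.3 70B Instruct", "Meta", False, 70.0, 0.9),
]
for c in sorted(candidates, key=lambda c: similarity(reference, c), reverse=True):
    print(f"{c.name}: {similarity(reference, c):.2f}")
```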