
Llama 3.1 Nemotron Ultra 253B v1

NVIDIA

Llama 3.1 Nemotron Ultra 253B v1 is NVIDIA's largest Nemotron model, built on the Llama 3.1 architecture with 253 billion parameters. It delivers top-tier performance across reasoning, coding, mathematics, and expert-knowledge tasks, and is designed for the most demanding research and enterprise workloads.

Key Specifications

Parameters
253.0B
Context
-
Release Date
April 7, 2025
Average Score
79.2%

Timeline

Key dates in the model's history
Announcement
April 7, 2025
Last Update
July 19, 2025
Today
March 25, 2026

Technical Specifications

Parameters
253.0B
Training Tokens
-
Knowledge Cutoff
December 1, 2023
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

Reasoning

Logical reasoning and analysis
GPQA
Pass@1, Reasoning (Self-reported)
76.0%

Other Tests

Specialized benchmarks
AIME 2025
Pass@1 (Self-reported)
72.5%
BFCL v2
Score (Self-reported)
74.1%
IFEval
Strict Accuracy (Self-reported)
89.5%
LiveCodeBench
Pass@1 (Self-reported)
66.3%
MATH-500
Pass@1, Reasoning (Self-reported)
97.0%
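The Average Score shown above (79.2%) is consistent with a simple unweighted mean of the six self-reported benchmark scores listed in this section. A minimal sketch of that calculation, assuming the page's average is computed exactly this way:

```python
# Self-reported benchmark scores from the table above (percent).
scores = {
    "GPQA": 76.0,
    "AIME 2025": 72.5,
    "BFCL v2": 74.1,
    "IFEval": 89.5,
    "LiveCodeBench": 66.3,
    "MATH-500": 97.0,
}

# Unweighted arithmetic mean, rounded to one decimal place
# as displayed on the page.
average = round(sum(scores.values()) / len(scores), 1)
print(f"Average score: {average}%")  # → 79.2%
```

This matches the displayed 79.2%, suggesting the catalog averages all listed benchmarks with equal weight.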

License & Metadata

License
Llama 3.1 Community License
Announcement Date
April 7, 2025
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.