Key Specifications
Parameters
405.0B
Context
128.0K
Release Date
July 23, 2024
Average Score
79.2%
Timeline
Key dates in the model's history
Announcement
July 23, 2024
Last Update
July 19, 2025
Today
March 25, 2026
Technical Specifications
Parameters
405.0B
Training Tokens
15.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$3.50
Output (per 1M tokens)
$3.50
Max Input Tokens
128.0K
Max Output Tokens
128.0K
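At these rates ($3.50 per 1M tokens for both input and output), per-request cost is simple arithmetic; the token counts in the sketch below are illustrative, not from the source:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float = 3.50,
                 output_price_per_m: float = 3.50) -> float:
    """Estimate USD cost of one request at per-1M-token rates."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# e.g. a 2,000-token prompt with a 500-token completion:
# 2,500 tokens * $3.50 / 1M = $0.00875
cost = request_cost(2_000, 500)
```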
Supported Features
Function Calling • Structured Output • Code Execution • Web Search • Batch Inference • Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
5-shot, macro_avg/acc • Self-reported
Programming
Programming skills tests
HumanEval
0-shot, pass@1 • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
8-shot, CoT, em_maj1@1. This setup uses chain-of-thought (CoT) prompting with majority voting over multiple samples. Specifically: 1. 8 CoT solutions are sampled with different reasoning paths. 2. Each sample reasons through the task and produces an answer. 3. A final answer is extracted from each sample. 4. The most common answer is chosen. The benefit of this approach is that it can cancel out errors that appear in individual samples: by taking the most common answer, the chance of obtaining the correct one increases. • Self-reported
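A minimal sketch of the majority-vote aggregation described above, assuming the sampling and answer-extraction steps have already happened; the sampled answers are invented for illustration:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most common extracted final answer among CoT samples."""
    return Counter(answers).most_common(1)[0][0]

# final answers extracted from 8 sampled chains; the majority wins:
samples = ["42", "42", "41", "42", "40", "42", "42", "41"]
assert majority_vote(samples) == "42"
```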
MATH
0-shot, CoT, final exact match • Self-reported
Reasoning
Logical reasoning and analysis
DROP
0-shot — a method in which the model solves a new task without seeing a single example of a correct solution, relying exclusively on the knowledge and abilities acquired during pretraining. For example, given a math problem, the model must solve it using only its existing knowledge of mathematics, with no access to worked solutions of similar problems. This probes the model's base capabilities rather than its ability to adapt to a specific prompt format. 0-shot testing is considered one of the strictest ways to evaluate a model's understanding, since it requires applying general knowledge to a specific task without additional prompts or training. • Self-reported
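The 0-shot vs. few-shot distinction comes down to how the prompt is built; a hypothetical sketch (the Q/A template here is an assumption for illustration, not the benchmark's actual format):

```python
def build_prompt(question: str, examples=None) -> str:
    """0-shot: just the question; few-shot: worked examples first."""
    parts = []
    for q, a in (examples or []):   # empty in the 0-shot case
        parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

zero_shot = build_prompt("What is 17 * 3?")
few_shot = build_prompt("What is 17 * 3?",
                        examples=[("What is 2 * 4?", "8")])
```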
GPQA
The prompt does not specify whether or how the model should use tools. You directly ask the model to perform the task, and it answers using whatever capabilities it has. This approach is useful for gauging a model's general capabilities, though results can vary depending on which tools, if any, end up being used. • Self-reported
Other Tests
Specialized benchmarks
API-Bank
0-shot: the model's ability to perform a task without any examples, instructions, or explanations. The model receives only the question or problem. For example, when solving a math task the prompt might be just the question ("Find the number n such that n² - 20n + 96 is …"), with no reference solutions or samples provided. The model must independently devise an approach and carry out all the necessary steps. • Self-reported
ARC-C
0-shot testing: the model is given a prompt or query without any prior examples or additional context, and its output is scored. This evaluates the model's ability to apply its pretrained knowledge to a new task without adaptation. The 0-shot setting is especially useful for assessing how well a model transfers its knowledge across contexts, and because no per-task examples are needed, it can give a more realistic picture of performance in real-world conditions, where users rarely provide examples. However, 0-shot results may not fully reflect a model's potential, especially on complex or specialized tasks where examples would be useful. • Self-reported
BFCL
0-shot means the model attempts the task directly from the instructions, without additional examples. This measures the model's ability to understand a task immediately, without demonstrations of how it should be done. The contrast is few-shot, where the model is shown one or several worked examples before it performs the task. • Self-reported
Gorilla Benchmark API Bench
0-shot. No demonstrations or training examples are provided for the task; instead, the model must rely on knowledge acquired during pretraining to generate solutions. • Self-reported
IFEval
# Standard benchmarking ## What it is The simplest and fastest way to evaluate a model, used to establish baseline performance scores and to compare results across models. Testing consists of a fixed set of queries and tasks, often from a specific domain, scored against predefined criteria. ## When to use it - When you need to quickly compare the performance of several models before making a decision. - To establish baseline scores on which deeper analysis can build. - To surface obvious problems before more thorough evaluation. ## Strengths - Cheap and simple to run. - Enables direct comparison of different models. - Produces performance data with minimal setup. ## Limitations - Gives limited insight into a model's real capabilities. - Can be misleading on complex tasks. - Misses nuances that can be critical for specific use cases. ## When it falls short Standard testing is insufficient when you need a deeper understanding of a model's capabilities, especially when evaluating: - reasoning and multi-step processes - answers to varied queries - robustness to different interpretations of a query in complex tasks ## Methods - Running standard test suites such as MMLU, GSM8K, HumanEval - Comparing outputs against reference answers - Measuring accuracy on simple instruction following - Verifying baseline behavior ## Improving it Standard testing can be improved by: - adding more diverse tasks covering different domains and skills - adding harder examples that stress the model's capabilities - better tooling to process and score results - tracking results over time to identify regressions • Self-reported
MBPP EvalPlus
0-shot, base, pass@1 • Self-reported
MMLU (CoT)
0-shot, CoT, macro_avg/acc • Self-reported
MMLU-Pro
5-shot, CoT, micro_avg/acc_char • Self-reported
Multilingual MGSM (CoT)
0-shot, CoT, em. The model is evaluated in chain-of-thought (CoT) mode. In the 0-shot setting we prompt it to reason "step by step", without any examples. "em" stands for "exact match": the model's answer is scored correct only if it exactly matches the reference answer for the task. Note that in some cases a model can arrive at the correct value but still be marked wrong due to differences in formatting, so em can be a strict evaluation. • Self-reported
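A minimal sketch of exact-match scoring; the normalization shown (trim and lowercase) is an assumption for illustration, since each benchmark defines its own rules:

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Strict em scoring after light normalization (whitespace, case)."""
    return prediction.strip().lower() == reference.strip().lower()

assert exact_match(" 72 ", "72")
assert not exact_match("72.0", "72")   # formatting differences still fail
```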
Multipl-E HumanEval
0-shot, pass@1. This metric measures how often the model solves a task on its first attempt, without examples. The model receives the task and must provide a correct answer on the first try; for problems with a verifiable answer, as in mathematics, the task counts as "passed" only if the first answer is correct. pass@1 is the proportion of tasks solved correctly on the first attempt. "0-shot" means the model is given no example solutions to similar tasks: unlike the few-shot approach, where the model sees several worked examples before answering, in the 0-shot setting the model must rely only on knowledge acquired during pretraining. pass@1 is a strict evaluation, since it allows no retries or error correction, which makes it a good measure of reliability where first-attempt accuracy is required. • Self-reported
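A minimal sketch of the pass@1 computation for the single-attempt-per-task case; the outcome list is invented for illustration:

```python
def pass_at_1(first_attempt_results: list[bool]) -> float:
    """pass@1: fraction of tasks solved on the single first attempt."""
    return sum(first_attempt_results) / len(first_attempt_results)

# first-attempt outcomes across 4 tasks -> 3/4 solved
score = pass_at_1([True, True, False, True])
```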
Multipl-E MBPP
0-shot, pass@1 • Self-reported
Nexus
0-shot, macro_avg/acc • Self-reported
License & Metadata
License
llama_3_1_community_license
Announcement Date
July 23, 2024
Last Updated
July 19, 2025
Similar Models
Llama 3.3 70B Instruct
Meta
70.0B
Best score: 0.9 (HumanEval)
Released: Dec 2024
Price: $0.88/1M tokens
Llama 3.1 70B Instruct
Meta
70.0B
Best score: 0.9 (ARC)
Released: Jul 2024
Price: $0.89/1M tokens
Llama 4 Maverick
Meta
400.0B (multimodal)
Best score: 0.9 (MMLU)
Released: Apr 2025
Price: $0.27/1M tokens
Kimi K2 0905
Moonshot AI
1.0T
Best score: 0.9 (HumanEval)
Released: Sep 2025
Price: $0.60/1M tokens
Llama 3.1 Nemotron Ultra 253B v1
NVIDIA
253.0B
Best score: 0.8 (GPQA)
Released: Apr 2025
GLM-4.7
Zhipu AI
358.0B
Best score: 0.9 (TAU)
Released: Dec 2025
Price: $0.60/1M tokens
Qwen3-Coder 480B A35B Instruct
Alibaba
480.0B
Best score: 0.8 (TAU)
Released: Jan 2025
DeepSeek-V3
DeepSeek
671.0B
Best score: 0.9 (MMLU)
Released: Dec 2024
Price: $0.27/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.