Key Specifications
Parameters
8.0B
Context
131.1K
Release Date
July 23, 2024
Average Score
61.3%
Timeline
Key dates in the model's history
Announcement
July 23, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
8.0B
Training Tokens
15.0T tokens
Knowledge Cutoff
December 31, 2023
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.20
Output (per 1M tokens)
$0.20
Max Input Tokens
131.1K
Max Output Tokens
131.1K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
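As a rough illustration of the function-calling feature listed above, the payload below sketches the shape of a tool-enabled chat request to a hypothetical OpenAI-compatible endpoint serving this model. The model id and the `get_weather` tool schema are assumptions for the example, not part of this card.

```python
# Sketch of a function-calling request payload (OpenAI-compatible schema).
# The deployment id and tool definition are illustrative assumptions.
import json

payload = {
    "model": "llama-3.1-8b-instruct",  # hypothetical deployment id
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}
print(json.dumps(payload, indent=2))
```

The model is expected to reply with a `tool_calls` entry naming `get_weather` and its arguments rather than with free text.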
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
5-shot • Self-reported
Programming
Programming skills tests
HumanEval
0-shot: the model receives the question and must solve it directly, without any examples or guidance on how to approach the task. This is the simplest prompting setup, in which the model gets a direct query and must answer immediately; it measures the model's basic ability to solve tasks without additional help or context • Self-reported
Reasoning
Logical reasoning and analysis
DROP
prompt • Self-reported
GPQA
Visual understanding is increasingly valued across AI. Although LLMs such as GPT excel at language, they were not built for visual information. Multimodal systems such as GPT-4 with Vision and Claude 3, however, can process and understand both images and text, handling tasks that range from interpreting images to describing complex visual scenes combined with text. To evaluate a model's visual capabilities we check several key aspects: 1. Accurate description of images 2. Text recognition (OCR) 3. Analysis of charts and data 4. Understanding of complex visual scenes 5. Combining visual and textual information. This evaluation shows how well a model can "see" and interpret images, opening new applications for AI in various fields • Self-reported
Other Tests
Specialized benchmarks
API-Bank
0-shot: the model receives the task without any preliminary examples, instructions, or context, and must answer relying solely on its training and the information given in the task itself. Used to measure the model's baseline abilities and limitations • Self-reported
ARC-C
In this method we give the language model the task directly, without any examples or instructions on how to solve it. For multiple-choice tasks we provide the question and the answer options, and ask the model to choose an answer and explain its choice. For open-ended tasks we simply ask the model to answer the question. The answer is taken directly from the model's response. Prompt for multiple-choice tasks: ``` Please answer the following question and explain your reasoning. <question> (A) <option A> (B) <option B> ... ``` Prompt for open-ended tasks: ``` Please answer the following question. <question> ``` • Self-reported
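The multiple-choice prompt template above can be sketched as a small helper; the function name and the sample question are illustrative, not part of any official evaluation harness.

```python
# Minimal sketch of a 0-shot multiple-choice prompt builder.
# build_mc_prompt and the sample question are illustrative only.

def build_mc_prompt(question: str, options: list[str]) -> str:
    """Format a multiple-choice question as a single 0-shot prompt."""
    lines = [
        "Please answer the following question and explain your reasoning.",
        question,
    ]
    for letter, option in zip("ABCDEFGH", options):
        lines.append(f"({letter}) {option}")
    return "\n".join(lines)

prompt = build_mc_prompt(
    "Which planet is closest to the Sun?",
    ["Venus", "Mercury", "Earth"],
)
print(prompt)
```

The model's answer letter is then parsed out of its free-text response and compared against the gold label.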
BFCL
In our research we evaluate the model in a 0-shot setting: it is given the task without any preliminary examples or instructions about the answer format. This is the most demanding setup, since the model must understand both the task and the expected answer relying only on its pretrained knowledge, and it better reflects real usage, where users often ask questions without providing sample answers. We analyze: 1. the model's ability to correctly interpret queries 2. the quality of explanations given without a preceding example 3. answer accuracy without additional context 4. robustness across varied questions. Evaluating 0-shot performance is especially important for understanding how the model behaves in real conditions, where users rarely provide sample answers • Self-reported
Gorilla Benchmark API Bench
0-shot: in this mode the model answers the question directly, without any additional instructions or guidance on how to reason. This corresponds to everyday use, where the user simply asks a question and the model gives an answer without extra prompting. For example, given the query "Solve the equation x² + 5x + 6 = 0", the system simply solves the equation directly. This is the basic mode for most interactions with LLMs and measures the model's abilities without any additional reasoning scaffolding • Self-reported
GSM-8K (CoT)
8-shot • Self-reported
IFEval
Self-reported
MATH (CoT)
0-shot: in the context of large language models (LLMs), "0-shot" refers to the model's ability to perform a task without any examples. The model must rely solely on knowledge acquired during pretraining to understand the task and generate an answer. In the 0-shot setting the user simply describes the task or question without providing samples of what the answer should look like. This contrasts with the few-shot approach, in which the user supplies one or more examples demonstrating the format or style of reasoning. 0-shot testing is a strict check of the model's task understanding and of its ability to apply its knowledge to new problems without additional context or examples. It is also the most common way of interacting with LLMs in real-world use • Self-reported
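The 0-shot vs. few-shot contrast described above can be illustrated with a minimal prompt builder; the arithmetic demonstrations and the Question/Answer layout are hypothetical, chosen only to show how the two prompt styles differ.

```python
# Illustrative contrast between 0-shot and few-shot prompting.
# The demo problems and format are hypothetical.

def zero_shot(task: str) -> str:
    """0-shot: the task alone, no worked examples."""
    return f"Question: {task}\nAnswer:"

def few_shot(task: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot: prepend (question, answer) demonstrations before the task."""
    demos = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return f"{demos}\n\nQuestion: {task}\nAnswer:"

demos = [("2 + 2 = ?", "4"), ("10 - 3 = ?", "7")]
print(zero_shot("5 * 6 = ?"))
print(few_shot("5 * 6 = ?", demos))
```

The few-shot variant shows the model the expected answer format before the real task, which is exactly the extra context the 0-shot setting withholds.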
MBPP EvalPlus (base)
## Evaluation without examples
Evaluation without examples (0-shot) is an approach in which the LLM solves a task without any preliminary examples or samples. The model must use only the instructions in the prompt and its pretrained knowledge to form an answer.
### Application
0-shot evaluation is usually applied for:
- Measuring a model's basic capabilities without additional help
- Assessing a model's ability to understand and follow instructions
- Probing knowledge acquired during pretraining
- Establishing a baseline level of performance for comparison with other prompting methods
### Advantages
- Simplicity: requires no example creation
- Reflects real usage scenarios, where queries test the model's internal knowledge rather than its ability to copy templates
### Limitations
- Generally gives weaker results compared with few-shot methods
- The model may misunderstand the task without examples
- Harder for complex or unfamiliar tasks
### Example query
```
Solve the task: find the value of x in 3x + 7 = 22
```
This query contains no examples of what the answer should look like or which solution steps to follow • Self-reported
MMLU (CoT)
The standard 0-shot setting means the model performs the task without examples. In many cases 0-shot evaluation consists simply of asking the model for an answer, often for tasks with fixed answer choices. For multiple-choice tasks the model can complete a 0-shot task simply by selecting the correct answer; in more complex tasks it may generate reasoning leading to the answer. In open-ended tasks the model must not only generate the answer but also determine its format. Sometimes additional instructions specify the answer format; in other cases the model must infer it from the task itself. Note that there are also cases where the model must itself decide which of several candidate answers is correct • Self-reported
MMLU-Pro
5-shot • Self-reported
Multilingual MGSM (CoT)
The 0-shot method means you simply ask the LLM a question and immediately take its answer. This is the cheapest approach from a usage point of view and very convenient for computing scores over large question sets. However, it may not reveal the model's full capabilities, since it does not let the model refine and correct its answers. This method demonstrates baseline performance compared with approaches that allow the model to revise its answers, try different problem-solving strategies, or retrieve information. Nonetheless, it allows quick comparison of the base performance of different models, especially when many questions must be evaluated at once • Self-reported
Multipl-E HumanEval
The "single task" method: in this method we give the model one task, with no examples or instructions on how to solve it. This is the standard way of evaluating LLMs in benchmarks. Example query: "If I am traveling at 50, how long will it take me to reach 450?" This is the basic approach to model evaluation, giving a picture of how well the model "understands" a task without additional context. When it is most effective: this method works well for simple tasks, or when the model has already seen the specific task type in its training data. Disadvantages: for more complex tasks, or ones requiring novel approaches, a bare task statement is often not enough; the model may not understand the expected format or may interpret the task incorrectly • Self-reported
Multipl-E MBPP
0-shot: in this mode we simply ask the model to answer the question directly, without any additional instructions, using short direct factual questions. Such queries assess the model's basic knowledge but reveal little about its reasoning. 0-shot testing usually gives good results on simple questions but struggles with complex tasks • Self-reported
Nexus
0-shot means using an LLM to solve new tasks without giving it examples of how to perform the task or additional task-specific instructions. This approach matters for testing, since it evaluates how well the model can independently interpret a task and apply its knowledge, which is closer to how models are used in the real world, and it shows the model's baseline performance in new situations. For example, in a mathematical task, 0-shot means the model is simply given the problem, such as "Solve the equation: 2x + 5 = 15", without example solutions of similar problems or solution instructions • Self-reported
License & Metadata
License
llama_3_1_community_license
Announcement Date
July 23, 2024
Last Updated
July 19, 2025
Similar Models
Llama 3.2 3B Instruct
Meta
3.2B
Best score: 0.8 (ARC)
Released: Sep 2024
Price: $0.01/1M tokens
Gemma 2 9B
9.2B
Best score: 0.7 (MMLU)
Released: Jun 2024
Phi 4 Mini
Microsoft
3.8B
Best score: 0.8 (ARC)
Released: Feb 2025
Phi-3.5-mini-instruct
Microsoft
3.8B
Best score: 0.8 (ARC)
Released: Aug 2024
Price: $0.10/1M tokens
Qwen2.5 7B Instruct
Alibaba
7.6B
Best score: 0.8 (HumanEval)
Released: Sep 2024
Price: $0.30/1M tokens
Qwen2 7B Instruct
Alibaba
7.6B
Best score: 0.8 (HumanEval)
Released: Jul 2024
Llama 3.1 405B Instruct
Meta
405.0B
Best score: 1.0 (ARC)
Released: Jul 2024
Price: $3.50/1M tokens
Llama 3.1 70B Instruct
Meta
70.0B
Best score: 0.9 (ARC)
Released: Jul 2024
Price: $0.89/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.