
GPT-4o mini

Multimodal
OpenAI

GPT-4o mini is OpenAI's latest cost-effective small model, designed to make AI more accessible and affordable. It excels at text intelligence and multimodal reasoning, surpassing previous models such as GPT-3.5 Turbo. With a 128K-token context window and support for text and vision input, it enables low-cost real-time applications such as customer-support chatbots. Priced at 15 cents per million input tokens and 60 cents per million output tokens, it is significantly cheaper than its predecessors. Safety is a priority, with built-in safety measures and improved resistance to security threats.
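The per-token prices above translate directly into per-request costs. A minimal sketch of that arithmetic, using the listed rates ($0.15 per 1M input tokens, $0.60 per 1M output tokens); the function name is illustrative:

```python
# Cost estimator using the listed GPT-4o mini prices.
INPUT_PRICE_PER_M = 0.15   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.60  # USD per 1M output tokens

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost of one request in US dollars."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A chatbot turn with a 2,000-token prompt and a 500-token reply
# costs 0.0003 + 0.0003 = $0.0006:
print(f"${estimate_cost_usd(2_000, 500):.6f}")
```

At these rates, a million such chatbot turns would cost on the order of $600, which is the kind of economics the model card's "low-cost real-time applications" claim refers to.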

Key Specifications

Parameters
-
Context
128.0K
Release Date
July 18, 2024
Average Score
63.5%

Timeline

Key dates in the model's history
Announcement
July 18, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
-
Training Tokens
-
Knowledge Cutoff
October 1, 2023
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.15
Output (per 1M tokens)
$0.60
Max Input Tokens
128.0K
Max Output Tokens
16.4K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
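Function calling, the first feature listed, lets the model return a structured request to invoke a developer-defined tool instead of free text. A minimal sketch, assuming the official `openai` Python SDK; the `get_weather` tool and its parameters are hypothetical examples, not part of the API:

```python
# Function-calling sketch for gpt-4o-mini. The tool schema follows the
# JSON Schema format the chat-completions `tools` parameter expects.
import json
import os

# Hypothetical tool the model may choose to call.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def build_request(user_message: str) -> dict:
    """Assemble a chat-completion payload for gpt-4o-mini."""
    return {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": user_message}],
        "tools": [WEATHER_TOOL],
    }

payload = build_request("What's the weather in Oslo?")
print(json.dumps(payload, indent=2))

# With an API key configured, the request could be sent like this:
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(**payload)
    print(response.choices[0].message)
```

When the model decides to use the tool, the response carries the function name and JSON arguments rather than prose; the application executes the function and feeds the result back in a follow-up message.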

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
Accuracy: The percentage of answers on the benchmark that the model got correct. Self-reported
82.0%

Programming

Programming skills tests
HumanEval
Pass@1: The proportion of tasks the model solves correctly on its first attempt. For tasks with a definitive answer (for example, mathematical problems), a solution counts as correct if the final answer matches; for programming tasks, it counts as correct if the code passes all test cases. Pass@1 is especially relevant for applications where users rely on a single response, since it reflects the model's reliability when there is no opportunity to pick the best of several attempts. It can understate a model's capabilities, however, because a model may produce a correct answer with some probability even when that probability is below 100%. Self-reported
87.2%
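The metric described above generalizes to pass@k. A common unbiased estimator (popularized by the HumanEval evaluation methodology) computes, from n sampled solutions of which c are correct, the probability that at least one of k drawn samples passes; a minimal sketch:

```python
# Unbiased pass@k estimator: pass@k = 1 - C(n-c, k) / C(n, k),
# where n samples were drawn and c of them were correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 5, 1))  # 0.5 — pass@1 reduces to the raw success rate c/n
print(pass_at_k(10, 5, 2))  # higher: two tries give two chances to pass
```

For k = 1 the formula collapses to c/n, which is why a reported Pass@1 score is simply the fraction of first attempts that succeed.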
SWE-Bench Verified
Resolved: The percentage of real-world GitHub issues in the benchmark that the model resolves, judged by whether its generated patch passes the repository's test suite. Self-reported
8.7%

Mathematics

Mathematical problems and computations
MATH
Accuracy: The percentage of competition-level mathematics problems solved correctly, judged by the final answer. Self-reported
70.2%
MGSM
Accuracy: The percentage of grade-school math word problems, translated into multiple languages, that the model solves correctly. Self-reported
87.0%

Reasoning

Logical reasoning and analysis
DROP
F1 Score: The harmonic mean of precision and recall, balancing the model's ability to return correct answers (precision) with its ability to find all correct answers (recall). Because the harmonic mean is pulled toward the lower of the two values, a model must do well on both to score highly, which makes F1 especially informative when false positives and false negatives both carry a cost. It is computed as F1 = 2 × (precision × recall) / (precision + recall), where precision = TP / (TP + FP) and recall = TP / (TP + FN). Self-reported
79.7%
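The F1 formula above can be sketched directly from the counts of true positives, false positives, and false negatives; the numbers in the example are illustrative:

```python
# F1 as the harmonic mean of precision and recall.
def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)  # TP / (TP + FP)
    recall = tp / (tp + fn)     # TP / (TP + FN)
    return 2 * precision * recall / (precision + recall)

# 8 correct answers, 2 spurious ones, 2 missed ones:
# precision = recall = 0.8, so F1 ≈ 0.8
print(f1_score(tp=8, fp=2, fn=2))
```

Note that F1 ignores true negatives entirely, which is why it suits extraction-style benchmarks like DROP where the space of "things not answered" is unbounded.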
GPQA
Accuracy: The percentage of graduate-level, expert-written science questions the model answers correctly. Self-reported
40.2%

Multimodal

Working with images and visual data
MathVista
Accuracy: The percentage of visual mathematics problems for which the model produces the correct final answer. Self-reported
56.7%
MMMU
Accuracy: The percentage of college-level multimodal questions, spanning charts, diagrams, and images across many disciplines, that the model answers correctly. Self-reported
59.4%

License & Metadata

License
Proprietary
Announcement Date
July 18, 2024
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.