
GPT-4o mini

Multimodal
OpenAI

GPT-4o mini is OpenAI's latest cost-effective small model, designed to make AI more accessible and affordable. It excels at text intelligence and multimodal reasoning, surpassing previous models such as GPT-3.5 Turbo. With a 128K-token context window and support for text and vision input, it enables low-cost real-time applications such as customer-support chatbots. Priced at 15 cents per million input tokens and 60 cents per million output tokens, it is significantly cheaper than its predecessors. Safety is a priority, with built-in safety measures and improved resistance to security threats.
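The per-token prices above translate directly into per-request costs. A minimal sketch of that arithmetic, using the listed rates ($0.15 per 1M input tokens, $0.60 per 1M output tokens); the function name is illustrative:

```python
# Cost estimator using the listed GPT-4o mini prices.
INPUT_PRICE_PER_M = 0.15   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.60  # USD per 1M output tokens

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost of one request in US dollars."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A chatbot turn with a 2,000-token prompt and a 500-token reply
# costs 0.0003 + 0.0003 = $0.0006:
print(f"${estimate_cost_usd(2_000, 500):.6f}")
```

At these rates, a million such chatbot turns would cost on the order of $600, which is the kind of economics the model card's "low-cost real-time applications" claim refers to.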

Key Specifications

Parameters
-
Context
128.0K
Release Date
July 18, 2024
Average Score
63.5%

Timeline

Key dates in the model's history
Announcement
July 18, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
-
Training Tokens
-
Knowledge Cutoff
October 1, 2023
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.15
Output (per 1M tokens)
$0.60
Max Input Tokens
128.0K
Max Output Tokens
16.4K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
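Function calling, the first feature listed, lets the model return a structured request to invoke a developer-defined tool instead of free text. A minimal sketch, assuming the official `openai` Python SDK; the `get_weather` tool and its parameters are hypothetical examples, not part of the API:

```python
# Function-calling sketch for gpt-4o-mini. The tool schema follows the
# JSON Schema format the chat-completions `tools` parameter expects.
import json
import os

# Hypothetical tool the model may choose to call.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def build_request(user_message: str) -> dict:
    """Assemble a chat-completion payload for gpt-4o-mini."""
    return {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": user_message}],
        "tools": [WEATHER_TOOL],
    }

payload = build_request("What's the weather in Oslo?")
print(json.dumps(payload, indent=2))

# With an API key configured, the request could be sent like this:
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(**payload)
    print(response.choices[0].message)
```

When the model decides to use the tool, the response carries the function name and JSON arguments rather than prose; the application executes the function and feeds the result back in a follow-up message.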

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
Accuracy: The percentage of answers on the benchmark that the model got correct. Self-reported
82.0%

Programming

Programming skills tests
HumanEval
Pass@1: The proportion of tasks the model solves correctly on its first attempt. For tasks with a definitive answer (for example, mathematical problems), a solution counts as correct if the final answer matches; for programming tasks, it counts as correct if the code passes all test cases. Pass@1 is especially relevant for applications where users rely on a single response, since it reflects the model's reliability when there is no opportunity to pick the best of several attempts. It can understate a model's capabilities, however, because a model may produce a correct answer with some probability even when that probability is below 100%. Self-reported
87.2%
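The metric described above generalizes to pass@k. A common unbiased estimator (popularized by the HumanEval evaluation methodology) computes, from n sampled solutions of which c are correct, the probability that at least one of k drawn samples passes; a minimal sketch:

```python
# Unbiased pass@k estimator: pass@k = 1 - C(n-c, k) / C(n, k),
# where n samples were drawn and c of them were correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 5, 1))  # 0.5 — pass@1 reduces to the raw success rate c/n
print(pass_at_k(10, 5, 2))  # higher: two tries give two chances to pass
```

For k = 1 the formula collapses to c/n, which is why a reported Pass@1 score is simply the fraction of first attempts that succeed.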
SWE-Bench Verified
Resolved: The percentage of real-world GitHub issues in the benchmark that the model resolves, judged by whether its generated patch passes the repository's test suite. Self-reported
8.7%

Mathematics

Mathematical problems and computations
MATH
Accuracy: The percentage of competition-level mathematics problems solved correctly, judged by the final answer. Self-reported
70.2%
MGSM
Accuracy: The percentage of grade-school math word problems, translated into multiple languages, that the model solves correctly. Self-reported
87.0%

Reasoning

Logical reasoning and analysis
DROP
F1 Score: The harmonic mean of precision and recall, balancing the model's ability to return correct answers (precision) with its ability to find all correct answers (recall). Because the harmonic mean is pulled toward the lower of the two values, a model must do well on both to score highly, which makes F1 especially informative when false positives and false negatives both carry a cost. It is computed as F1 = 2 × (precision × recall) / (precision + recall), where precision = TP / (TP + FP) and recall = TP / (TP + FN). Self-reported
79.7%
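The F1 formula above can be sketched directly from the counts of true positives, false positives, and false negatives; the numbers in the example are illustrative:

```python
# F1 as the harmonic mean of precision and recall.
def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)  # TP / (TP + FP)
    recall = tp / (tp + fn)     # TP / (TP + FN)
    return 2 * precision * recall / (precision + recall)

# 8 correct answers, 2 spurious ones, 2 missed ones:
# precision = recall = 0.8, so F1 ≈ 0.8
print(f1_score(tp=8, fp=2, fn=2))
```

Note that F1 ignores true negatives entirely, which is why it suits extraction-style benchmarks like DROP where the space of "things not answered" is unbounded.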
GPQA
Accuracy: The percentage of graduate-level, expert-written science questions the model answers correctly. Self-reported
40.2%

Multimodal

Working with images and visual data
MathVista
Accuracy: The percentage of visual mathematics problems for which the model produces the correct final answer. Self-reported
56.7%
MMMU
Accuracy: The percentage of college-level multimodal questions, spanning charts, diagrams, and images across many disciplines, that the model answers correctly. Self-reported
59.4%

License & Metadata

License
Proprietary
Announcement Date
July 18, 2024
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.