Key Specifications
Parameters
-
Context
128.0K
Release Date
September 12, 2024
Average Score
71.9%
Timeline
Key dates in the model's history
Announcement
September 12, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$3.00
Output (per 1M tokens)
$12.00
Max Input Tokens
128.0K
Max Output Tokens
65.5K
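The listed rates ($3.00 per 1M input tokens, $12.00 per 1M output tokens) make per-request cost easy to estimate. A minimal sketch, using only the prices from this page (the token counts in the example are hypothetical):

```python
# Listed o1-mini rates from this page, in dollars per 1M tokens.
INPUT_PER_M = 3.00
OUTPUT_PER_M = 12.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed rates."""
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000

# Example: 10,000 input tokens and 2,000 output tokens:
# 10_000 * 3 / 1e6 + 2_000 * 12 / 1e6 = 0.03 + 0.024 = 0.054
print(f"${request_cost(10_000, 2_000):.3f}")  # → $0.054
```

Note that output tokens cost 4× more than input tokens, so long generations dominate the bill.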
Supported Features
Function CallingStructured OutputCode ExecutionWeb SearchBatch InferenceFine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
Chain-of-thought reasoning with a worked example. Sample model answer: for the stair-climbing problem, let f(n) be the number of ways to reach step n with jumps of 1 or 2. Base cases: f(0) = 0, f(1) = 1. For n ≥ 2, step n is reached from n-1 or n-2, so f(n) = f(n-1) + f(n-2), the Fibonacci sequence. Evaluating step by step: f(2) = 1, f(3) = 2, f(4) = 3, f(5) = 5, f(6) = 8, f(7) = 13, f(8) = 21, f(9) = 34, f(10) = 55. The answer is 55. • Self-reported
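The recurrence in the sample answer above can be checked with a few lines. A minimal sketch, using the same base cases f(0) = 0, f(1) = 1:

```python
def jumps(n: int) -> int:
    """Ways to reach step n with jumps of 1 or 2:
    f(0) = 0, f(1) = 1, f(n) = f(n-1) + f(n-2)."""
    if n == 0:
        return 0
    a, b = 0, 1  # f(0), f(1)
    for _ in range(n - 1):
        a, b = b, a + b  # advance the Fibonacci pair
    return b

print(jumps(10))  # → 55
```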
Programming
Programming skills tests
HumanEval
Accuracy Pass@1 — the percentage of tasks solved on the first attempt. One solution is generated per task and verified; if it is correct, the task counts as solved. This score is especially useful for scenarios where the model must produce a correct answer on the first try. However, it does not capture the model's ability to correct its errors over multiple attempts, which can matter when assessing programming capability. • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
0-shot Chain of Thought (Diamond). This method improves performance by structuring the model's thinking process: the model first restates the task, then reasons through various approaches in a dedicated thinking stage, and finally settles on an answer. The zero-shot (0-shot) variant elicits this chain of reasoning without any demonstration examples. It is especially useful for complex tasks that require careful analysis before answering. • Self-reported
Other Tests
Specialized benchmarks
Cybersecurity CTFs
Pass@12 accuracy. This metric measures how effectively a model solves coding tasks, counting a task as solved if at least one of 12 attempts (or some other number of attempts) is correct. It is a more lenient evaluation than first-attempt accuracy and better reflects real use, where users can request several solutions and pick one that works. Under Pass@k: the model generates n solutions per task; k of them are sampled; the task counts as solved if at least one of the k solutions works correctly. Common variants are Pass@1, Pass@10, and Pass@100. If the model solves a task with probability p on a single attempt, the probability of solving it at least once in k attempts is 1 - (1 - p)^k. • Self-reported
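The closed-form expression above can be sketched directly; the per-attempt success probability in the example is an assumed value for illustration:

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts,
    given per-attempt success probability p: 1 - (1 - p)^k."""
    return 1.0 - (1.0 - p) ** k

# Assumed example: a 20% per-attempt success rate with 12 attempts.
print(round(pass_at_k(0.2, 12), 3))  # 1 - 0.8**12 ≈ 0.931
```

This shows why Pass@12 can be far higher than Pass@1: even a modest per-attempt rate compounds quickly over independent tries.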
MATH-500
0-shot Chain of Thought • Self-reported
SuperGLUE
Evaluation on validation set • Self-reported
License & Metadata
License
proprietary
Announcement Date
September 12, 2024
Last Updated
July 19, 2025
Articles about o1-mini
Similar Models
GPT-4 Turbo
OpenAI
Best score:0.9 (HumanEval)
Released:Apr 2024
Price:$10.00/1M tokens
o1
OpenAI
Best score:0.9 (MMLU)
Released:Dec 2024
Price:$15.00/1M tokens
o1-preview
OpenAI
Best score:0.9 (MMLU)
Released:Sep 2024
Price:$15.00/1M tokens
GPT-5 Codex
OpenAI
Released:Sep 2025
Price:$2.00/1M tokens
o3-mini
OpenAI
Best score:0.9 (MMLU)
Released:Jan 2025
Price:$1.10/1M tokens
GPT-3.5 Turbo
OpenAI
Best score:0.7 (MMLU)
Released:Mar 2023
Price:$0.50/1M tokens
o3
OpenAI
Best score:0.8 (GPQA)
Released:Apr 2025
Price:$2.00/1M tokens
GPT-4.5
OpenAI
Best score:0.9 (MMLU)
Released:Feb 2025
Price:$75.00/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.
