
o1

OpenAI

A research-preview model focused on mathematical and logical reasoning, with improved performance on tasks requiring step-by-step reasoning, mathematical problem-solving, and code generation. The model extends formal reasoning capabilities while maintaining strong general abilities.

Key Specifications

Parameters
-
Context
200.0K
Release Date
December 17, 2024
Average Score
71.6%

Timeline

Key dates in the model's history
Announcement
December 17, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
-
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$15.00
Output (per 1M tokens)
$60.00
Max Input Tokens
200.0K
Max Output Tokens
100.0K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
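
As an illustration of the listed rates ($15.00 per 1M input tokens, $60.00 per 1M output tokens), here is a minimal sketch of a per-request cost estimate. The function name and example token counts are hypothetical, chosen only to demonstrate the arithmetic:

```python
# Rates taken from the pricing table above (USD per 1M tokens).
INPUT_PER_M = 15.00
OUTPUT_PER_M = 60.00

def request_cost(input_tokens, output_tokens):
    """Estimate the USD cost of a single request at the listed o1 rates."""
    return (input_tokens / 1_000_000) * INPUT_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PER_M

# e.g. a request with 10,000 input tokens and 2,000 output tokens:
print(f"${request_cost(10_000, 2_000):.2f}")  # $0.27
```

Note that reasoning models bill internal reasoning tokens as output tokens, so actual output-token counts can be substantially higher than the visible answer.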

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
pass@1 — the fraction of tasks the model answers correctly on a single attempt, with no retries, answer options, or external tools. It is typically used for tasks with one verifiable correct answer, such as math problems, and reflects how well the model answers directly on its first try. (Self-reported)
91.8%
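
The pass@1 scoring described above can be sketched in a few lines; the function name and example answers are hypothetical, for illustration only:

```python
# Minimal sketch of pass@1: each task gets exactly one model attempt,
# and the metric is the fraction of first attempts that are correct.
def pass_at_1(predictions, references):
    """predictions[i] is the model's single answer to task i."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

print(pass_at_1(["4", "9", "16"], ["4", "8", "16"]))  # 2 of 3 correct
```

Real harnesses normalize answers before comparison (or, for code benchmarks, run unit tests instead of string matching), but the averaging step is the same.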

Programming

Programming skills tests
HumanEval
pass@1 (Self-reported)
88.1%
SWE-Bench Verified
Self-reported
41.0%

Mathematics

Mathematical problems and computations
GSM8k
pass@1 (Self-reported)
97.1%
MATH
pass@1 (Self-reported)
96.4%
MGSM
pass@1 (Self-reported)
89.3%

Reasoning

Logical reasoning and analysis
GPQA
accuracy (Self-reported)
78.0%

Multimodal

Working with images and visual data
MathVista
pass@1 (Self-reported)
71.8%
MMMU
pass@1 (Self-reported)
77.6%

Other Tests

Specialized benchmarks
AIME 2024
accuracy (Self-reported)
74.3%
FrontierMath
pass@1 (Self-reported)
5.5%
GPQA Biology
pass@1 (Self-reported)
69.2%
GPQA Chemistry
pass@1 (Self-reported)
64.7%
GPQA Physics
pass@1 (Self-reported)
92.8%
LiveBench
Self-reported
67.0%
MMMLU
accuracy (Self-reported)
87.7%
SimpleQA
accuracy (Self-reported)
47.0%
TAU-bench Airline
Standard evaluation (Self-reported)
50.0%
TAU-bench Retail
Standard evaluation (Self-reported)
70.8%

License & Metadata

License
proprietary
Announcement Date
December 17, 2024
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.