o1
A research preview model focused on mathematical and logical reasoning abilities, demonstrating improved performance on tasks requiring step-by-step reasoning, mathematical problem-solving, and code generation. The model shows extended formal reasoning capabilities while maintaining strong general abilities.
Key Specifications
Parameters
-
Context
200.0K
Release Date
December 17, 2024
Average Score
71.6%
Timeline
Key dates in the model's history
Announcement
December 17, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$15.00
Output (per 1M tokens)
$60.00
Max Input Tokens
200.0K
Max Output Tokens
100.0K
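Given the listed per-1M-token rates, the cost of a single request is simple arithmetic; a minimal sketch (the token counts in the example are illustrative, not typical usage figures):

```python
# Estimate request cost from the listed o1 prices (illustrative token counts).
INPUT_PRICE_PER_M = 15.00   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 60.00  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the listed per-1M-token rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a 10,000-token prompt with a 30,000-token (reasoning-heavy) response.
cost = request_cost(10_000, 30_000)
print(f"${cost:.2f}")  # → $1.95
```

Note that reasoning models like o1 bill their hidden reasoning tokens as output, so output costs can dominate even for short visible answers.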
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
pass@1 measures the fraction of queries the model answers correctly on a single attempt. The metric is mainly applied to tasks with a single correct answer or solution, such as mathematical problems. pass@1 is especially useful for evaluating a model's unaided abilities (for example, without code execution or tools), and reflects how well the model can answer directly on the first try. • Self-reported
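The single-attempt scoring described above can be sketched as follows (the task format and the always-"B" toy model are illustrative assumptions, not the benchmark's actual harness):

```python
# Minimal pass@1 scorer: one answer per task, graded right or wrong.
def pass_at_1(tasks, answer_fn):
    """tasks: list of (prompt, gold_answer) pairs.
    answer_fn: callable returning the model's single attempt for a prompt."""
    correct = sum(1 for prompt, gold in tasks if answer_fn(prompt) == gold)
    return correct / len(tasks)

# Hypothetical example: a 'model' that always answers "B".
tasks = [("Q1", "B"), ("Q2", "C"), ("Q3", "B")]
print(pass_at_1(tasks, lambda prompt: "B"))  # → 0.666...
```

The same scorer applies to any of the pass@1 benchmarks below: each task gets exactly one graded attempt, and the score is the fraction correct.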
Programming
Programming skills tests
HumanEval
In pass@1 we evaluate the model by allowing it to generate an answer only once. The model simply answers the question and states its final answer, which we then grade as correct or incorrect. This matches how evaluations such as MMLU and GPQA are scored. • Self-reported
SWE-Bench Verified
Standard evaluation • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
pass@1 is the probability that the model solves a task on a single attempt. We ask the model to solve each task once, then verify whether its answer is correct. The pass@1 for a task set is the average over all tasks. • Self-reported
MATH
pass@1 scores first attempts only. It is used for tasks with many possible solutions of which only one is correct (for example, code generation or mathematical problems). The model tries to solve each task exactly once, with no additional attempts, and the score reflects the percentage of correct first-attempt answers. • Self-reported
MGSM
pass@1 measures the proportion of tasks the model solves correctly on the first attempt, when only one try at a solution is allowed. It is a strict metric: unlike settings where the model is given several attempts (for example, pass@k with k > 1), pass@1 requires success on the very first answer, reflecting the model's ability to solve tasks without the possibility of correction. This matters for evaluating systems that must work on the first try in real scenarios with no option to present several candidates. To compute pass@1: 1. Give the model a set of tasks. 2. Collect its first answer to each task. 3. Grade each answer for correctness. 4. Report the fraction of correct answers. • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
accuracy • Self-reported
Multimodal
Working with images and visual data
MathVista
pass@1 (first-attempt accuracy) • Self-reported
MMMU
pass@1 (pass on the first try) evaluates large language models on question-answering and problem-solving tasks, especially those requiring reasoning. It measures the probability that the model produces a correct answer on its first attempt, without retries or iteration, which matters in scenarios where first-time accuracy is required. To compute pass@1: 1. Give the model a task or question. 2. Take its first answer (often graded by experts or automated methods). 3. Report the proportion of correct first-attempt answers over the task set. This metric is stricter than metrics that allow several attempts and better reflects real-world model performance. • Self-reported
Other Tests
Specialized benchmarks
AIME 2024
accuracy • Self-reported
FrontierMath
pass@1 here asks whether the selected answer is correct on the first attempt. We ask the LLM to generate several answers (usually 5-10) to the same task, then ask the model to rank those answers; if the best answer by the model's own ranking is correct, we count the task as solved on the first attempt. This variant of pass@1 evaluates the model's ability not only to find a solution but also to judge which of its solutions is correct, which matters in real use where the user should get the correct answer immediately rather than several options. • Self-reported
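The sample-then-self-select procedure described above can be sketched roughly as follows (the candidate list and ranking function are stand-ins for model sampling and self-judging, not the benchmark's actual pipeline):

```python
# Sketch of "generate several answers, let the model pick one, grade that pick".
def selected_pass_at_1(gold, candidates, rank_fn):
    """candidates: several sampled answers to the same task.
    rank_fn: the model's own preference score for an answer (stand-in here).
    Scores 1 only if the single top-ranked candidate is correct."""
    best = max(candidates, key=rank_fn)
    return 1 if best == gold else 0

# Hypothetical example: 5 sampled answers; the 'ranker' prefers even numbers.
candidates = [41, 42, 43, 42, 45]
score = selected_pass_at_1(gold=42, candidates=candidates,
                           rank_fn=lambda a: 1 if a % 2 == 0 else 0)
print(score)  # → 1
```

Note the contrast with plain pass@1: sampling several candidates is allowed, but only the one answer the model itself selects is ever graded.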
GPQA Biology
Pass@1 measures the model's ability to solve a task when given only one attempt, the most common usage pattern for large language models in practice, where a correct answer is usually required on the first try. For tasks with fixed answers (for example, multiple choice), pass@1 is the probability that the model picks the correct answer on its first attempt. For tasks requiring a generated answer (for example, mathematical problems or programming), it measures how often the model produces a correct solution on the first try. Pass@1 is a strict evaluation metric, since it allows no additional attempts or room for error, and is especially important where reliability and accuracy matter and users expect the correct answer immediately. • Self-reported
GPQA Chemistry
pass@1 (first-attempt accuracy) • Self-reported
GPQA Physics
pass@1 measures how well the model solves a task on the first attempt. For example, on a multiple-choice question, if you ask the model once and it answers correctly, pass@1 = 1.0 for that item; if it answers incorrectly, pass@1 = 0.0. Averaging pass@1 over a set of tasks gives the overall pass@1 score for that set. This differs from accuracy@k, which measures whether the correct answer appears among the model's top k responses: pass@1 coincides with accuracy@1, but pass@1 emphasizes answering correctly in a single attempt, while accuracy@k allows several. • Self-reported
LiveBench
Standard evaluation • Self-reported
MMMLU
accuracy • Self-reported
SimpleQA
accuracy • Self-reported
TAU-bench Airline
Standard evaluation • Self-reported
TAU-bench Retail
Standard evaluation • Self-reported
License & Metadata
License
proprietary
Announcement Date
December 17, 2024
Last Updated
July 19, 2025
Similar Models
GPT-4 Turbo
OpenAI
Best score: 0.9 (HumanEval)
Released: Apr 2024
Price: $10.00/1M tokens
o1-mini
OpenAI
Best score: 0.9 (HumanEval)
Released: Sep 2024
Price: $3.00/1M tokens
o1-preview
OpenAI
Best score: 0.9 (MMLU)
Released: Sep 2024
Price: $15.00/1M tokens
GPT-5 Codex
OpenAI
Released: Sep 2025
Price: $2.00/1M tokens
o3-mini
OpenAI
Best score: 0.9 (MMLU)
Released: Jan 2025
Price: $1.10/1M tokens
GPT-3.5 Turbo
OpenAI
Best score: 0.7 (MMLU)
Released: Mar 2023
Price: $0.50/1M tokens
GPT-4o mini
OpenAI
Multimodal
Best score: 0.9 (HumanEval)
Released: Jul 2024
Price: $0.15/1M tokens
GPT-5 nano
OpenAI
Multimodal
Best score: 0.7 (GPQA)
Released: Aug 2025
Price: $0.05/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.