o1
A research preview model focused on mathematical and logical reasoning abilities, demonstrating improved performance on tasks requiring step-by-step reasoning, mathematical problem-solving, and code generation. The model shows extended formal reasoning capabilities while maintaining strong general abilities.
Key Specifications
Parameters
-
Context
200.0K
Release Date
December 17, 2024
Average Score
71.6%
Timeline
Key dates in the model's history
Announcement
December 17, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
-
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$15.00
Output (per 1M tokens)
$60.00
Max Input Tokens
200.0K
Max Output Tokens
100.0K
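Given the listed per-1M-token rates, the cost of a single request is simple arithmetic; a minimal sketch (the token counts in the example are illustrative, not typical usage figures):

```python
# Estimate request cost from the listed o1 prices (illustrative token counts).
INPUT_PRICE_PER_M = 15.00   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 60.00  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the listed per-1M-token rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a 10,000-token prompt with a 30,000-token (reasoning-heavy) response.
cost = request_cost(10_000, 30_000)
print(f"${cost:.2f}")  # → $1.95
```

Note that reasoning models like o1 bill their hidden reasoning tokens as output, so output costs can dominate even for short visible answers.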
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
pass@1 measures the fraction of queries the model answers correctly on a single attempt. The metric is mainly applied to tasks with a single correct answer or solution, such as mathematical problems. pass@1 is especially useful for evaluating a model's unaided abilities (for example, without code execution or tools), and reflects how well the model can answer directly on the first try. • Self-reported
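The single-attempt scoring described above can be sketched as follows (the task format and the always-"B" toy model are illustrative assumptions, not the benchmark's actual harness):

```python
# Minimal pass@1 scorer: one answer per task, graded right or wrong.
def pass_at_1(tasks, answer_fn):
    """tasks: list of (prompt, gold_answer) pairs.
    answer_fn: callable returning the model's single attempt for a prompt."""
    correct = sum(1 for prompt, gold in tasks if answer_fn(prompt) == gold)
    return correct / len(tasks)

# Hypothetical example: a 'model' that always answers "B".
tasks = [("Q1", "B"), ("Q2", "C"), ("Q3", "B")]
print(pass_at_1(tasks, lambda prompt: "B"))  # → 0.666...
```

The same scorer applies to any of the pass@1 benchmarks below: each task gets exactly one graded attempt, and the score is the fraction correct.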
Programming
Programming skills tests
HumanEval
In pass@1 we evaluate the model by allowing it to generate an answer only once. The model simply answers the question and states its final answer, which we then grade as correct or incorrect. This matches how evaluations such as MMLU and GPQA are scored. • Self-reported
SWE-Bench Verified
Standard evaluation • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
pass@1 is the probability that the model solves a task on a single attempt. We ask the model to solve each task once, then verify whether its answer is correct. The pass@1 for a task set is the average over all tasks. • Self-reported
MATH
pass@1 scores first attempts only. It is used for tasks with many possible solutions of which only one is correct (for example, code generation or mathematical problems). The model tries to solve each task exactly once, with no additional attempts, and the score reflects the percentage of correct first-attempt answers. • Self-reported
MGSM
pass@1 measures the proportion of tasks the model solves correctly on the first attempt, when only one try at a solution is allowed. It is a strict metric: unlike settings where the model is given several attempts (for example, pass@k with k > 1), pass@1 requires success on the very first answer, reflecting the model's ability to solve tasks without the possibility of correction. This matters for evaluating systems that must work on the first try in real scenarios with no option to present several candidates. To compute pass@1: 1. Give the model a set of tasks. 2. Collect its first answer to each task. 3. Grade each answer for correctness. 4. Report the fraction of correct answers. • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
accuracy • Self-reported
Multimodal
Working with images and visual data
MathVista
pass@1 (first-attempt accuracy) • Self-reported
MMMU
pass@1 (pass on the first try) evaluates large language models on question-answering and problem-solving tasks, especially those requiring reasoning. It measures the probability that the model produces a correct answer on its first attempt, without retries or iteration, which matters in scenarios where first-time accuracy is required. To compute pass@1: 1. Give the model a task or question. 2. Take its first answer (often graded by experts or automated methods). 3. Report the proportion of correct first-attempt answers over the task set. This metric is stricter than metrics that allow several attempts and better reflects real-world model performance. • Self-reported
Other Tests
Specialized benchmarks
AIME 2024
accuracy • Self-reported
FrontierMath
pass@1 here asks whether the selected answer is correct on the first attempt. We ask the LLM to generate several answers (usually 5-10) to the same task, then ask the model to rank those answers; if the best answer by the model's own ranking is correct, we count the task as solved on the first attempt. This variant of pass@1 evaluates the model's ability not only to find a solution but also to judge which of its solutions is correct, which matters in real use where the user should get the correct answer immediately rather than several options. • Self-reported
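The sample-then-self-select procedure described above can be sketched roughly as follows (the candidate list and ranking function are stand-ins for model sampling and self-judging, not the benchmark's actual pipeline):

```python
# Sketch of "generate several answers, let the model pick one, grade that pick".
def selected_pass_at_1(gold, candidates, rank_fn):
    """candidates: several sampled answers to the same task.
    rank_fn: the model's own preference score for an answer (stand-in here).
    Scores 1 only if the single top-ranked candidate is correct."""
    best = max(candidates, key=rank_fn)
    return 1 if best == gold else 0

# Hypothetical example: 5 sampled answers; the 'ranker' prefers even numbers.
candidates = [41, 42, 43, 42, 45]
score = selected_pass_at_1(gold=42, candidates=candidates,
                           rank_fn=lambda a: 1 if a % 2 == 0 else 0)
print(score)  # → 1
```

Note the contrast with plain pass@1: sampling several candidates is allowed, but only the one answer the model itself selects is ever graded.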
GPQA Biology
Pass@1 measures the model's ability to solve a task when given only one attempt, the most common usage pattern for large language models in practice, where a correct answer is usually required on the first try. For tasks with fixed answers (for example, multiple choice), pass@1 is the probability that the model picks the correct answer on its first attempt. For tasks requiring a generated answer (for example, mathematical problems or programming), it measures how often the model produces a correct solution on the first try. Pass@1 is a strict evaluation metric, since it allows no additional attempts or room for error, and is especially important where reliability and accuracy matter and users expect the correct answer immediately. • Self-reported
GPQA Chemistry
pass@1 (first-attempt accuracy) • Self-reported
GPQA Physics
pass@1 measures how well the model solves a task on the first attempt. For example, on a multiple-choice question, if you ask the model once and it answers correctly, pass@1 = 1.0 for that item; if it answers incorrectly, pass@1 = 0.0. Averaging pass@1 over a set of tasks gives the overall pass@1 score for that set. This differs from accuracy@k, which measures whether the correct answer appears among the model's top k responses: pass@1 coincides with accuracy@1, but pass@1 emphasizes answering correctly in a single attempt, while accuracy@k allows several. • Self-reported
LiveBench
Standard evaluation • Self-reported
MMMLU
accuracy • Self-reported
SimpleQA
accuracy • Self-reported
TAU-bench Airline
Standard evaluation • Self-reported
TAU-bench Retail
Standard evaluation • Self-reported
License & Metadata
License
proprietary
Announcement Date
December 17, 2024
Last Updated
July 19, 2025
Similar Models
GPT-4 Turbo
OpenAI
Best score: 0.9 (HumanEval)
Released: Apr 2024
Price: $10.00/1M tokens
o1-mini
OpenAI
Best score: 0.9 (HumanEval)
Released: Sep 2024
Price: $3.00/1M tokens
o1-preview
OpenAI
Best score: 0.9 (MMLU)
Released: Sep 2024
Price: $15.00/1M tokens
GPT-5 Codex
OpenAI
Released: Sep 2025
Price: $2.00/1M tokens
o3-mini
OpenAI
Best score: 0.9 (MMLU)
Released: Jan 2025
Price: $1.10/1M tokens
GPT-3.5 Turbo
OpenAI
Best score: 0.7 (MMLU)
Released: Mar 2023
Price: $0.50/1M tokens
GPT-4o mini
OpenAI
Multimodal
Best score: 0.9 (HumanEval)
Released: Jul 2024
Price: $0.15/1M tokens
GPT-5 nano
OpenAI
Multimodal
Best score: 0.7 (GPQA)
Released: Aug 2025
Price: $0.05/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.