GPT OSS 120B

Text-only
OpenAI

The gpt-oss-120b model achieves near-parity with OpenAI o4-mini on major reasoning benchmarks while running efficiently on a single GPU with 80 GB of memory. The gpt-oss-20b model delivers results comparable to OpenAI o3-mini on common benchmarks and can run on edge devices with as little as 16 GB of memory, making it well suited to on-device use, local inference, and fast iteration without expensive infrastructure. Both models also perform strongly on tool use, few-shot function calling, and CoT reasoning (as seen on the agentic evaluation suite Tau-Bench), as well as on HealthBench, where they even outperform proprietary models such as OpenAI o1 and GPT-4o.
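For local use, a minimal inference sketch is shown below. It assumes the released weights are served through an OpenAI-compatible endpoint (for example, via vLLM); the model id, host, port, and prompt are illustrative assumptions, not details from this page.

```python
# Minimal sketch: querying a locally served gpt-oss-120b through an
# OpenAI-compatible endpoint. Endpoint URL and model id are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local server address
    api_key="EMPTY",                      # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",          # assumed served model id
    messages=[
        {"role": "user", "content": "Explain chain-of-thought prompting in one sentence."}
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```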

Key Specifications

Parameters
120.0B
Context
131.0K
Release Date
August 5, 2025
Average Score
45.6%

Timeline

Key dates in the model's history
Announcement / Last Update
August 5, 2025

Technical Specifications

Parameters
120.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Text-only, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.15
Output (per 1M tokens)
$0.60
Max Input Tokens
131.0K
Max Output Tokens
30.0K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
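Since Function Calling is among the supported features, a hedged sketch of passing a tool schema through an OpenAI-compatible chat API follows; the get_weather tool, endpoint, and model id are hypothetical and not taken from this page.

```python
# Sketch of the Function Calling feature via an OpenAI-compatible API.
# The tool, endpoint, and model id below are hypothetical examples.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# If the model chooses to call the tool, the call arrives as structured JSON.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```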
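The listed rates also make per-request cost straightforward to estimate; the sketch below uses only the prices above, with illustrative token counts.

```python
# Back-of-the-envelope cost estimate from the listed pricing:
# $0.15 per 1M input tokens, $0.60 per 1M output tokens.
INPUT_USD_PER_M = 0.15
OUTPUT_USD_PER_M = 0.60

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single request at the listed rates."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1e6

# Example: a 4,000-token prompt with a 1,000-token completion.
print(f"${request_cost(4_000, 1_000):.4f}")  # -> $0.0012
```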

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU benchmark
Without tools · Self-reported
90.0%

Reasoning

Logical reasoning and analysis
GPQA
Without tools · Self-reported
80.1%

Other Tests

Specialized benchmarks
Codeforces Competition code
Elo (with tools) · Self-reported
26.2%
Codeforces Competition code
Elo (without tools) · Self-reported
24.6%
Humanity's Last Exam
Accuracy (with tools) · Self-reported
19.0%
Humanity's Last Exam
Accuracy (without tools) · Self-reported
14.9%
HealthBench - Realistic health conversations
Score · Self-reported
57.6%
HealthBench Hard - Challenging health conversations
Score · Self-reported
30.0%
TAU-bench Retail benchmark
Function calling · Self-reported
67.8%

License & Metadata

License
Apache 2.0
Announcement Date
August 5, 2025
Last Updated
August 5, 2025

Similar Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.