GPT OSS 120B

Text-only
OpenAI

The gpt-oss-120b model achieves near-parity with OpenAI o4-mini on major reasoning benchmarks while running efficiently on a single GPU with 80 GB of memory. The gpt-oss-20b model delivers results comparable to OpenAI o3-mini on common benchmarks and can run on edge devices with as little as 16 GB of memory, making it well suited to on-device use, local inference, and fast iteration without expensive infrastructure. Both models also perform strongly on tool use, few-shot function calling, and CoT reasoning (as seen on the agentic evaluation suite Tau-Bench), as well as on HealthBench, where they even outperform proprietary models such as OpenAI o1 and GPT-4o.
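For local use, a minimal inference sketch is shown below. It assumes the released weights are served through an OpenAI-compatible endpoint (for example, via vLLM); the model id, host, port, and prompt are illustrative assumptions, not details from this page.

```python
# Minimal sketch: querying a locally served gpt-oss-120b through an
# OpenAI-compatible endpoint. Endpoint URL and model id are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local server address
    api_key="EMPTY",                      # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",          # assumed served model id
    messages=[
        {"role": "user", "content": "Explain chain-of-thought prompting in one sentence."}
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```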

Key Specifications

Parameters
120.0B
Context
131.0K
Release Date
August 5, 2025
Average Score
45.6%

Timeline

Key dates in the model's history
Announcement / Last Update
August 5, 2025

Technical Specifications

Parameters
120.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Text-only, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.15
Output (per 1M tokens)
$0.60
Max Input Tokens
131.0K
Max Output Tokens
30.0K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
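Since Function Calling is among the supported features, a hedged sketch of passing a tool schema through an OpenAI-compatible chat API follows; the get_weather tool, endpoint, and model id are hypothetical and not taken from this page.

```python
# Sketch of the Function Calling feature via an OpenAI-compatible API.
# The tool, endpoint, and model id below are hypothetical examples.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# If the model chooses to call the tool, the call arrives as structured JSON.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```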
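The listed rates also make per-request cost straightforward to estimate; the sketch below uses only the prices above, with illustrative token counts.

```python
# Back-of-the-envelope cost estimate from the listed pricing:
# $0.15 per 1M input tokens, $0.60 per 1M output tokens.
INPUT_USD_PER_M = 0.15
OUTPUT_USD_PER_M = 0.60

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single request at the listed rates."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1e6

# Example: a 4,000-token prompt with a 1,000-token completion.
print(f"${request_cost(4_000, 1_000):.4f}")  # -> $0.0012
```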

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU benchmark
Without tools · Self-reported
90.0%

Reasoning

Logical reasoning and analysis
GPQA
Without tools · Self-reported
80.1%

Other Tests

Specialized benchmarks
Codeforces Competition code
Elo (with tools) · Self-reported
26.2%
Codeforces Competition code
Elo (without tools) · Self-reported
24.6%
Humanity's Last Exam
Accuracy (with tools) · Self-reported
19.0%
Humanity's Last Exam
Accuracy (without tools) · Self-reported
14.9%
HealthBench - Realistic health conversations
Score · Self-reported
57.6%
HealthBench Hard - Challenging health conversations
Score · Self-reported
30.0%
TAU-bench Retail benchmark
Function calling · Self-reported
67.8%

License & Metadata

License
Apache 2.0
Announcement Date
August 5, 2025
Last Updated
August 5, 2025

Similar Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.