
GPT OSS 20B

Multimodal
OpenAI

The gpt-oss-120b model achieves near-parity with OpenAI o4-mini on core reasoning benchmarks while running efficiently on a single GPU with 80 GB of memory. The gpt-oss-20b model delivers results comparable to OpenAI o3-mini on common benchmarks and can run on edge devices with as little as 16 GB of memory, making it well suited to on-device use, local inference, and fast iteration without expensive infrastructure. Both models also perform strongly on tool use, few-shot function calling, and CoT reasoning (as seen on the agentic evaluation suite TAU-bench), as well as on HealthBench, where they even outperform proprietary models such as OpenAI o1 and GPT-4o.

Key Specifications

Parameters
20.0B
Context
131.0K
Release Date
August 5, 2025
Average Score
37.8%

Timeline

Key dates in the model's history
Announcement / Last Update
August 5, 2025

Technical Specifications

Parameters
20.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.10
Output (per 1M tokens)
$0.50
Max Input Tokens
131.0K
Max Output Tokens
30.0K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
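At the listed rates ($0.10 per 1M input tokens, $0.50 per 1M output tokens), the cost of a request is simple arithmetic. A minimal sketch (the function name is illustrative):

```python
INPUT_PRICE = 0.10   # USD per 1M input tokens, from the pricing table above
OUTPUT_PRICE = 0.50  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed per-token rates."""
    return input_tokens / 1e6 * INPUT_PRICE + output_tokens / 1e6 * OUTPUT_PRICE

# Example: 100K input + 10K output tokens
# 0.1 x $0.10 + 0.01 x $0.50 = $0.015
cost = request_cost(100_000, 10_000)
```

Note that output tokens cost 5x input tokens, so long generations dominate the bill even when prompts are large.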

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU benchmark
Without tools (self-reported)
85.3%

Reasoning

Logical reasoning and analysis
GPQA
Diamond, without tools (self-reported)
71.5%

Other Tests

Specialized benchmarks
Codeforces Competition code
Elo, with tools (self-reported)
25.2%
Codeforces Competition code
Elo, without tools (self-reported)
22.3%
Codeforces performance is scored with the Elo rating system: models are run through a large set of head-to-head matchups on problems of varying difficulty. A model wins a matchup if it answers correctly while its opponent does not; identical outcomes (both correct or both incorrect) count as a tie. The probability of model A (rating r_A) beating model B (rating r_B) follows the standard logistic model, P(A beats B) = 1 / (1 + 10^((r_B - r_A) / 400)), and after each matchup ratings are updated toward the actual outcome with a K-factor of 4.
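The Elo scheme described above (logistic win expectation, K-factor of 4) can be sketched in a few lines; function names here are illustrative:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, score_a: float, k: float = 4.0):
    """Update both ratings after one matchup.

    score_a is A's actual result: 1.0 win, 0.5 tie, 0.0 loss.
    Each rating moves by k * (actual - expected).
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two equally rated models; A wins, so A gains 2 points and B loses 2
ra, rb = update(1000.0, 1000.0, 1.0)  # -> (1002.0, 998.0)
```

With K = 4 each matchup moves a rating by at most 4 points, so stable scores emerge only from a large number of matchups, which matches the evaluation setup described above.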
Humanity's Last Exam
Accuracy, with tools (self-reported; tool use includes Python and Sage for computation and Wolfram Alpha for verification)
17.3%
Humanity's Last Exam
Accuracy, without tools (self-reported)
10.9%
HealthBench - Realistic health conversations
Score (self-reported)
42.5%
HealthBench Hard - Challenging health conversations
Score (self-reported)
10.8%
TAU-bench Retail benchmark
Function calling (self-reported). TAU-bench Retail evaluates agentic function calling: the model must recognize when a request requires external computation or data, select the appropriate function, structure the required arguments correctly (typically as JSON), and emit a well-formed call for the harness to execute. The model does not run the function itself; it identifies when a call is needed and prepares it.
54.8%
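The harness side of function calling can be sketched as a small dispatcher that parses the model's emitted JSON and routes it to a registered tool. This is a minimal illustration, not TAU-bench's actual API; the tool name, registry, and message shape are hypothetical:

```python
import json

# Hypothetical tool registry mapping tool names to callables
TOOLS = {
    "get_order_status": lambda order_id: {"order_id": order_id, "status": "shipped"},
}

def dispatch(tool_call_json: str):
    """Parse a model-emitted call {"name": ..., "arguments": {...}} and execute it."""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]          # select the function the model asked for
    return fn(**call["arguments"])    # invoke with the model's structured arguments

# A well-formed call, as the model would emit it
result = dispatch('{"name": "get_order_status", "arguments": {"order_id": "A123"}}')
# result == {"order_id": "A123", "status": "shipped"}
```

In a real agentic loop the result would be fed back to the model as a tool message, and benchmarks like TAU-bench score whether the model chose the right function with the right arguments at each turn.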

License & Metadata

License
Apache 2.0
Announcement Date
August 5, 2025
Last Updated
August 5, 2025
