
GPT OSS 20B

Multimodal
OpenAI

The gpt-oss-120b model achieves near-parity with OpenAI o4-mini on core reasoning benchmarks while running efficiently on a single GPU with 80 GB of memory. The gpt-oss-20b model delivers results comparable to OpenAI o3-mini on common benchmarks and can run on edge devices with as little as 16 GB of memory, making it well suited to on-device use, local inference, and fast iteration without expensive infrastructure. Both models also perform strongly on tool use, few-shot function calling, and CoT reasoning (as seen on the agentic evaluation suite TAU-bench), as well as on HealthBench, where they even outperform proprietary models such as OpenAI o1 and GPT-4o.

Key Specifications

Parameters
20.0B
Context
131.0K
Release Date
August 5, 2025
Average Score
37.8%

Timeline

Key dates in the model's history
Announcement / Last Update
August 5, 2025

Technical Specifications

Parameters
20.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.10
Output (per 1M tokens)
$0.50
Max Input Tokens
131.0K
Max Output Tokens
30.0K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
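At the listed rates ($0.10 per 1M input tokens, $0.50 per 1M output tokens), the cost of a request is simple arithmetic. A minimal sketch (the function name is illustrative):

```python
INPUT_PRICE = 0.10   # USD per 1M input tokens, from the pricing table above
OUTPUT_PRICE = 0.50  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed per-token rates."""
    return input_tokens / 1e6 * INPUT_PRICE + output_tokens / 1e6 * OUTPUT_PRICE

# Example: 100K input + 10K output tokens
# 0.1 x $0.10 + 0.01 x $0.50 = $0.015
cost = request_cost(100_000, 10_000)
```

Note that output tokens cost 5x input tokens, so long generations dominate the bill even when prompts are large.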

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU benchmark
Without tools (self-reported)
85.3%

Reasoning

Logical reasoning and analysis
GPQA
Diamond, without tools (self-reported)
71.5%

Other Tests

Specialized benchmarks
Codeforces Competition code
Elo, with tools (self-reported)
25.2%
Codeforces Competition code
Elo, without tools (self-reported)
22.3%
Codeforces performance is scored with the Elo rating system: models are run through a large set of head-to-head matchups on problems of varying difficulty. A model wins a matchup if it answers correctly while its opponent does not; identical outcomes (both correct or both incorrect) count as a tie. The probability of model A (rating r_A) beating model B (rating r_B) follows the standard logistic model, P(A beats B) = 1 / (1 + 10^((r_B - r_A) / 400)), and after each matchup ratings are updated toward the actual outcome with a K-factor of 4.
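The Elo scheme described above (logistic win expectation, K-factor of 4) can be sketched in a few lines; function names here are illustrative:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, score_a: float, k: float = 4.0):
    """Update both ratings after one matchup.

    score_a is A's actual result: 1.0 win, 0.5 tie, 0.0 loss.
    Each rating moves by k * (actual - expected).
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two equally rated models; A wins, so A gains 2 points and B loses 2
ra, rb = update(1000.0, 1000.0, 1.0)  # -> (1002.0, 998.0)
```

With K = 4 each matchup moves a rating by at most 4 points, so stable scores emerge only from a large number of matchups, which matches the evaluation setup described above.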
Humanity's Last Exam
Accuracy, with tools (self-reported; tool use includes Python and Sage for computation and Wolfram Alpha for verification)
17.3%
Humanity's Last Exam
Accuracy, without tools (self-reported)
10.9%
HealthBench - Realistic health conversations
Score (self-reported)
42.5%
HealthBench Hard - Challenging health conversations
Score (self-reported)
10.8%
TAU-bench Retail benchmark
Function calling (self-reported). TAU-bench Retail evaluates agentic function calling: the model must recognize when a request requires external computation or data, select the appropriate function, structure the required arguments correctly (typically as JSON), and emit a well-formed call for the harness to execute. The model does not run the function itself; it identifies when a call is needed and prepares it.
54.8%
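The harness side of function calling can be sketched as a small dispatcher that parses the model's emitted JSON and routes it to a registered tool. This is a minimal illustration, not TAU-bench's actual API; the tool name, registry, and message shape are hypothetical:

```python
import json

# Hypothetical tool registry mapping tool names to callables
TOOLS = {
    "get_order_status": lambda order_id: {"order_id": order_id, "status": "shipped"},
}

def dispatch(tool_call_json: str):
    """Parse a model-emitted call {"name": ..., "arguments": {...}} and execute it."""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]          # select the function the model asked for
    return fn(**call["arguments"])    # invoke with the model's structured arguments

# A well-formed call, as the model would emit it
result = dispatch('{"name": "get_order_status", "arguments": {"order_id": "A123"}}')
# result == {"order_id": "A123", "status": "shipped"}
```

In a real agentic loop the result would be fed back to the model as a tool message, and benchmarks like TAU-bench score whether the model chose the right function with the right arguments at each turn.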

License & Metadata

License
Apache 2.0
Announcement Date
August 5, 2025
Last Updated
August 5, 2025
