AI Model Comparison & LLM Leaderboard 2026
Traictory benchmarks 233 AI models across 12 standardized tests — including GPQA, SWE-Bench, and MMLU — so you can identify the best model for coding, research, or automation in under 60 seconds.
Token Generation Speed Demo — Live Comparison
Choosing the Right AI Model — Decision Guide
Reasoning Models vs Standard Models
Reasoning models (Claude Opus with extended thinking, GPT-5.4 Thinking, DeepSeek-R1) use chain-of-thought processing and score significantly higher on GPQA, math, and complex coding benchmarks — but at 2–5x the token cost and higher latency. Standard models (Claude Sonnet, GPT-4.1, Gemini Flash) offer faster responses at lower cost, ideal for most production use cases. Choose reasoning models only when accuracy on hard problems justifies the premium.
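To make the cost side of that tradeoff concrete, here is a minimal sketch in Python. All prices, token counts, and the amount of hidden "thinking" output are hypothetical placeholders, not actual vendor pricing:

```python
# Hypothetical cost comparison for the same request: a standard model vs. a
# reasoning model that also bills hidden "thinking" tokens as output.
# All prices and token counts are illustrative placeholders, not vendor rates.

PRICE_IN_PER_M = 3.00    # USD per million input tokens (hypothetical)
PRICE_OUT_PER_M = 15.00  # USD per million output tokens (hypothetical)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request in USD."""
    return (input_tokens * PRICE_IN_PER_M + output_tokens * PRICE_OUT_PER_M) / 1_000_000

prompt_tokens = 2_000
answer_tokens = 800
thinking_tokens = 3_200   # extra chain-of-thought tokens emitted by the reasoning model

standard = request_cost(prompt_tokens, answer_tokens)
reasoning = request_cost(prompt_tokens, answer_tokens + thinking_tokens)

print(f"standard:  ${standard:.4f}")             # $0.0180
print(f"reasoning: ${reasoning:.4f}")            # $0.0660
print(f"ratio:     {reasoning / standard:.1f}x") # ~3.7x, within the 2-5x range above
```

In practice reasoning models often carry higher per-token rates as well, so the effective multiplier depends on both pricing and how many thinking tokens a given task triggers.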
For Developers & Coders
Coding: Code generation, debugging, and software engineering tasks. Prioritize SWE-Bench and HumanEval scores.
For Researchers & Analysts
Research: Scientific reasoning, data analysis, and knowledge-intensive tasks. Prioritize GPQA and MMLU scores.
Budget-Sensitive
Budget: Maximum performance per dollar. Open-source and low-cost API options for automation and batch processing.
Speed-Critical
Speed: Real-time chatbots, autocomplete, and interactive applications where latency matters most.
How We Compare AI Models — Ranking Methodology
Each model receives a weighted score calculated from its performance on 6 benchmarks. We chose equal weight for the three hardest evaluations (GPQA, SWE-Bench, Tau2 at 20% each) because they best differentiate frontier models — most top LLMs score 90%+ on easier tests like GSM8K or HellaSwag, making those less useful for ranking. Knowledge and multimodal benchmarks (MMLU, MMMU-Pro) receive 15% each, and abstract reasoning (ARC-AGI) receives 10%.
Models with fewer than 2 benchmark results are excluded from the ranking. Scores are normalized per benchmark so that different scoring scales (0–1 vs 0–100) are comparable. The final composite score determines the model's position in our leaderboard.
GPQA (20%): PhD-level scientific reasoning (physics, chemistry, biology)
SWE-Bench (20%): Real-world software engineering — resolving GitHub issues
Tau2 (20%): Complex tool-calling and multi-step API orchestration
MMLU (15%): Broad knowledge across 57 academic subjects
MMMU-Pro (15%): Expert-level multimodal understanding (images, charts, diagrams)
ARC-AGI (10%): Abstract reasoning and pattern recognition
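To make the weighting and normalization above concrete, here is a minimal Python sketch of how such a composite could be computed. The weights mirror the methodology described above; the example scores are invented, and the handling of missing benchmarks (re-weighting over the ones available) is an assumption for illustration, not necessarily Traictory's exact implementation:

```python
# Sketch of a weighted composite score: normalize each benchmark to 0-1, then
# average with fixed weights. Scores below are made-up examples, not real results.

WEIGHTS = {
    "GPQA": 0.20, "SWE-Bench": 0.20, "Tau2": 0.20,
    "MMLU": 0.15, "MMMU-Pro": 0.15, "ARC-AGI": 0.10,
}
MIN_BENCHMARKS = 2  # models with fewer results are excluded from the ranking

def normalize(score: float) -> float:
    """Map a raw score to 0-1, whether it was reported on a 0-1 or a 0-100 scale."""
    return score / 100.0 if score > 1.0 else score

def composite(results: dict[str, float]) -> float | None:
    """Weighted composite over the benchmarks a model actually has results for."""
    available = {b: normalize(s) for b, s in results.items() if b in WEIGHTS}
    if len(available) < MIN_BENCHMARKS:
        return None  # excluded from the ranking
    total_weight = sum(WEIGHTS[b] for b in available)
    return sum(WEIGHTS[b] * s for b, s in available.items()) / total_weight

# Hypothetical model with scores reported on mixed scales (0-100 and 0-1):
example = {"GPQA": 72.0, "SWE-Bench": 0.55, "MMLU": 88.5}
print(f"composite: {composite(example):.3f}")  # weighted over the 3 available benchmarks
```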
Data Sources & Transparency
Benchmark data is sourced from official API provider publications, independently published evaluation papers, and community-run leaderboards. Key academic references: GPQA (Rein et al., 2023, arXiv:2311.12022), SWE-Bench (Jimenez et al., 2023, arXiv:2310.06770), MMLU (Hendrycks et al., 2020, arXiv:2009.03300). Self-reported scores from vendors are cross-checked against independent reproductions where available.
Disclaimer: Benchmark scores reflect specific test conditions and may not fully predict real-world performance. Traictory does not guarantee the accuracy of vendor-reported scores and recommends independent validation for production use cases. Our composite ranking is one view of model quality — task-specific evaluations may yield different results.
Understanding AI Benchmarks — Key Tests Explained
GPQA Diamond
Science: PhD-level multiple-choice questions in physics, chemistry, and biology (198 questions). Random guessing yields ~25%. Human PhD experts score approximately 65–70%. Source: Rein et al. 2023.
SWE-Bench Verified
Engineering: Real-world GitHub issues that models must resolve by writing code patches. Tests end-to-end software engineering ability — from understanding a bug report to submitting a working fix. Source: Jimenez et al. 2023.
MMLU
Knowledge: Massive Multitask Language Understanding — 57 academic subjects from STEM to humanities (14,042 questions). The standard benchmark for broad knowledge evaluation. Source: Hendrycks et al. 2020.
HumanEval
Programming: 164 Python programming tasks testing code generation from docstrings. Measures functional correctness via Pass@1 — the model must produce a working solution on the first attempt.
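For a sense of what a HumanEval-style task and its Pass@1 check look like, here is a toy example. The task, candidate solution, and hidden tests are invented for illustration and are not taken from the actual dataset:

```python
# Toy HumanEval-style task: the model sees only the signature and docstring,
# and Pass@1 asks whether its first generated solution passes the hidden tests.
# This example is invented for illustration; it is not from the real benchmark.

def running_max(numbers: list[int]) -> list[int]:
    """Return a list where each element is the maximum seen so far.
    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
    # A candidate completion the model might generate on its first attempt:
    result, best = [], float("-inf")
    for n in numbers:
        best = max(best, n)
        result.append(best)
    return result

def check(candidate) -> bool:
    """Hidden unit tests: Pass@1 is the fraction of tasks whose first sample passes."""
    return (candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
            and candidate([]) == []
            and candidate([-2, -5, -1]) == [-2, -2, -1])

print("Passes on first attempt:", check(running_max))  # True if the completion is correct
```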
Tau2
Tool Calling: Comprehensive tool-calling benchmark testing multi-step API interactions with complex parameter schemas. Critical for evaluating agentic AI capabilities in real automation scenarios.
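To illustrate what "complex parameter schemas" and multi-step orchestration mean in practice, here is a generic sketch using an OpenAI-style function schema. The tool, its fields, and the workflow are invented examples and are not drawn from Tau2 itself:

```python
# Generic tool-calling sketch: a JSON-schema function definition with nested
# parameters, plus a two-step sequence where step 2 depends on step 1's output.
# The tool and workflow are invented examples, not part of any specific benchmark.
import json

SEARCH_FLIGHTS = {
    "name": "search_flights",
    "description": "Search available flights matching the given criteria.",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "destination": {"type": "string"},
            "passengers": {
                "type": "object",  # nested schema the model must fill correctly
                "properties": {
                    "adults": {"type": "integer", "minimum": 1},
                    "children": {"type": "integer", "minimum": 0},
                },
                "required": ["adults"],
            },
        },
        "required": ["origin", "destination", "passengers"],
    },
}

# Step 1: the model emits arguments for search_flights from the user's request.
step1_args = {"origin": "BER", "destination": "LIS",
              "passengers": {"adults": 2, "children": 1}}

# Step 2: a follow-up call (e.g. booking) must reuse an ID returned by step 1;
# tool-calling benchmarks score whether the chain of calls stays consistent end to end.
step1_result = {"flights": [{"id": "FL-123", "price_eur": 240}]}
step2_args = {"flight_id": step1_result["flights"][0]["id"], "seats": 3}

print(json.dumps(step2_args))
```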
ARC-AGI
Reasoning: Abstraction and Reasoning Corpus — visual pattern recognition tasks that test genuine reasoning ability rather than memorization. Considered one of the hardest tests for AI general intelligence.
MMMU-Pro
Multimodal: Expert-level multimodal understanding — questions requiring joint reasoning over images, charts, diagrams, and text across 30+ disciplines.
GSM8K
Math: 8,500 grade-school math word problems requiring multi-step arithmetic reasoning. A baseline test — frontier models now score 95%+, so it is mainly useful for comparing smaller and open-source models.
Arena Hard
Dialogue: 500 challenging real-world user prompts from Chatbot Arena. Tests instruction following, creativity, and nuanced reasoning on tasks where models frequently disagree.
HellaSwag
Understanding: Common-sense reasoning and sentence completion test. Models must predict the most plausible continuation of everyday scenarios. Frontier models now exceed 95% accuracy.
ComplexFuncBench
Tool Calling: Complex function calling with nested parameters, multi-step chaining, and error recovery. Tests whether models can reliably orchestrate real-world API workflows.
ToolBench
Tool Calling: Practical API usage benchmark with 16,000+ real-world REST APIs. Evaluates planning, API selection, and parameter extraction for autonomous tool use.
Verified Data Sources
Benchmark scores sourced from official API providers, peer-reviewed papers, and independent evaluation platforms.
Updated Daily
Rankings refreshed daily with the latest model releases, benchmark results, and API pricing changes.
Independent & Transparent
No sponsorships or paid placements. Our ranking formula and data sources are fully disclosed.

