
Every Frontier AI Model Scored Under 1% on ARC-AGI-3. Humans Got 100%.

Chollet's new benchmark drops the same week Jensen Huang declared AGI. GPT-5.4 scored 0.26%. Claude Opus 4.6 scored 0.25%. The gap with humans is 99+ points.

Vlad Makarov · reviewed and published
9 min read

Two days after Jensen Huang declared "we've achieved AGI" and one day after the man who coined the term agreed, François Chollet dropped 135 interactive puzzle games and asked every frontier model to play them. The results were humbling.

The Scores

Model                            Score
Gemini 3.1 Pro (Preview)         0.37%
GPT-5.4                          0.26%
Claude Opus 4.6                  0.25%
Grok 4.20 (Beta Reasoning)       0.00%
Humans (untrained, first try)    100%

That's not a typo. Every frontier model scored under 1%. Regular humans — with no instructions, no training, no stated goals — solved all 135 environments on their first attempt.

What ARC-AGI-3 Actually Tests

Previous ARC benchmarks were static puzzles: show a pattern, predict the next one. ARC-AGI-1 took four years to go from 0% to saturation. ARC-AGI-2 lasted about a year before Gemini 3.1 Pro hit 77%. Both are now effectively solved.

ARC-AGI-3 is fundamentally different. It's 135 interactive game-like environments built by an in-house game studio. Each one is unique. There are no instructions, no rules explained, no goals stated. An AI agent gets dropped into an unfamiliar world and has to figure out what's going on — explore, form hypotheses about the rules, discover what "winning" means, and execute a plan. Exactly what you'd do if someone handed you a game you'd never seen before.
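
To make that loop concrete, here is a hypothetical sketch of the interaction pattern described above: the agent receives only raw observations and an action space, and has to discover the rules and the win condition by acting. The Environment interface, method names, and the baseline agent below are illustrative assumptions, not the actual ARC-AGI-3 API.

```python
import random
from typing import Any, Protocol

class Environment(Protocol):
    """Hypothetical stand-in for one ARC-AGI-3 environment (interface assumed)."""
    def reset(self) -> Any: ...                           # initial observation only
    def step(self, action: int) -> tuple[Any, bool]: ...  # (next observation, solved?)
    @property
    def num_actions(self) -> int: ...

class RandomExplorer:
    """Baseline that wanders aimlessly: exactly the behavior an efficiency score punishes."""
    def choose_action(self, obs: Any, num_actions: int) -> int:
        return random.randrange(num_actions)
    def observe(self, obs: Any) -> None:
        pass  # a real agent would update its hypotheses about the rules here

def play_unfamiliar_game(env: Environment, agent, max_actions: int = 1000) -> int:
    """Act until the level is solved; return the number of actions used."""
    obs = env.reset()
    for t in range(1, max_actions + 1):
        action = agent.choose_action(obs, env.num_actions)  # explore / test a hypothesis
        obs, solved = env.step(action)
        agent.observe(obs)                                  # refine the model of the rules
        if solved:
            return t
    return max_actions  # never figured out what "winning" means
```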

The scoring system, called RHAE (Relative Human Action Efficiency), uses a squared penalty for inefficiency. If a human solves a level in 10 actions and the AI stumbles through it in 100, the AI scores 1% — not 10%. Wandering and guessing are punished harshly. 110 of the 135 environments are kept private to prevent memorization.
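
To see the penalty concretely, here is a minimal Python sketch of a squared action-efficiency score. The exact RHAE formula and its aggregation across levels are not spelled out here; the functions below simply assume score = (human actions / agent actions)² capped at 100%, which reproduces the 10-versus-100 example above.

```python
# Sketch of a squared action-efficiency score. The precise RHAE definition is
# an assumption; this version matches the "10 human actions vs. 100 agent
# actions -> 1%, not 10%" example.

def rhae_level_score(human_actions: int, agent_actions: int) -> float:
    """Return a 0-1 efficiency score for a single solved level."""
    if agent_actions <= 0:
        raise ValueError("agent must take at least one action")
    efficiency = human_actions / agent_actions   # linear efficiency ratio
    return min(efficiency, 1.0) ** 2             # squared penalty for wandering

def rhae_benchmark_score(levels: list[tuple[int, int]]) -> float:
    """Average the per-level scores; an unsolved level would contribute 0."""
    return sum(rhae_level_score(h, a) for h, a in levels) / len(levels)

if __name__ == "__main__":
    print(rhae_level_score(10, 100))   # 0.01 -> the "1%, not 10%" case
    print(rhae_level_score(10, 10))    # 1.0  -> matches human efficiency
```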

The Scaffolding Experiment

The most revealing data point came from Duke University. They built a custom harness for Claude Opus 4.6 and tested it on a single known environment variant called TR87. The result: 97.1% on the familiar environment, 0% on an unfamiliar one.

Chollet's interpretation was pointed: "The scaffolding is the human intelligence; the model is just executing it." When researchers build elaborate prompting strategies, custom harnesses, and thinking tricks around a model, the intelligence isn't in the model — it's in the scaffolding. Strip that away, and you're left with sub-1% performance on genuinely novel tasks.

Chollet's Argument

Chollet timed the release deliberately, posting a thread on X the same day the benchmark launched. His core position: "The G in AGI stands for general." General intelligence doesn't mean being good at many specific tasks you've been trained on. It means facing something genuinely new and figuring it out independently.

"If a normal human with no instructions can do it, and your system can't, then you don't have AGI — you have a very expensive autocomplete that needs a lot of help."

He drew a clear distinction between knowledge and what he calls fluid intelligence: "Models have more knowledge than you do. But they have very low ability to recombine that knowledge at test time to make sense of something they've never seen before. That's the way the entire paradigm works."

The benchmark isn't designed to be unfair. It's designed to test exactly the capability that separates memorization from understanding. And on that specific axis, the gap between humans and machines isn't narrowing — it's a chasm.

The $2 Million Prize

ARC Prize 2026 offers over $2 million across two competition tracks on Kaggle, with all winning solutions required to be open-sourced. The best developer preview agent scored 12.58% during early testing — better than raw API calls, but still a world away from human performance.

Chollet told Sherwood News they're already working on ARC-AGI-4 and ARC-AGI-5, with future versions targeting continual learning, open-endedness, and autonomous invention. He reportedly thinks "ARC-AGI 6-7 will be the last benchmark to be saturated before real AGI comes out."

What This Means for the AGI Debate

The timing created a direct collision between corporate AGI declarations and empirical measurement. Huang defines AGI as an agent that can generate a billion dollars in revenue. Chollet defines it as a system that handles genuinely novel tasks without human assistance. By Huang's definition, we might be there. By Chollet's, we're scoring 0.37%.

The uncomfortable truth is that both definitions have merit. Current models are extraordinarily useful — they code, compress knowledge, and generate real economic value. They're also, apparently, incapable of playing a simple game they haven't seen before without extensive human help.

Whether that gap closes through scaling, architectural breakthroughs, or approaches like LeCun's world models remains the most important open question in AI. ARC-AGI-3 just made it measurable.
