All News
microsoft · mai · speech-to-text · tts · image-generation · suleyman

Microsoft Just Built Its Own AI Models — With Teams of 10 People

Mustafa Suleyman's superintelligence team ships MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 — state-of-the-art models built by tiny teams that beat OpenAI Whisper and undercut every hyperscaler on price.

Vlad Makarov · reviewed and published
9 min read

Ten people built a speech transcription model that beats OpenAI Whisper on every single one of 25 benchmarked languages. Another team of fewer than ten created an image generator that debuted top-three on Arena.ai. These aren't scrappy startup stories — this is Microsoft, a $3 trillion company that until five months ago was contractually prohibited from building its own frontier AI.

The First Salvo From Suleyman's Team

On April 2, Microsoft launched three foundation models through its AI Foundry platform: MAI-Transcribe-1 for speech-to-text, MAI-Voice-1 for text-to-speech, and MAI-Image-2 for image generation. Together, they represent the first public output from Mustafa Suleyman's superintelligence team, which he formally stood up in October 2025 after Microsoft renegotiated its contract with OpenAI.

That contract renegotiation is the real backstory here. Until October 2025, Microsoft was legally barred from independently pursuing artificial general intelligence — part of the original 2019 deal that gave Microsoft access to OpenAI's models in exchange for cloud infrastructure. When OpenAI started seeking compute beyond Microsoft — deals with SoftBank and others — Microsoft renegotiated. The new terms freed the company to build its own frontier models while retaining license rights to everything OpenAI builds through 2032.

"Back in September of last year, we renegotiated the contract with OpenAI, and that enabled us to independently pursue our own superintelligence," Suleyman told VentureBeat. He was quick to add that the OpenAI partnership remains intact through at least 2032 — but the direction is clear.

What the Models Actually Do

MAI-Transcribe-1 is the headline release. The speech-to-text model achieves the lowest average Word Error Rate (3.8%) on the FLEURS benchmark across the 25 languages most used in Microsoft products. It beats OpenAI's Whisper-large-v3 on all 25 languages, Google's Gemini 3.1 Flash on 22 of 25, and ElevenLabs Scribe v2 and OpenAI's GPT-Transcribe on 15 of 25 each. Batch transcription runs 2.5x faster than Microsoft's previous Azure Fast offering. It's already being tested inside Copilot Voice mode and Teams.
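Word Error Rate, the metric behind these comparisons, is the word-level edit distance between a model's transcript and the reference, divided by the reference length. A minimal sketch of the computation (illustrative only, not Microsoft's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp; we roll the DP table row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution (free if words match)
            ))
        prev = curr
    return prev[-1] / len(ref)

# One wrong word out of four reference words -> 25% WER.
print(wer("the quick brown fox", "the quick brown fax"))  # 0.25
```

By this measure, a 3.8% average WER means roughly four word-level errors per hundred reference words.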

MAI-Voice-1 generates 60 seconds of natural-sounding audio in a single second — 60x real-time speed. It preserves speaker identity across long-form content and can create custom voices from just a few seconds of sample audio. It's priced at $22 per million characters.

MAI-Image-2 debuted as a top-three model family on the Arena.ai leaderboard and delivers at least 2x faster generation compared to its predecessor. It's rolling out across Bing and PowerPoint, priced at $5 per million input tokens and $33 per million image output tokens. WPP, one of the world's largest advertising companies, is among the first enterprise partners building with it at scale.

The "10-Person Team" Revelation

Perhaps the most remarkable detail is the team size. "The audio model was built by 10 people, and the vast majority of the speed, efficiency and accuracy gains come from the model architecture and the data that we have used," Suleyman said. "Our image team, equally, is less than 10 people."

This challenges the prevailing narrative that frontier AI requires thousands of researchers and astronomical headcount budgets. Meta, by contrast, has pursued what Suleyman characterized as a strategy of "hiring a lot of individuals, rather than maybe creating a team" — including compensation packages of $100M–$200M for top researchers.

Suleyman described his team's work environment as closer to a startup trading floor than a traditional Microsoft engineering org: "Groups of people around circular tables, on laptops instead of big screens. They're basically vibe coding, side by side all day, morning till night, in rooms of 50 or 60 people."

Small teams producing state-of-the-art results fundamentally change the economics. If Microsoft can build best-in-class transcription with 10 engineers and half the GPUs of competitors, the margin structure of their AI business looks very different from companies burning cash to achieve similar benchmarks.

Aggressive Pricing as Strategy

The pricing is deliberately designed to undercut everyone:

Model              Pricing                                        Comparable To
MAI-Transcribe-1   Competitive with Azure pricing                 Whisper, Gemini Flash, ElevenLabs Scribe
MAI-Voice-1        $22 / 1M characters                            ElevenLabs, Resemble AI
MAI-Image-2        $5 / 1M input tokens, $33 / 1M image tokens    DALL-E, Midjourney API
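Read as per-request costs, the image pricing works out to cents per generation. The token counts below are purely illustrative assumptions — Microsoft hasn't published how many tokens a generated image consumes:

```python
def image_cost(input_tokens: int, image_tokens: int,
               input_rate: float = 5.00, image_rate: float = 33.00) -> float:
    """Cost in USD at the listed MAI-Image-2 per-million-token rates."""
    return (input_tokens * input_rate + image_tokens * image_rate) / 1_000_000

# Hypothetical request: a 50-token prompt producing a 4,000-token image.
print(f"${image_cost(50, 4_000):.2f} per image")  # $0.13
```

At anything like those token counts, the output-token rate dominates and the prompt cost is noise — which is why the $33 image-token figure is the number to compare against competitors.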

"We're pricing them to be the very best of any hyperscaler. They will be the cheapest of any of the hyperscalers out there — Amazon, and obviously Google," Suleyman said. "That's a very conscious decision."

This makes strategic sense for Microsoft. The company can amortize development costs across its massive enterprise customer base while using these models to reduce internal costs for Teams, Copilot, Bing, and PowerPoint. In his March internal memo, Suleyman wrote that his models would "enable us to deliver the COGS efficiencies necessary to serve AI workloads at the immense scale required in the coming years."

The Independence Play

The announcement lands during Microsoft's worst stock quarter since the 2008 financial crisis, with the stock down roughly 17% year-to-date as investors demand proof that AI spending translates into revenue. These models — built cheaply, priced aggressively, and already shipping in Microsoft products — are Suleyman's first concrete answer.

But transcription, voice, and image generation are just the beginning. When asked about building a large language model to compete with GPT at the frontier, Suleyman was blunt: "We absolutely are going to be delivering state of the art models across all modalities. Our mission is to make sure that if Microsoft ever needs it, we will be able to provide state of the art at the best efficiency, the cheapest price, and be completely independent."

He described a multi-year roadmap, with CEO Satya Nadella personally flying to a team gathering in Miami to lay out "the roadmap of everything we need to achieve for our AI self-sufficiency mission over the next 2, 3, 4 years." Microsoft already has a strong foundation with its Phi model family for smaller-scale reasoning — a frontier LLM would complete the picture.

What This Means for the Industry

Microsoft positioning itself as an independent AI lab — while maintaining its OpenAI partnership and offering Anthropic's Claude through Foundry — creates a genuinely new competitive dynamic. The company can now offer customers a stack where models from OpenAI, Anthropic, Google, and Microsoft itself all compete on the same platform. Suleyman calls this being "a platform of platforms."

For the broader ecosystem, the small-team, high-efficiency approach validates what companies like DeepSeek have been demonstrating: you don't need thousands of researchers to build world-class models. Architecture and data quality matter more than headcount. The "humanist AI" branding and emphasis on clean data provenance also positions Microsoft favorably in an environment where copyright lawsuits and training data concerns are mounting across the industry.

The big question is whether Suleyman's team can replicate these results in the domain that actually matters most — general-purpose language models. Building specialized models for transcription and image generation is a different order of complexity from a frontier LLM. But consider what he's demonstrated so far: three best-in-class or near-best models, built by tiny teams, running on half the industry-standard GPU footprint, priced below every competitor. That's a strong opening hand.
