
Voxtral TTS Beats ElevenLabs — And It's Open Weight

Mistral AI releases Voxtral TTS, a 4B-parameter text-to-speech model that outperforms ElevenLabs Flash v2.5 in blind tests. Open weights, 9 languages, 90 ms latency.

Reviewed and published by Vlad Makarov

Sixty-three percent. That's roughly the share of human evaluators (62.8%, to be exact) who preferred Voxtral's voice output over ElevenLabs Flash v2.5 in blind A/B testing. Mistral AI quietly dropped its first text-to-speech model on March 26, and the results suggest the voice AI market just got a serious new competitor, one that gives its weights away for free.

The Architecture

Voxtral TTS packs roughly 4 billion parameters into three distinct components: a 3.4B transformer decoder built on the Ministral 3B backbone, a 390M flow-matching module, and a 300M neural audio codec running at a 12.5 Hz frame rate. The design is autoregressive: it generates speech tokens sequentially, then uses the flow-matching stage to refine the audio signal into natural-sounding output.
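As a quick sanity check on those figures, the sketch below adds up the three reported component sizes and converts the codec's 12.5 Hz frame rate into a per-frame duration. The function and variable names are illustrative, not part of any Mistral tooling, and the idea that a clip of a given length maps to a simple frame count is an assumption made for the arithmetic only.

```python
import math

# Component sizes as reported in the article (parameter counts).
COMPONENTS = {
    "transformer_decoder": 3.4e9,
    "flow_matching": 390e6,
    "audio_codec": 300e6,
}
CODEC_FRAME_RATE_HZ = 12.5  # codec frames per second of audio, as reported


def total_parameters(components: dict) -> float:
    """Sum the parameter counts of all components."""
    return sum(components.values())


def frame_duration_ms(frame_rate_hz: float) -> float:
    """Length of audio covered by one codec frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def frames_for(audio_seconds: float, frame_rate_hz: float) -> int:
    """Number of codec frames needed to cover a clip (assumed: rounded up)."""
    return math.ceil(audio_seconds * frame_rate_hz)


print(f"total parameters: {total_parameters(COMPONENTS) / 1e9:.2f}B")
print(f"one frame covers: {frame_duration_ms(CODEC_FRAME_RATE_HZ):.0f} ms")
print(f"frames in 10 s:   {frames_for(10, CODEC_FRAME_RATE_HZ)}")
```

The components sum to about 4.09B, which is where the rounded "4B" headline figure comes from, and each codec frame covers 80 ms of audio.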

The result is fast. Mistral reports a time-to-first-audio of 90 milliseconds and synthesis at roughly 6x real time, meaning a minute of speech takes about ten seconds to generate. When quantized, the entire model fits in approximately 3 GB of RAM, which puts it within reach of modern laptops and edge devices. It's a theme we're seeing across the industry, from Google's TurboQuant compression work to NVIDIA's Nemotron Cascade approach to efficient inference.
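To put the speed and memory numbers in perspective, here is a back-of-the-envelope sketch. It assumes the roughly 4.09B total parameter count implied by the component sizes and tries a few weight bit-widths; Mistral's actual quantization scheme is not stated in this article, so the bit-widths are hypothetical, and the estimate covers weights only (no activations, caches, or runtime overhead).

```python
TOTAL_PARAMS = 4.09e9          # sum of the three reported component sizes
REAL_TIME_FACTOR = 6.0         # "roughly 6x real-time", as reported
TIME_TO_FIRST_AUDIO_S = 0.090  # 90 ms, as reported


def synthesis_time_s(audio_seconds: float, speedup: float = REAL_TIME_FACTOR) -> float:
    """Wall-clock time to generate a clip at a given real-time speedup."""
    return audio_seconds / speedup


def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


# Hypothetical bit-widths; the article does not say which one Mistral uses.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.2f} GB")

print(f"60 s of audio: about {synthesis_time_s(60):.0f} s to synthesize")
```

Under these assumptions, 4-bit weights come to roughly 2 GB, which would leave headroom inside the reported 3 GB figure. How Mistral actually arrives at 3 GB is not something this sketch can confirm.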

What It Can Do

Nine languages out of the box: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. But the standout feature is voice adaptation. Feed Voxtral as little as five seconds of reference audio, and it clones the voice. More impressively, it handles zero-shot cross-lingual adaptation — record someone speaking French, and Voxtral can generate that same voice in German or Arabic without any additional training.

In human evaluation, the numbers tell a clear story:

  • 62.8% preference over ElevenLabs Flash v2.5 on flagship voices
  • 69.9% preference on voice customization tasks
  • Parity with ElevenLabs v3 (their premium tier) on emotional expressiveness

You can hear the results yourself in Mistral's demo video.

Pricing and Access

Voxtral is available through three channels: Mistral's API, Mistral Studio, and Le Chat. API pricing sits at $0.016 per 1,000 characters — competitive with existing TTS providers, though the real value proposition is elsewhere.
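For a sense of what that rate means in practice, the sketch below estimates API spend from a character count. It assumes billing is strictly proportional to characters at $0.016 per 1,000; how the API actually counts characters (whitespace, markup, minimum charges) is not covered here, and the helper name is made up for illustration.

```python
PRICE_PER_1K_CHARS_USD = 0.016  # API price as reported in the article


def estimate_cost_usd(num_chars: int, price_per_1k: float = PRICE_PER_1K_CHARS_USD) -> float:
    """Estimated cost for synthesizing num_chars characters (assumed pro-rata)."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return num_chars / 1000 * price_per_1k


script = "Voxtral TTS is a 4B parameter text-to-speech model from Mistral AI."
print(f"{len(script)} characters: ${estimate_cost_usd(len(script)):.6f}")
print(f"1,000,000 characters: ${estimate_cost_usd(1_000_000):.2f}")
```

At the listed rate, a million characters of text works out to about $16.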

The open weights are on Hugging Face under CC BY-NC 4.0. That's a non-commercial license, so startups building voice products will still need to pay for API access or negotiate a commercial license. But for researchers, hobbyists, and companies evaluating TTS internally, the barrier to entry just dropped to zero. The move echoes Alibaba's aggressive open-source strategy: release the weights, build the ecosystem, monetize on enterprise services.

The Business Context

Mistral is making a deliberate bet. Pierre Stock, the company's VP of Science, put it bluntly: "We see audio as a big bet and as a critical and maybe the only future interface with all the AI models." That's a strong claim from a company that built its reputation on text-based LLMs like Mistral Large and Mistral Small.

The timing makes strategic sense. The voice AI market crossed $22 billion globally in 2026, and Mistral — freshly valued at $13.8 billion after a $2 billion Series C led by ASML — has the capital to diversify beyond text. Voxtral TTS joins Voxtral Transcribe (their speech-to-text model) in building a complete audio stack alongside Mistral's existing LLMs, Forge, and AI Studio.

The enterprise angle is explicit. Mistral is pitching data sovereignty — own your voice AI infrastructure rather than routing audio through third-party APIs. For European companies navigating tightening data regulations, that's not a trivial selling point.

What This Means

The TTS market has been dominated by a handful of closed providers: ElevenLabs, Amazon Polly, Google Cloud TTS. Voxtral doesn't overthrow them overnight, but it changes the economics. A 4B parameter model that fits in 3 GB of RAM when quantized, clones voices from five seconds of audio, and beats ElevenLabs Flash v2.5 in blind tests? That's the kind of release that forces pricing pressure across the entire market.

The non-commercial license is the obvious limitation. Researchers and tinkerers get open weights; businesses get an API bill. It's the same dual-licensing playbook that's become standard in the industry, and it means Voxtral's real competitive impact depends on how aggressively Mistral prices commercial access.

For developers already in the Mistral ecosystem, Voxtral slots in naturally: combine it with their LLMs for voice-enabled agents, or pair it with Voxtral Transcribe for two-way spoken conversations. For everyone else, it's worth watching how the open-weight community runs with this. A 3 GB TTS model that speaks nine languages is exactly the kind of thing that gets ported to every framework within a week.
