A $600 Card That Runs 10,000 Tokens Per Second
Taalas wants to hardwire LLMs directly into silicon. Their ASIC approach delivers 10x Cerebras speed at a fraction of GPU power draw — but each chip runs exactly one model.

For $600 to $800, a startup called Taalas says it can sell you a PCIe card that runs Qwen 3.5 27B at 10,000 tokens per second. Locally. No cloud, no GPU, no fans screaming at full tilt. The catch is one you might not expect: the model is literally etched into the chip. You can't change it. Ever.
The rumor surfaced on r/singularity this week, drawing 417 upvotes and 176 comments. If even half of what's claimed is real, this is one of the more radical bets in AI hardware right now.
What Taalas Actually Built
Taalas is a 24-person startup that has raised over $200 million and spent $30 million of it answering a simple question: what happens if you stop treating AI inference as a software problem and start treating it as a hardware one?
Their answer is an ASIC — application-specific integrated circuit — where the entire neural network is hardwired into silicon. Not loaded from disk, not held in VRAM, not shuffled through a general-purpose compute pipeline. The weights, the architecture, the inference logic: all burned into metal. CEO Ljubisa Bajic calls the approach "Hardcore AI," and the name isn't entirely marketing.
The first product, the HC1, shipped with Llama 8B baked in. Forbes covered the launch in February, reporting 14,357 tokens per second on a World War II history test. That's roughly 10x faster than Cerebras's wafer-scale engine and about 100x faster than conventional GPUs on the same model. The cost per million tokens comes out to 0.75 cents for Llama 8B — compare that to 3.79 cents on the cheapest GPU providers and up to 28.6 cents on more expensive ones.
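The cost figures above are easier to compare as ratios. A quick back-of-envelope check, using only the per-million-token prices quoted in the article:

```python
# Cost comparison using the article's claimed figures
# (all values in cents per million tokens, Llama 8B).
costs = {
    "Taalas HC1": 0.75,
    "Cheapest GPU provider": 3.79,
    "Expensive GPU provider": 28.6,
}

baseline = costs["Taalas HC1"]
for name, cents in costs.items():
    ratio = cents / baseline
    print(f"{name}: {cents:.2f} cents/M tokens ({ratio:.1f}x Taalas)")
```

By these numbers, even the cheapest GPU provider is roughly 5x the claimed Taalas cost, and the expensive end is about 38x.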
The power numbers are equally striking. Taalas claims 12 to 15 kilowatts per rack versus the 120 to 600 kilowatts that GPU racks demand. The cards are air-cooled and fit a standard PCIe slot on any server with Intel or AMD CPUs. No liquid cooling loops, no custom enclosures.
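Because both sides of the power claim are ranges, the implied reduction factor spans a wide band. A minimal sketch, using only the rack figures quoted above:

```python
# Rack power comparison from the article's claimed ranges (kilowatts).
taalas_kw = (12, 15)   # claimed Taalas rack draw
gpu_kw = (120, 600)    # claimed GPU rack draw

conservative = gpu_kw[0] / taalas_kw[1]  # lowest GPU vs. highest Taalas
optimistic = gpu_kw[1] / taalas_kw[0]    # highest GPU vs. lowest Taalas
print(f"Claimed power reduction: {conservative:.0f}x to {optimistic:.0f}x")
```

That is, the claim works out to somewhere between an 8x and a 50x reduction in rack power, depending on which ends of the ranges you compare.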
Now the rumored next step: a PCIe card with Qwen 3.5 27B hardwired in, targeting 10,000 tok/s at a production cost of $300 to $400. Retail pricing would land somewhere between $600 and $800. For context, that's less than an RTX 3090 and roughly 18x faster than a single Intel Arc Pro B70 running the same model.
The Trade-Off Nobody Can Ignore
Every chip runs exactly one model. That's it. If Qwen releases version 4 next month, your Taalas card still runs 3.5 27B. If you need a different model for a different task, you need a different chip.
Taalas addresses this partially — the company says it can produce a new chip in about two months by swapping just two metal layers in the fabrication process. And the cards do support LoRA fine-tuning with a configurable context window, so you're not completely locked into a static system. But the fundamental constraint remains: this is not a general-purpose device.
The 176-comment Reddit thread circled this trade-off relentlessly. Some saw it as a dealbreaker — data centers don't want dozens of different SKUs cluttering their racks, and the operational complexity of managing model-specific hardware is non-trivial. Others argued that for high-volume inference on a single popular model, the economics are so favorable that the rigidity doesn't matter. If you're running Qwen 3.5 27B as your primary production model and processing millions of requests daily, a card that does it at 0.75 cents per million tokens while sipping power is hard to argue with.
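To see why the single-model economics argument lands, it helps to put hypothetical numbers on "millions of requests daily." The request volume and tokens-per-request below are illustrative assumptions, not figures from Taalas; the two per-million-token rates are the ones quoted earlier in the article:

```python
# Hypothetical production workload -- volume and token counts are
# assumptions for illustration, not claims from Taalas.
requests_per_day = 5_000_000      # assumed request volume
tokens_per_request = 800          # assumed prompt + completion tokens
taalas_cents_per_m = 0.75         # article's claimed Taalas rate
gpu_cents_per_m = 3.79            # article's cheapest-GPU rate

daily_tokens = requests_per_day * tokens_per_request
taalas_daily_usd = daily_tokens / 1e6 * taalas_cents_per_m / 100
gpu_daily_usd = daily_tokens / 1e6 * gpu_cents_per_m / 100

print(f"{daily_tokens / 1e9:.1f}B tokens/day")
print(f"Taalas: ${taalas_daily_usd:,.2f}/day vs GPU: ${gpu_daily_usd:,.2f}/day")
```

Under these assumed numbers, a 4-billion-token day costs about $30 on the claimed Taalas rate versus roughly $152 on the cheapest GPU rate, before accounting for the card's fixed price or power savings.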
Where This Fits in the Inference Landscape
The local AI hardware market has gotten dramatically more competitive in the past few months. NVIDIA's DGX Spark brought desktop-class inference to developers. Intel's Arc Pro B70 threw 32GB of VRAM at the problem for under a thousand dollars. GPU prices across the board are trending downward.
All of those solutions share one trait: flexibility. You can load any model, swap quantization levels, experiment with new architectures as they appear. Taalas throws that flexibility away and gets raw speed and power efficiency in return. It's a fundamentally different philosophy — closer to how crypto mining evolved from GPUs to purpose-built ASICs — and it appeals to a fundamentally different buyer.
The upcoming HC2, planned for winter 2026, aims to push the concept further: frontier-scale LLMs, multi-chip configurations, terabyte-scale parameter counts. If Taalas can deliver that, the conversation shifts from "interesting startup" to "serious infrastructure alternative."
What to Make of This
Taalas is betting that the AI industry will converge on a small number of dominant models deployed at enormous scale — and that for those specific models, purpose-built silicon will obliterate general-purpose hardware on cost, speed, and power. It's a bet against the pace of model turnover and in favor of inference economics.
Whether that bet pays off depends on how quickly the model landscape stabilizes. If the industry keeps releasing meaningfully better models every few months, burning last quarter's model into silicon is an expensive way to fall behind. If the gains start plateauing — if Qwen 3.5 27B is good enough for the next year — then a $600 card that runs it at 10,000 tokens per second starts looking less like a curiosity and more like the future of commodity inference.
Founded two and a half years ago with a team of 24, Taalas has already turned $200 million in funding into working silicon. That's the part that makes this more than a white paper. The HC1 exists. It ships. It's fast. The question is whether the world wants inference hardware that's blindingly good at one thing and useless at everything else. Based on the Reddit thread, at least 417 people think the answer might be yes.


