Quick answer: What are the Inference Chip Wars?
The Inference Chip Wars are the shift from GPU-only AI infrastructure to inference-specialized hardware competing on cost per token, latency, power efficiency, and context handling. NVIDIA Rubin extends the GPU baseline with a rack-scale platform, while MatX bets on programmable LLM-first acceleration and Taalas bets on model-as-silicon for stable, high-volume workloads. The right decision is workload-specific: fast-changing multi-model stacks usually favor GPUs, while stable inference at scale can favor specialization.
Most AI today runs on general-purpose hardware. To be more concrete: GPUs. For years, it felt borderline unrealistic to crack the ecosystem NVIDIA built around them. Then something interesting happened, and it happened fast.
The year is still young, but the hardware news has already delivered some exciting developments, and they are focused on inference. This January, at CES, NVIDIA introduced its next platform, Vera Rubin, as a rack-scale AI supercomputer made of six chips and tuned for inference-heavy, communication-heavy workloads. Last week, Taalas raised $169 million and is demoing a product that takes the most extreme inference bet in the market: instead of running a model on hardware, it turns a specific model into hardware (the model-as-hardware concept). And only yesterday, MatX raised a $500 million Series B for MatX One, an LLM-first accelerator that is ambitious on paper and still a dark horse in practice because it is not shipping yet (though it has giants backing it).
Taken together, these moves show that the competition now plays out on the operational realities of inference: latency, cost per token, power, and the physics of moving parameters and KV cache around. GPUs remain the baseline. The crack is in the assumption that one baseline will serve every inference regime forever.
In today’s episode, we will do three things:
Update the GPU story: Blackwell was the flagship for the previous wave. Vera Rubin is the new reference point, and it is a platform play.
Explain why inference is the opening challengers can attack, even when training remains GPU-heavy.
Use Taalas and MatX as two opposite bets on what comes next, then place them in the broader landscape of inference-specialized hardware.
If you want the takeaway upfront: MatX is a bold programmable challenger with a long runway. Taalas is the most different idea in the market, and it deserves a closer look because it forces you to ask a hard question: how stable does a model need to be before it makes sense to etch it into silicon?
Why inference became the battlefield
Training is the glamorous part of the AI lifecycle. It is also the part that looks the most like classic High-Performance Computing (HPC): huge matrix multiplies, enormous clusters, performance measured in throughput at scale.
Inference is where the invoice shows up: once a model ships, you run it millions of times in the messy, user-driven world that benchmarks barely approximate. Prompts get weird, context windows bloat, and the system stack keeps accreting retrieval, tool calls, and multi-agent coordination, each layer adding latency, cost, and fresh new ways to fail. Even a “single response” can turn into a chain of generation steps and intermediate reasoning, which changes what you are paying for. For the fundamentals of LLM inference optimization — TensorRT, quantization, and hardware choices — see our guide to optimizing LLM inference.
This shift matters because the bottlenecks shift too. Token generation is sequential at the user level. You can parallelize many things, but you still produce outputs token by token. The model’s “knowledge” lives in its weights, and inference constantly streams weights and reads and writes KV cache. When the workload is constrained by data movement, a design that reduces data movement can beat a design that does more compute, even if the compute is more flexible.
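To make the data-movement argument concrete, here is a minimal back-of-the-envelope sketch. It assumes a decode step has to stream every weight from memory once, and it ignores KV cache traffic and compute time entirely. The parameter count, the 2-bytes-per-weight precision, and the 4 TB/s bandwidth figure are illustrative assumptions, not the specs of any chip discussed here.

```python
def decode_tokens_per_second_bound(weight_bytes, bandwidth_bytes_per_s, batch_size=1):
    """Upper bound on decode throughput when each step streams all weights once.

    One decode step produces one token per sequence in the batch, so batching
    amortizes the same weight traffic over more tokens.
    """
    steps_per_second = bandwidth_bytes_per_s / weight_bytes
    return steps_per_second * batch_size


# Illustrative assumptions only (not vendor specifications).
WEIGHT_BYTES = 8e9 * 2   # 8B parameters stored at 2 bytes each
BANDWIDTH = 4e12         # 4 TB/s of memory bandwidth

per_user = decode_tokens_per_second_bound(WEIGHT_BYTES, BANDWIDTH)
batched = decode_tokens_per_second_bound(WEIGHT_BYTES, BANDWIDTH, batch_size=32)

print(f"batch 1:  at most {per_user:.0f} tokens/s for a single user")
print(f"batch 32: at most {batched:.0f} tokens/s in aggregate")
```

Under these assumptions, a single user cannot exceed a few hundred tokens per second no matter how much compute sits behind the memory, which is why designs that shrink or eliminate weight traffic can change the picture.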
NVIDIA’s own Rubin messaging makes this explicit: inference is now communication-heavy and system-bound, so the platform story centers on rack-scale interconnect, predictable latency, high utilization, and infrastructure for context and data movement.
(If you want to refresh your knowledge of chip types, here is our full guide on AI hardware, where we explain how the main types work.)
The GPU era, updated: Vera Rubin is the new baseline
A GPU is still a simple idea at heart: lots of parallel compute cores, fed by high-bandwidth memory, executing a program compiled from software. Inside a modern data-center GPU, you will find billions of transistors etched onto one or more silicon dies, arranged into many lightweight processing cores. Those cores sit next to a hierarchy of on-chip SRAM – registers, shared memory, and large L2 cache – which acts as a high-speed staging area and tries to catch reuse before the GPU has to reach out to external memory.
The cores are also paired with stacked High-Bandwidth Memory (HBM), usually sitting on the same package. HBM itself is built from vertically stacked DRAM layers connected through a silicon interposer so data can move quickly between memory and compute.
When you run a language model on a GPU, the model’s weights live in that memory. The GPU keeps fetching them as it generates tokens. Intermediate results shuttle between SRAM and HBM. Attention reads and writes its KV cache. It is a powerful, flexible pipeline, and it is also why inference can become expensive: moving the model’s own “stuff” is often the limiting factor.
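The KV cache part of that traffic is easy to size with a short sketch. The layer count, KV-head count, and head dimension below are our assumption of the publicly documented Llama 3.1 8B configuration (32 layers, 8 KV heads, head dimension 128), and 2 bytes per value assumes a 16-bit cache. Treat the result as an estimate, not a measurement.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Size of the attention KV cache in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Assumed model configuration (Llama 3.1 8B style), 16-bit cache entries.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=1)
long_context = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=128 * 1024)

print(f"per token:          {per_token / 1024:.0f} KiB")
print(f"128K-token context: {long_context / 1024**3:.0f} GiB")
```

With these assumptions, the cache costs 128 KiB per token and 16 GiB for one 128K-token sequence, which is the same order of magnitude as the 16-bit weights themselves. That is why long context turns memory capacity and bandwidth into first-order concerns.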
Why GPUs won (and keep winning)
First, they fit the math of deep learning. Matrix and tensor operations parallelize well, and GPUs are built to split a big job into thousands of smaller tasks and compute them simultaneously.
Second, they are programmable. A model is software in the GPU paradigm. You can run many models on the same hardware, swap checkpoints, tune kernels, change frameworks, and keep your infrastructure useful while the model landscape mutates.
Third, they come with an ecosystem. NVIDIA’s software stack (CUDA) has been a compounding advantage for over a decade. Even when a competing chip has an interesting architecture, it still has to earn its place in real deployment pipelines.
Vera Rubin changes the unit of thinking
Vera Rubin is NVIDIA treating the rack as the unit of performance and efficiency. NVIDIA describes the Rubin platform as “six new chips” that together form an AI supercomputer, and positions the flagship Vera Rubin NVL72 rack as a rack-scale accelerator within a larger AI factory. Production ramps in the second half of 2026, which is also why “Blackwell vs challengers” already reads like last season’s debate.
Here are the Rubin facts that matter for an inference-centric conversation:
Rubin GPU: 50 PFLOPS inference performance per Rubin GPU and 288GB HBM4 per package, designed around low-precision inference formats.
Rubin NVL72: the rack-scale system integrates 72 Rubin GPUs and 36 Vera CPUs, alongside ConnectX-9 SuperNICs, BlueField-4 DPUs, and NVLink 6 switching as the scale-up fabric.
NVLink 6 scale-up: 3.6 TB/s NVLink bandwidth per GPU and up to 260 TB/s of NVLink bandwidth per rack.
Cost-per-token framing: NVIDIA claims up to a 10× reduction in inference token cost compared with Blackwell in its platform announcement, and the developer blog shows a cost-per-token comparison for long-context, reasoning-heavy inference workloads.
Context as infrastructure: the Rubin platform announcement includes an Inference Context Memory Storage Platform built around BlueField-4, designed to store and manage inference context for multi-turn and agentic workloads.
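The rack-level figures in this list follow from the per-GPU figures by simple multiplication. The sketch below only does that arithmetic on the numbers quoted above. It is not a performance model, and real sustained numbers depend on the workload.

```python
# Per-GPU figures quoted in the list above.
GPUS_PER_RACK = 72
NVLINK_TBPS_PER_GPU = 3.6
HBM4_GB_PER_GPU = 288
INFERENCE_PFLOPS_PER_GPU = 50

rack_nvlink_tbps = GPUS_PER_RACK * NVLINK_TBPS_PER_GPU
rack_hbm_tb = GPUS_PER_RACK * HBM4_GB_PER_GPU / 1000
rack_pflops = GPUS_PER_RACK * INFERENCE_PFLOPS_PER_GPU

print(f"NVLink per rack:    {rack_nvlink_tbps:.1f} TB/s")
print(f"HBM4 per rack:      {rack_hbm_tb:.1f} TB")
print(f"Inference per rack: {rack_pflops} PFLOPS")
```

The NVLink product lands at roughly 259 TB/s, consistent with the "up to 260 TB/s" rack figure, and the rack carries about 20.7 TB of HBM4. That is the sense in which the rack, not the chip, becomes the unit of comparison.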
This doesn’t make Blackwell “bad.” It makes Vera Rubin a clearer description of where NVIDIA believes the next bottleneck is: system-level memory and interconnect, sustained utilization, and the practical conversion of megawatts into usable tokens. It also clarifies what challengers have to do: they do not need to beat Rubin everywhere. They need to dominate a specific inference regime and make the tradeoffs acceptable.
What about newcomers? Taalas and MatX
Taalas: when the chip becomes the model
Taalas is one of the rare cases where a hardware company forces you to re-evaluate what “deployment” even means. Their bet is simple to state and hard to execute: inference is important enough that it deserves fully specialized chips, each optimized for a specific model.

Image Credit: Taalas HC1 hard-wired with Llama 3.1 8B model, “The path to ubiquitous AI” Taalas blog post
Taalas proposes a conceptual shift: instead of accelerating models on hardware, hard-code the model itself into hardware. Their term for this is Hardcore Models, abbreviated HC. The name is a bold choice, and you can almost hear the person in the room who suggested “Model-as-Silicon” getting politely ignored. Still, we think Model-as-Hardware is the better term.
What makes Taalas different is not that it uses an Application-Specific Integrated Circuit (ASIC). Many companies use ASICs. Taalas takes ASIC logic and applies it at the level of a specific neural network instance.
The manufacturing trick: structured customization in two metal layers
Taalas assembles a nearly complete chip with roughly 100 layers, then performs the final customization on two metal layers to hardwire a particular model, with TSMC taking about two months to complete fabrication of a chip customized for a specific model. That is incredibly fast.
And this matters because it turns “one model per chip” from a philosophical statement into an operational loop. Full custom silicon cycles are long. If you can change two layers on a largely prefabricated base and do it on a two-month cadence, you have something closer to a hardware compiler pipeline, with a new strategic question attached: what is the minimum stability threshold where it becomes rational to tape out a model?
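One rough way to frame that stability question is to ask what share of a model's service life its dedicated chip can actually cover. The sketch below is a deliberately simple model. It assumes the hardware cycle starts the day the model is frozen, takes the two months cited above, and that the model is retired a fixed number of months after the freeze. The function name and the cadences are illustrative assumptions.

```python
def useful_fraction(model_lifetime_months, lead_time_months=2.0):
    """Share of a model's service life during which its custom chip exists.

    Assumes fabrication starts when the model is frozen and the model is
    retired `model_lifetime_months` after that freeze.
    """
    if model_lifetime_months <= 0:
        raise ValueError("model_lifetime_months must be positive")
    useful_months = max(model_lifetime_months - lead_time_months, 0.0)
    return useful_months / model_lifetime_months


for label, months in [("monthly", 1), ("quarterly", 3), ("yearly", 12)]:
    print(f"{label:>9} cadence: chip covers {useful_fraction(months):.0%} of model life")
```

Under these assumptions, a monthly update cadence leaves no useful window at all, a quarterly cadence leaves about a third, and a yearly cadence leaves more than 80%. The threshold is a business decision, but the shape of the curve is hard to argue with.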
What HC1 is: a hardwired Llama 3.1 8B system
Taalas’ first released system, HC1, is optimized for Llama 3.1 8B. From what we know, the entire 8B model fits on a single 815 mm² die (TSMC N6), with weights stored on-chip in a mask-ROM recall fabric, and SRAM used for KV cache and for updates such as LoRA. The announced speed is 16,000-17,000 tokens per second per user. If you try it yourself on chatjimmy.ai, you can watch this model-on-a-chip build a game in 0.06 seconds. It’s faster than you can think.
For comparison, OpenAI’s latest fast model, Codex-Spark, runs at 1,200 tokens per second.

Image Credit: “The path to ubiquitous AI” Taalas blog post
Taalas frames the core principle as transforming a model into custom silicon, with Hardcore Models that support fine-tuning and are faster and lower power than software-based implementations. But how do they actually do it?
Why it can be fast: reduce the memory wall by construction
The Taalas story is a memory story.
In the standard GPU workflow, the model is software, weights live in HBM, and inference repeatedly fetches weights and moves intermediate values through a stack of abstractions: kernels, scheduling, memory hierarchies, interconnect. That is flexible and powerful. It is also overhead.
Taalas collapses the abstraction layers. If the model’s weights and structure are physically realized on-chip, you bypass a large share of memory traffic and scheduling complexity. There is no dynamic model loading. There is no need to keep HBM fed with weights from outside. There is no need to schedule kernels for a model that never changes.
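A quick footprint estimate suggests why keeping an 8B model entirely on-chip is plausible. The sketch uses the HC1 number format described under the numerics tradeoff later in this article (a 3-bit base with selected weights at 6 bits). The share of weights stored at 6 bits is not something we know, so the 10% used below is purely an assumption for illustration.

```python
def weight_footprint_gb(num_params, base_bits=3, high_bits=6, high_fraction=0.0):
    """Approximate weight storage in GB for a mixed-precision format.

    `high_fraction` is the share of weights stored at `high_bits`;
    the remainder is stored at `base_bits`.
    """
    avg_bits = base_bits * (1 - high_fraction) + high_bits * high_fraction
    return num_params * avg_bits / 8 / 1e9


PARAMS = 8e9  # Llama 3.1 8B, rounded

fp16 = weight_footprint_gb(PARAMS, base_bits=16)
three_bit = weight_footprint_gb(PARAMS)
mixed = weight_footprint_gb(PARAMS, high_fraction=0.10)  # assumed 10% at 6 bits

print(f"16-bit weights:          {fp16:.1f} GB")
print(f"3-bit weights:           {three_bit:.1f} GB")
print(f"3-bit with 10% at 6-bit: {mixed:.1f} GB")
```

Going from 16 GB to roughly 3 GB is the difference between needing external memory and having a chance of fitting the whole model into dense on-die storage.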
The main shift in this architecture is that the chip stops running the model and becomes the model itself.
Taalas’ method borrows from structured ASIC approaches, where changing only a couple of masks can alter the circuit, allowing Taalas to hardwire a model without a full redesign each time. It’s not new in semiconductor history, but no one has ever done it in AI.

Image Credit: This chip runs a “baked” Llama so fast it looks like a glitch (Taalas HC1), Turing Post Youtube video
If your bottleneck is memory bandwidth and data movement, this kind of architecture can produce an outlier outcome.
Not without limitations: the tradeoffs are actually fundamental
The tradeoffs here are not “engineering inconveniences.” They are the core of the bet.
Tradeoff 1: Flexibility
A GPU is a programmable factory. You can change models, update checkpoints, patch your stack, and redeploy. A Hardcore Model behaves more like an appliance. If you need a meaningfully different model, you are talking about a new chip, even when a two-metal-layer cycle makes it faster than a full redesign.
Tradeoff 2: Model evolution cadence
Many production deployments change models more often than teams admit in public. Safety patches, data drift, product improvements, and platform changes create pressure to update. If your cadence is monthly, a two-month hardware cycle becomes a constraint. If your cadence is yearly, it starts to look plausible.
Tradeoff 3: Scaling to larger models
Hardwiring an 8B model is one thing. Hardwiring a frontier reasoning model is another. EE Times notes that scaling to a DeepSeek R1 class system could require many chips and many incremental tape-outs, even if each tape-out is “only” two metal layers.
“It means 30 incremental tape-outs, which is the annoying part, but the tape-outs are pretty cheap because it’s only two masks,” Bajic said. “The big thing at the root of this idea is the assumption that the customer is willing to commit to this [chip/model] for a year. There will definitely be a lot of people who won’t, but some people will.”
Tradeoff 4: Numerics and quality
Taalas’ first-generation silicon platform, HC1, uses an aggressive quantization scheme: a custom 3-bit base format with selected weights stored at 6 bits. This allows for smaller arithmetic units, higher memory density, and lower power per operation, but it introduces measurable quality degradation compared to GPUs. That is why the second-generation platform, HC2, adopts standard 4-bit floating-point to narrow the quality gap.
Tradeoff 5: Benchmark interpretation
Tokens per second is a seductive metric, and it is also easy to misunderstand. It depends on context length, decode versus prefill, batching, concurrency, and what “per user” means in practice. Even if the claim is correct for a specific configuration, the deployment meaning can vary. The right way to read the number is as a proof that an architecture shift can move the frontier, not as a guarantee that every workload will see that exact number.
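A small sketch shows how much the reading of a tokens-per-second claim depends on the request shape. It splits a response into prefill (processing the prompt) and decode (generating output), using the lower bound of the announced 16,000 tokens per second as the decode rate. The prefill rate of 100,000 tokens per second is a made-up placeholder, since we do not have a published figure.

```python
def response_latency_s(prompt_tokens, output_tokens,
                       prefill_tok_per_s, decode_tok_per_s):
    """End-to-end latency as prefill time plus decode time, in seconds."""
    prefill_s = prompt_tokens / prefill_tok_per_s
    decode_s = output_tokens / decode_tok_per_s
    return prefill_s + decode_s


DECODE_RATE = 16_000     # per-user decode rate, lower bound of the announced range
PREFILL_RATE = 100_000   # hypothetical placeholder, not a published figure

short_prompt = response_latency_s(200, 500, PREFILL_RATE, DECODE_RATE)
long_prompt = response_latency_s(50_000, 500, PREFILL_RATE, DECODE_RATE)

print(f"200-token prompt:    {short_prompt * 1000:.1f} ms")
print(f"50,000-token prompt: {long_prompt * 1000:.1f} ms")
```

With these placeholder rates, the same 500-token answer takes about 33 ms after a short prompt and about 531 ms after a long one. The headline decode rate is identical in both cases, but the long prompt makes prefill the dominant cost.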
Where Taalas fits, if it fits
Taalas makes the most sense when you can say “yes” to all of the following:
One model dominates your inference volume and you can keep it stable long enough to amortize dedicated silicon.
Latency is a product constraint.
Cost per token and power per token dominate your unit economics.
You can treat model updates as a planned hardware cycle rather than a daily redeploy habit.
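The checklist above boils down to unit economics, which can be sketched as hardware cost amortized over the commitment period plus energy, divided by the tokens actually served. Every number below is hypothetical and chosen only to show how the amortization window moves the result. None of it is vendor pricing or measured performance.

```python
def cost_per_million_tokens(hardware_cost_usd, lifetime_years, power_kw,
                            usd_per_kwh, tokens_per_second, utilization):
    """Amortized hardware cost plus energy cost per one million tokens."""
    hours = lifetime_years * 365 * 24
    capex_per_hour = hardware_cost_usd / hours
    energy_per_hour = power_kw * usd_per_kwh
    tokens_per_hour = tokens_per_second * utilization * 3600
    return (capex_per_hour + energy_per_hour) / tokens_per_hour * 1e6


# Hypothetical system: same hardware, different commitment windows.
common = dict(hardware_cost_usd=50_000, power_kw=2.0, usd_per_kwh=0.10,
              tokens_per_second=20_000, utilization=0.5)

one_year = cost_per_million_tokens(lifetime_years=1, **common)
three_years = cost_per_million_tokens(lifetime_years=3, **common)

print(f"amortized over 1 year:  ${one_year:.3f} per million tokens")
print(f"amortized over 3 years: ${three_years:.3f} per million tokens")
```

The point of the sketch is the sensitivity, not the values: a model-specific chip has to earn back its cost within the period the model stays in service, so both the commitment window and sustained utilization weigh heavily on the per-token price.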
If your reality is constant model iteration and a zoo of models, GPUs remain the simpler answer. Taalas can still be useful for a subset of the stack, but it will not replace a general-purpose fleet. That is why this is interesting: inference hardware is splitting by workload regime, and no single architecture is going to look optimal across all of them.
MatX: the programmable challenger, still a dark horse
If Taalas is extreme specialization, MatX is closer to the classic dream: build a programmable chip that outperforms GPUs for LLM workloads without forcing you into one-model-per-silicon commitments.
MatX just raised a $500M Series B and says the funding will “wrap up development and quickly scale manufacturing,” with tapeout in under a year. TechCrunch reports MatX’s goal is to make its processors 10× better at training LLMs and delivering results than NVIDIA’s GPUs.
MatX’s positioning for MatX One is explicit about the pain points it wants to fix:
MatX One is built around a splittable systolic array: a big MAC grid that can be partitioned into smaller independent tiles, so the chip stays busy across the awkward, skinny, batch-1 matrix shapes you see in real inference, instead of shining only on the fattest dense GEMMs.
MatX describes a design that pairs low latency from on-chip SRAM with HBM for long context, aiming to cover both interactive decode and large-context regimes.
It hints at “a fresh take on numerics,” which is usually where hardware teams hide the most consequential decisions.
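To see why a splittable array could help with awkward shapes, consider a toy padding model. This is not MatX's actual design, which has not been published in detail. It simply assumes that a weight block must be padded up to whole tiles, and that when the array is split there is enough independent work to keep every tile busy. The 512 and 128 sizes are arbitrary assumptions.

```python
import math


def mac_utilization(k, n, tile_rows, tile_cols):
    """Share of MAC cells doing useful work for a (k x n) weight block.

    The block is padded up to a whole number of tiles in each dimension,
    so small or oddly shaped blocks waste cells on large tiles.
    """
    padded_k = math.ceil(k / tile_rows) * tile_rows
    padded_n = math.ceil(n / tile_cols) * tile_cols
    return (k * n) / (padded_k * padded_n)


ARRAY = 512  # hypothetical monolithic array: 512 x 512 MACs
TILE = 128   # hypothetical split: independent 128 x 128 tiles

for k, n in [(128, 128), (640, 128), (4096, 4096)]:
    whole = mac_utilization(k, n, ARRAY, ARRAY)
    split = mac_utilization(k, n, TILE, TILE)
    print(f"{k:>4} x {n:<4}  monolithic: {whole:6.1%}   split: {split:6.1%}")
```

In this toy model, large square blocks keep both layouts fully busy, while small or skinny blocks leave most of a monolithic array idle and fill the smaller tiles completely. Real utilization also depends on dataflow, memory feeding, and scheduling, none of which this sketch captures.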
So why call it a dark horse? Because what exists today is the promise, the funding from big names like Jane Street and Karpathy, and the architectural intent, while the deployable reality is still ahead. MatX’s own statements frame the product as not yet shipping, and public reporting points to 2027 as the start of shipping and deployment. A lot can happen between “tapeout in under a year” and “a system you can run 24/7 with a real compiler, real kernels, reliability, and support.”
MatX still deserves attention because it represents the programmable challenger category with serious capital behind it. Its promise is familiar: give teams a chip that feels like a better GPU for LLMs. Its challenge is also familiar: silicon is hard, and ecosystems are harder.
The broader inference ASIC landscape, organized by the core bet
Taalas and MatX are the headline contrasts: extreme specialization versus ambitious programmability. The rest of the market is more diverse than a list of company names suggests. A better map is to group by what each architecture is optimizing.
Hyperscaler vertical integration: Google TPU Ironwood
Google’s Ironwood is hyperscaler logic in its purest form: build custom silicon, deploy it at enormous scale, and offer it through the cloud.
Google introduced Ironwood as its seventh-generation TPU and described it as the first TPU designed specifically for inference, aimed at “thinking” and inference-heavy AI models at scale. Google’s TPU documentation describes TPU7x as the first release in the Ironwood family, designed for large-scale training and inference, including dense and MoE models and decode-heavy inference, scaling up to 9,216 chips per pod.

Image Credit: “In-Datacenter Performance Analysis of a Tensor Processing Unit” paper
Ironwood matters in this story because it shows where the biggest operators are placing their bets: inference gets dedicated silicon, and scaling happens as a system design problem, not a single-chip spec fight. In October 2025, Google’s TPUs had a big moment: Anthropic signed a deal for up to 1 million TPUs, representing over a gigawatt of compute, which makes it one of the biggest AI infrastructure agreements to date.
Cloud inference ASICs: AWS Inferentia2
AWS takes a different hyperscaler approach: build inference silicon as a cost-performance lever inside a broader cloud platform. Inferentia is not trying to replace GPUs everywhere. It is trying to offer a cheaper, efficient inference path for a large share of production workloads.
AWS’ launch materials for Inf2 instances powered by Inferentia2 claim up to 50% better performance per watt than comparable EC2 instances. AWS’ machine learning blog describes Inferentia2 as delivering up to 4× higher throughput, up to 10× lower latency, and up to 50% better performance per watt than comparable inference-optimized instances.
AWS is hard to ignore because it is a default deployment target for a huge share of the industry. Even if Inferentia captures only part of that market, it changes what “reasonable” inference cost looks like inside the cloud.
For an in-depth explanation of these ASICs and other types of hardware that exist today, check out this guide → AI 101: Intelligence Processing Unit (IPU) and other alternatives to GPU/TPU/CPU
Deterministic low-latency inference: Groq, and why NVIDIA is paying attention
Groq built its identity around low-latency inference through a deterministic execution model. Groq’s Language Processing Unit (LPU), designed specifically for LLM inference, falls squarely into the ASIC category.

Image Credit: What is a Language Processing Unit? (Groq Whitepaper). GPU is on the left, LPU is on the right
The notable update is that NVIDIA signed a non-exclusive licensing agreement with Groq for Groq’s inference technology, and Groq’s founder Jonathan Ross and president Sunny Madra joined NVIDIA as part of that deal.
From a competitive standpoint, this is revealing. NVIDIA is still selling the most advanced GPU platforms in the world, and it still chose to license inference technology and hire leadership from a specialized inference startup. That makes it hard to treat inference specialization as a side project.
Breaking the memory wall: d-Matrix Corsair and in-memory compute
d-Matrix represents a design philosophy that keeps gaining momentum: if inference is bandwidth-bound, put compute in or next to memory.

Image Credit: “d-Matrix Corsair Redefines Performance and Efficiency for AI Inference at Scale” Whitepaper
In its Corsair technical whitepaper, d-Matrix describes Corsair as an integrated compute-and-memory PCIe card with 3200 mm² of silicon, 2 GB on-chip “Performance Memory” delivering 150 TB/s memory bandwidth, and up to 256 GB off-chip capacity memory, with dense compute figures reported in MXINT formats.
Even if you treat every marketing number with suspicion, the architectural intent is clear: reduce off-chip memory movement and make token generation cheaper and faster by keeping the hot path on-chip. This direction is also a practical response to HBM realities: it’s expensive, power hungry, and constrained by supply.
Wafer scale: Cerebras WSE-3 as the “make the chip the wafer” extreme
Cerebras remains the canonical radical form-factor story. Its WSE-3 press release lists 4 trillion transistors, 900,000 AI cores, 125 petaflops of peak AI performance, and 44 GB on-chip SRAM on a wafer-scale chip built on TSMC 5nm.

Image Credit: Cerebras Wafer-Scale Engine (WSE) datasheet
Cerebras is more often discussed in the training context, but it is relevant to inference for the same reason it is relevant everywhere: it attacks coordination overhead and memory locality by collapsing a massive amount of compute and SRAM onto a single device.
Putting it together: what “cracks in the GPU era” actually means
It is tempting to write the dramatic version where the GPU monopoly ends in a single season. The reality is messier and more interesting: the GPU era is fragmenting by workload regime, and inference is the first place where that fragmentation becomes economically meaningful.
Vera Rubin is NVIDIA’s answer: treat inference as a system problem and optimize rack-scale utilization, interconnect, and context handling as first-class design constraints. MatX is a challenger bet that tries to beat GPUs on their own turf: programmable acceleration for LLM workloads, with architectural choices aimed at utilization, latency, and long context. Taalas is the most different bet: if inference for a model becomes stable enough, you can treat the model as hardware and remove a large share of overhead, at the price of flexibility.
These bets can all be true at the same time because they are optimizing different contracts between the model and the machine.
Conclusion: the new unit of competition is cost-per-token under real constraints
The GPU era is still here, but it is becoming less monolithic.
NVIDIA is moving up the abstraction ladder: Vera Rubin frames the rack as the accelerator and treats context and interconnect as part of the product. Hyperscalers are doing the same with their own silicon and their own system-level architectures. Startups are exploring the space in between, from MatX’s programmable LLM-first ambitions to Taalas’ extreme specialization.
If you want to watch where this goes next, don’t stare at peak FLOPs. Watch cost per token, latency under long context, and the shape of real production stacks. That is where the next generation of inference hardware winners will be decided, one painfully expensive infrastructure bill at a time.
Don’t forget to watch this video about Taalas to understand this new wave in hardware more thoroughly.
Sources and further reading
The path to ubiquitous AI | Taalas blog post
Taalas website
Chatjimmy to try Taalas
Taalas Specializes to Extremes for Extraordinary Token Speed (EETimes)
Taalas Launches Hardcore Chip With ‘Insane’ AI Inference Performance (Forbes)
AI Chip Startup MatX Raises $500 Million to Compete With Nvidia (Bloomberg)
MatX’s co-founder Reiner Pope’s tweet
MatX website
NVIDIA Kicks Off the Next Generation of AI With Rubin — Six New Chips, One Incredible AI Supercomputer (Nvidia newsroom)
NVIDIA Vera Rubin NVL72 (blog)
TPU architecture | Google Cloud documentation
An in-depth look at Google’s first Tensor Processing Unit (TPU) | Google Blog post
In-Datacenter Performance Analysis of a Tensor Processing Unit | Paper, 2017
WSE-3 Datasheet
A Comparison of the Cerebras Wafer-Scale Integration Technology with Nvidia GPU-based Systems for Artificial Intelligence | Paper
What is a Language Processing Unit? | Groq Whitepaper
d-Matrix® Corsair™ Redefines Performance and Efficiency for AI Inference at Scale | Whitepaper
Resources from Turing Post
This chip runs a “baked” Llama so fast it looks like a glitch (Taalas HC1) | Video


