Feb 18, 2026 · 10 min read
By Bryan Bowyer
The Path to Making AI Hardware Diversity Practical
GPUs are here to stay, so what does that mean for all the new AI inference hardware?

The Reality of Bringing New AI Hardware to Market
Significant AI accelerator announcements appear almost monthly, each promising better performance, lower power, and superior economics. The new hardware capabilities are indeed impressive, but these chips need software to make AI models work. Most new AI hardware companies build custom stacks, which are necessary for hardware bring-up, early optimization, and internal development. For most of them, however, this has meant requiring customers to adopt proprietary stacks and languages they don't want to learn or support.
One of the most immediate problems with these custom software stacks is the lack of day-0 model support and of any easy way to optimize the models themselves, especially frontier models. The landscape of what is available in the open-source community has changed, and customers need ecosystem compatibility, not custom stacks, for broad deployments. Most deployments already include GPUs, so what's needed is heterogeneous infrastructure that augments them with a consistent software stack and day-0 model support.
The Market Reality: GPUs Aren't Going Anywhere
NVIDIA holds over 85% market share in AI data center GPUs. Organizations have invested billions in GPU infrastructure, many amortizing GPUs over five years to justify the spend. Additionally, new models are day-0 available on NVIDIA GPUs practically by definition of how they are developed.
Inference represents two-thirds of all AI compute, up from one-third just three years ago. Gartner projects that 55% of AI-optimized infrastructure spending will support inference workloads in 2026, reaching over 65% by 2029. Custom silicon is projected to capture 15-25% market share by 2030 alongside GPUs, not replacing them. All of this suggests that the fastest path to revenue for non-GPU hardware is coexistence.
What Worked for Bring-up Doesn't Scale for Deployment
The software ecosystem, especially open-source tools like PyTorch, vLLM, and Triton, is moving very fast. Many of their capabilities that enable hardware portability and performance were not available when most companies started developing their software. At the time, viable ecosystem alternatives were limited.
Bringing up silicon requires full control. Companies need working kernels, compilers, and runtimes, and early customers demand performance immediately. Proprietary stacks were rational survival decisions. But what works for bring-up doesn't scale for deployment. The landscape has fundamentally changed.
The Ecosystem Has Consolidated Around PyTorch/Triton
Between 75% and 85% of AI research papers use PyTorch, and PyTorch's share of production deployments has grown from 25% to 55%. Major LLMs, such as GPT, Llama, and Claude, are developed with PyTorch. PyTorch plus Triton plus vLLM is the de facto standard for LLM development and deployment. These tools work together, evolve together, and are proven at scale.
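To make the portability claim concrete, here is a pure-Python sketch (no GPU or Triton install required) of Triton's programming model: a kernel is written against one block of a launch grid, and it is the backend compiler, not the kernel author, that decides how those blocks map onto a particular chip. The names below are illustrative stand-ins, not real Triton APIs.

```python
BLOCK = 4

def add_kernel(pid, x, y, out, n):
    # One "program" handles one block of elements, analogous to a Triton
    # kernel written against tl.program_id / tl.arange with a mask.
    for i in range(pid * BLOCK, min((pid + 1) * BLOCK, n)):
        out[i] = x[i] + y[i]

def launch(kernel, grid, *args):
    # The backend decides how grid programs map to hardware; this CPU
    # stand-in simply runs them one after another.
    for pid in range(grid):
        kernel(pid, *args)

n = 10
x = list(range(n))
y = [1] * n
out = [0] * n
launch(add_kernel, (n + BLOCK - 1) // BLOCK, x, y, out, n)
# out == [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Because the kernel only sees its block, the same source can in principle be compiled for any backend that understands the grid abstraction, which is what makes a shared kernel language viable across vendors.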
Day-0 model support requires supporting this stack. Customers are deploying LLMs with PyTorch, Triton, and vLLM; if hardware can't run this stack, it can't run their models. A white-glove model-porting service doesn't scale; even large companies with 20-30 kernel engineers report they can't keep up. This limitation leaves many hardware companies invisible to the broader market because they have not fully embraced the open-source PyTorch ecosystem.
The PyTorch Ecosystem Enables Coexistence and Performance
The ecosystem of PyTorch, vLLM, and Triton is moving quickly, with major improvements announced in the last few months alone: extensions, vLLM integration, and PyTorch updates. Triton Extensions were only just announced in January of 2026. Meta has been a significant contributor, helping drive ecosystem improvements and governance forward alongside the broader community, and many of the key portability pieces are now in place.
Triton Extensions enable hardware-specific optimizations without forking the codebase, backend integration without fragmenting the ecosystem, and make day-0 model support achievable. Companies can still optimize aggressively for their hardware, which is precisely what Triton Extensions are designed for. This means hardware vendors can deliver the performance their chips are capable of while remaining compatible with the tools customers already use.
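As a rough illustration of the extension idea, the pattern is a plugin registry: a vendor registers backend-specific kernels without forking the shared stack, and the framework falls back to a generic implementation when no specialized one exists. This is a hypothetical sketch of that pattern; the function names below are invented for illustration and are not the actual Triton Extensions API.

```python
# Hypothetical plugin-registry sketch; names are illustrative only,
# not the real Triton Extensions API.
_backends = {}

def register_backend(name, kernels):
    """A vendor registers its optimized kernels without forking the stack."""
    _backends[name] = kernels

def dispatch(kernel_name, backend, *args):
    # Prefer the vendor's optimized kernel; fall back to the generic
    # reference implementation shipped with the framework.
    impl = _backends.get(backend, {}).get(
        kernel_name, _backends["generic"][kernel_name]
    )
    return impl(*args)

# Generic reference implementation shipped with the framework.
register_backend("generic",
                 {"vector_add": lambda x, y: [a + b for a, b in zip(x, y)]})

# A (hypothetical) vendor plugs in a kernel specialized for its chip.
register_backend("acme_npu",
                 {"vector_add": lambda x, y: [a + b for a, b in zip(x, y)]})

print(dispatch("vector_add", "acme_npu", [1, 2], [3, 4]))      # [4, 6]
print(dispatch("vector_add", "unknown_chip", [1, 2], [3, 4]))  # falls back: [4, 6]
```

The design point is that specialization lives behind a stable dispatch boundary: models and serving code call the same operation everywhere, while each vendor competes on the implementation underneath.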
Winning AI Inference Market Share
When AI accelerator vendors embrace the PyTorch, Triton, and vLLM ecosystem, they can scale faster. They can use Triton Extensions to optimize performance for their specific hardware while deploying alongside existing GPUs in a strategy of coexistence.
When hardware vendors build excellent hardware and excellent Triton compiler backends, their competitive advantages speak for themselves. Superior performance and economics, seamless integration with the tools organizations already use, and day-0 model support all combine to shift the conversation from 'replace your entire infrastructure' to 'augment what you have with better economics and performance where it matters.'
In the next post, we'll dive deeper into Triton Extensions specifics and how to correctly architect portable Triton.
This is Part 2 of a series on building production-ready heterogeneous AI inference clusters. Follow along as we explore the technical foundations that make true hardware diversity achievable.

Copyright Kernelize 2025. All rights reserved.