10 min read
By Bryan Bowyer
AI Is Revolutionizing Compilers, Not Replacing Them
The future of AI-driven model development depends on well-structured open compiler workflows and systems that keep improving over time.

Introduction
The rapid progress of AI models in code generation has led to a tempting hypothesis: skip the compiler entirely and let AI emit low-level kernels directly. Models can already write code, select optimizations, and even produce working kernels, so the leap seems small. At Kernelize, we think the future looks different. AI-driven model development requires a strong, structured compiler stack with well-defined IRs and feedback loops.
Direct generation promises an escape from the 'black art' of manual kernel optimization. To a field plagued by slow compiler cycles and a scarcity of low-level engineers, the shortcut is undeniably appealing. Every new model creates another bring-up effort, another performance fire drill, and another reminder that hand-optimizing a few kernels does not scale to broad model support. So when someone says AI will solve this by generating CUDA or PTX directly, it sounds like the shortcut everyone has been waiting for.
The strongest AI systems operate inside frameworks that define what is valid, what is executable, what is correct, and what counts as an improvement. Code generation works because compilers, tests, and runtimes exist. Optimization works because there is a system that can measure whether one result is actually better than another. AI gets stronger when it has something to push against.
Kernel generation is no different.
Kernel generation alone will not scale model support
The current wave of interest in AI-generated kernels starts from reasonable observations: kernel development is painful, and AI is getting better at generating code. From there, it is not a huge leap to imagine a model that emits low-level kernels directly and learns its way toward performance.
That can work in isolated cases: you might get a good matmul or a sparse attention kernel, or even an impressive demo on a narrow benchmark. But these 'heroic kernels' do not address the systemic scaling challenges hardware teams actually face.
The real problem is that every new model introduces new shapes, new operator combinations, new fusion opportunities, and new failure modes. Every hardware target adds constraints around memory, scheduling, and execution. The challenge is building a workflow that keeps delivering correct, performant support for a constantly moving target, not finding a few heroic kernels.
Direct generation starts to break down at scale. If there is no strong intermediate structure, no clear semantic grounding, and no staged refinement from high-level computation down to hardware-specific implementation, then each success stands alone. You are collecting outputs in isolation, not building a system that improves over time.
Compilers are the framework AI needs
Many kernel generation efforts feel backwards from an engineering perspective. They treat the compiler as the obstacle, when the compiler is actually the framework that makes systematic optimization possible.
Compilers provide executable abstractions, deterministic lowerings, and places to validate both correctness and performance. They separate concerns across levels so that transformations can be reasoned about instead of guessed at. Most importantly, they create feedback. If a change is wrong, you have a better chance of seeing where it became wrong. If it is slow, you can often see where the bad decision entered the pipeline.
This is also why we do not think the future is one magic IR that solves everything. Real systems are layered. PyTorch captures computation at one level. Triton captures kernel intent at another. Hardware-specific codegen and runtime integration happen lower still. The right answer is not to flatten that stack. The right answer is to make the lowerings between those layers stronger, more extensible, and more learnable.
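To make the layering concrete, here is a minimal sketch of a staged lowering pipeline where every stage is executable and validates its own output. The stage names, IR shapes, and `Stage` abstraction are hypothetical illustrations, not a real compiler API; the point is only the structure, in which feedback surfaces at the level where a transformation went wrong.

```python
# Illustrative sketch of a layered lowering pipeline. Stage names and
# IR shapes are hypothetical, not any real compiler's API.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Stage:
    """One lowering step: transforms an IR and validates its own output."""
    name: str
    lower: Callable[[Any], Any]
    check: Callable[[Any, Any], bool]  # feedback hook at this level

def run_pipeline(stages, program):
    for stage in stages:
        lowered = stage.lower(program)
        # Feedback per layer: fail at the stage that broke, instead of
        # debugging the final machine code after a single giant leap.
        if not stage.check(program, lowered):
            raise ValueError(f"lowering failed at stage: {stage.name}")
        program = lowered
    return program

# Toy IRs: a graph-level op list lowered to loops, then to instructions.
stages = [
    Stage("graph-to-loops",
          lambda p: {"loops": p["ops"]},
          lambda before, after: after["loops"] == before["ops"]),
    Stage("loops-to-insts",
          lambda p: {"insts": p["loops"]},
          lambda before, after: after["insts"] == before["loops"]),
]

result = run_pipeline(stages, {"ops": ["matmul", "relu"]})
```

An AI that proposes a rewrite at one layer can be checked by that layer's validator before the change propagates downward, which is exactly the kind of grounding direct generation lacks.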
That structure inside a compiler is a better place to apply AI than asking a model to leap straight from intent to PTX.
AI should fill compiler gaps the way humans do today
If you look at how high-performance kernel work gets done today, the pattern is already clear. Engineers refine layer by layer. They start from framework-level behavior, identify a bottleneck, drop into a kernel language, refine the implementation, and sometimes go lower still. PyTorch to Triton to CUDA is a familiar path because it mirrors the structure of the problem.
In practice, humans are filling gaps in the compiler pipeline. They inspect intermediate behavior, decide where to intervene, rewrite an implementation at a lower level, validate correctness, and iterate on performance. We already know staged refinement works because it is how the best engineers operate today.
Triton autotuning is a simple example of the right direction. It does not pretend to replace the compiler. It works inside a structured search space, evaluates executable candidates, and keeps the best result. And the Triton compiler already supports a broader family of portable languages across different levels of abstraction. Helion is more abstract, exposing richer search parameters that can generate Triton and other lower-level languages. Triton sits in the middle as a portable kernel language with explicit performance control. Gluon is lower-level, giving developers a place to represent more hardware-facing details without stepping outside the compiler stack. This is not the final form of AI in compiler systems, but it is the right shape: structured exploration with real feedback across multiple layers of lowering.
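The shape of that structured search can be sketched in a few lines. This is plain Python mirroring the spirit of Triton's `@triton.autotune` decorator, not the Triton API itself; the config names and the mock cost function are hypothetical stand-ins for real on-hardware timing.

```python
# Sketch of autotuning as structured search: enumerate legal candidate
# configs, measure each, keep the best. Mirrors the shape of Triton's
# @autotune, but this is plain Python, not the Triton API.
import itertools

def autotune(candidate_configs, benchmark):
    """Evaluate every executable candidate and keep the best result."""
    best_config, best_cost = None, float("inf")
    for config in candidate_configs:
        cost = benchmark(config)  # real feedback: measure, don't guess
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config

# The search space is structured: only block sizes the compiler can
# actually lower are candidates, so every evaluation is executable.
configs = [{"BLOCK_M": m, "BLOCK_N": n}
           for m, n in itertools.product([32, 64, 128], repeat=2)]

# Hypothetical stand-in cost model; a real tuner would time the
# compiled kernel on the target hardware.
def mock_benchmark(cfg):
    return abs(cfg["BLOCK_M"] - 64) + abs(cfg["BLOCK_N"] - 128)

best = autotune(configs, mock_benchmark)
```

The key property is that every candidate is a valid, runnable program, so "better" is a measurement rather than a model's guess.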
Open systems will compound faster than closed ones
This is also why open source matters so much here. A closed system can look strong in the short term, especially if it is built around a narrow workflow or a carefully chosen benchmark. But compilers do not win because of one demo. They win because they become the place where improvements accumulate.
We have seen this play out before. LLVM became foundational because it gave compiler builders shared structure and extension points. PyTorch grew because it became the framework where the ecosystem actually worked. In each case, the long-term advantage came from extensibility and compounding progress, not from one isolated technical leap.
Kernel development is heading in the same direction. Closed systems have AI too, but they are learning alone. The open ecosystem is learning in public, across more workloads, more hardware targets, and more optimization paths. Over time, that matters more than any one company’s attempt to generate low-level code in isolation.
Chess engines show why AI needs structure
A good analogy here is chess engines. The strongest engines are not pure black boxes producing moves out of thin air. They combine a strict framework of legal moves, strong evaluation, search, and precomputed endgame knowledge when the problem becomes tractable enough for exact solutions. The resulting system is one of the best examples of AI exceeding human abilities.
Compiler systems should look more like that. The compiler defines the legal moves. The layered representations define the board. AI helps evaluate choices and steer optimization. Reusable kernels and libraries preserve known-good results instead of forcing the system to rediscover them every time.
The strongest systems pair AI with structure.
Kernelize is building the foundation for AI-driven model development
At Kernelize, we are turning this structural philosophy into an engineering reality. In our previous posts, especially on plugin-based compiler extensions, we have argued that the future of model support is not a giant rewrite. It is an extensible system built on top of the open compiler and runtime layers that are already becoming standard.
A plugin-based architecture provides the stable ground AI needs to iterate without breaking the system.
The future is neither a monolithic compiler nor an all-knowing model. It is a structured system of lowerings where each stage is executable, feedback flows between levels, and optimization happens where it is most effective. AI fits naturally into that system: it improves decisions, accelerates iteration, and helps scale work that still depends too heavily on expert humans. In this framework, AI doesn't replace the engineer; it automates the expertise that currently limits their scale.
Tightly integrating AI into a modern compiler stack is our product strategy because it is the only way to scale kernel development. Hardware vendors and their customers do not need isolated miracles; they need a workflow that can keep up with the pace of new models.
This structured approach offers a benefit that compounds across the industry: portability across a diverse landscape of AI hardware. By guiding optimization through these compiler layers, AI helps every new chip support the next generation of models far sooner.
We believe this is the architecture that turns kernel development from a manual bottleneck into a scalable engine for our entire industry.
