POWERED BY TRITON

Our compiler brings Triton to your chips

New chips make low-cost AI inference possible. Kernelize makes it practical.

Portable compute built on Triton, a stable kernel language with chip-specific optimization underneath.

BENEFITS

Built for Efficient, Portable Inference

Built on Triton and open, industry-standard infrastructure, aligned with the tools and ecosystems developers already use.

Open, standards-aligned

Designed to integrate cleanly with existing frameworks without forks or proprietary APIs.

Stable interface

Uses a stable compiler language to separate model evolution from hardware-specific optimization.

Flexible optimization

Adapts optimization strategies to each device and workload without changing models or higher-level software.

HOW IT WORKS

How Triton works

Triton expresses performance-critical kernels using a stable, high-level language that is compiled into each device’s existing internal representations.

Matmul → SoftMax → Matmul

In this operator graph, SoftMax has been replaced with a Triton-generated operator to improve performance. The Triton compiler can perform complex operator fusion to further improve performance, and the Triton language allows complex kernels to be written by hand that are portable to any device supported by the Triton compiler.

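As a minimal sketch of such a hand-written kernel, the fused softmax below combines max-subtraction, exponentiation, and normalization in a single Triton program per row. The names (softmax_kernel, BLOCK_SIZE) are illustrative, and the kernel assumes contiguous rows with BLOCK_SIZE a power of two no smaller than n_cols; it is not Kernelize's production code.

import triton
import triton.language as tl

@triton.jit
def softmax_kernel(x_ptr, y_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    # One program instance normalizes one row of the input matrix.
    row = tl.program_id(0)
    offsets = tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_cols
    x = tl.load(x_ptr + row * n_cols + offsets, mask=mask, other=-float("inf"))
    # Max-subtraction, exp, and normalization fused into one kernel,
    # so intermediates never round-trip through global memory.
    x = x - tl.max(x, axis=0)
    num = tl.exp(x)
    tl.store(y_ptr + row * n_cols + offsets, num / tl.sum(num, axis=0), mask=mask)

Because the kernel is written in Triton rather than a device ISA, the same source can be compiled for any backend the Triton compiler supports.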

WHY KERNELIZE & TRITON

Optimization becomes portable

The Triton language is compiled into device-specific code, making efficient inference possible across different hardware.

EXAMPLE: MATRIX MULTIPLICATION

import triton
import triton.language as tl

@triton.jit
def matmul_kernel(
    a_desc, b_desc, c_desc,  # tensor descriptors for A, B, and C
    M, N, K,
    BLOCK_SIZE_K: tl.constexpr,
):
    # Each program instance computes one output tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    # Top-left coordinates of this program's tile.
    m0 = pid_m * a_desc.block_shape[0]
    n0 = pid_n * b_desc.block_shape[1]

    # Accumulate in float32 for numerical stability.
    acc = tl.zeros(
        (a_desc.block_shape[0], b_desc.block_shape[1]),
        dtype=tl.float32,
    )

    # March along the K dimension one block at a time.
    for k0 in range(0, K, BLOCK_SIZE_K):
        a = a_desc.load([m0, k0])
        b = b_desc.load([k0, n0])
        acc += tl.dot(a, b)

    c_desc.store([m0, n0], acc)

Complete Triton kernel example showing matrix multiplication with tensor descriptor arguments. The tensor descriptor object makes this Triton code portable to a wide range of hardware.
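To make the example concrete, here is a minimal host-side launch sketch, assuming a recent Triton release that ships triton.tools.tensor_descriptor and a CUDA-capable device. The sizes and block shapes are illustrative placeholders, not tuned values.

import torch
import triton
from triton.tools.tensor_descriptor import TensorDescriptor

M, N, K = 1024, 1024, 1024
BLOCK_M, BLOCK_N, BLOCK_K = 128, 128, 64  # illustrative, untuned tile sizes

a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)
c = torch.empty(M, N, device="cuda", dtype=torch.float32)

# Descriptors carry the shape, strides, and block shape of each tensor,
# keeping device-specific addressing details out of the kernel itself.
a_desc = TensorDescriptor.from_tensor(a, [BLOCK_M, BLOCK_K])
b_desc = TensorDescriptor.from_tensor(b, [BLOCK_K, BLOCK_N])
c_desc = TensorDescriptor.from_tensor(c, [BLOCK_M, BLOCK_N])

# One program per output tile of C.
grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
matmul_kernel[grid](a_desc, b_desc, c_desc, M, N, K, BLOCK_SIZE_K=BLOCK_K)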


The Kernelize platform builds on Triton, using Triton Extensions to simplify hardware support while ensuring kernels remain portable across devices.

Stable kernel language

Triton provides a stable compiler language for expressing performance-critical kernels as hardware and models evolve.

Hardware-aware compilation

Triton kernels are compiled into device-specific code, allowing each architecture to apply its own optimization strategies.

Default in PyTorch

Triton is already in PyTorch compilation paths, simplifying integration with future releases.
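For illustration, compiling a model with torch.compile routes GPU execution through PyTorch's default Inductor backend, which emits Triton kernels. The model and shapes below are arbitrary placeholders.

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.Softmax(dim=-1),
).cuda()

# The default Inductor backend generates Triton kernels for GPU graphs,
# so Triton-capable hardware picks up this path with no model changes.
compiled = torch.compile(model)
out = compiled(torch.randn(8, 512, device="cuda"))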

Heterogeneous clusters

The Kernelize platform standardizes how Triton support is added and maintained across mixed hardware in production.

Portable kernels

Kernelize provides structure and tooling to keep kernels portable across devices as hardware capabilities evolve.

Upstream version alignment

Kernelize aligns releases with official Triton and PyTorch versions to ensure compatibility as compilers and frameworks evolve.

COMPARISON

AI Inference, Simplified

A unified approach to running inference across diverse hardware.


Kernelize uses Triton Extensions to define each chip-specific optimization strategy, while higher-level software remains unchanged.

Before Kernelize

Inference software is tightly coupled to one chip

New hardware requires one-off kernels and optimizations

Heterogeneous clusters fragment workflows and tooling

With Kernelize

A stable kernel language across chips

Chip-specific optimization isolated in Triton Extensions

One consistent approach across heterogeneous clusters

Higher-level software remains unchanged

Releases aligned with Triton and PyTorch

Get Started

Talk to the Kernelize team

Explore how the Kernelize platform builds on Triton to support efficient, portable inference across heterogeneous clusters.

Kernelize

Copyright © 2025 Kernelize. All rights reserved.