POWERED BY TRITON

Run any model on any hardware

Every AI accelerator needs a performant kernel layer. The Kernelize Platform helps build and optimize that layer so new model bring-up takes days, not months, while staying connected to the PyTorch and Triton ecosystem.

Partnered with and trusted by the teams at

Tenstorrent

BENEFITS

Run the latest models on your chips

Use the Kernelize Platform to rapidly update, validate, and optimize the kernel layer whenever new models are released.

New model bring-up

Bring up new models faster by identifying missing kernels, unsupported operators, and performance gaps.

Guided kernel refinement

Generate and validate kernels before they are promoted into production, giving a faster path to explore scheduling, fusion, memory movement, and device-specific execution.

Stable software stack

Keep models, frameworks, and serving infrastructure stable while chip-specific model support improves underneath.

HOW IT WORKS

Every AI compute stack has a kernel layer

The Kernelize Platform uses Triton to rapidly bring up, test, and explore the kernel layer, helping identify missing operators, guide AI-assisted kernel generation, and decide what should stay in Triton versus be rewritten using a native kernel language.
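
As a rough illustration of what identifying missing operators can look like, the sketch below compares the operators a model uses against the kernels a target chip already provides. The operator names, the `supported_ops` set, and the `find_kernel_gaps` helper are all hypothetical; none of them is part of the Kernelize Platform API.

```python
from collections import Counter


def find_kernel_gaps(model_ops, supported_ops):
    """Return (operator, call count) pairs for operators the backend lacks."""
    counts = Counter(model_ops)
    missing = {op: n for op, n in counts.items() if op not in supported_ops}
    # Most frequently used gaps first, ties broken by name,
    # so bring-up work can be prioritized.
    return sorted(missing.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical operator trace from one forward pass of a model.
model_ops = ["matmul", "softmax", "matmul", "rms_norm", "rotary_embedding", "rms_norm"]

# Hypothetical set of kernels already available on the target chip.
supported_ops = {"matmul", "softmax"}

gaps = find_kernel_gaps(model_ops, supported_ops)
print(gaps)
```

Ranking gaps by call count is only one possible heuristic; measured runtime per operator would be a natural alternative.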

Diagram: Matmul, SoftMax, Matmul

SoftMax has been replaced with a Triton-generated operator to improve performance.

The Triton compiler can perform complex operator fusion to further improve performance.

The Triton language allows complex kernels to be written by hand that are portable to any device supported by the Triton compiler.
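
To make the SoftMax step concrete, here is a plain-Python reference of the computation such an operator performs. It is a minimal sketch for checking results, not the Triton-generated operator itself, and the function names are chosen for illustration. The four steps marked in the comments are the kind of work a fused kernel could carry out on one row without writing intermediate results back to memory.

```python
import math


def softmax_row(row):
    """Numerically stable softmax over one row of values."""
    row_max = max(row)                            # 1. reduction: row maximum
    exps = [math.exp(x - row_max) for x in row]   # 2. elementwise: shifted exponent
    total = sum(exps)                             # 3. reduction: normalizer
    return [e / total for e in exps]              # 4. elementwise: divide


def softmax(matrix):
    """Apply softmax independently to each row."""
    return [softmax_row(row) for row in matrix]


# The second row would overflow math.exp without the max subtraction.
probs = softmax([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
print(probs)
```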

WHY KERNELIZE & TRITON

An open foundation for hardware-specific optimization

The Kernelize Platform extends the open PyTorch AI software stack with the chip-specific tooling needed to make real hardware performant. It gives a common path to model support on any hardware, with optimization at the kernel layer.

EXAMPLE: MATRIX MULTIPLICATION

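import triton
import triton.language as tl

@triton.jit
def matmul_kernel(
    a_desc, b_desc, c_desc,  # tensor descriptors for A (M x K), B (K x N), C (M x N)
    M, N, K,
    BLOCK_SIZE_K: tl.constexpr,  # K step; expected to match the descriptors' K block extent
):
    # Each program instance computes one output tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    # Top-left coordinates of this tile; tile sizes come from the descriptors.
    m0 = pid_m * a_desc.block_shape[0]
    n0 = pid_n * b_desc.block_shape[1]

    # Accumulate in float32 regardless of the input dtype.
    acc = tl.zeros(
        (a_desc.block_shape[0], b_desc.block_shape[1]),
        dtype=tl.float32
    )

    # Walk the K dimension one block at a time.
    for k0 in range(0, K, BLOCK_SIZE_K):
        a = a_desc.load([m0, k0])
        b = b_desc.load([k0, n0])
        acc += tl.dot(a, b)

    # Write the finished tile back to C.
    c_desc.store([m0, n0], acc)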

Complete Triton kernel example showing matrix multiplication with tensor descriptor arguments. The tensor descriptor object makes this Triton code portable to a wide range of hardware.
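
For readers who want to check the tiling logic without an accelerator, the following plain-Python sketch mirrors the loop structure of the kernel above: one pass per output tile, with a march along K in fixed-size blocks. It is a reference for understanding and testing only. The `tiled_matmul` name and its block-size parameters are chosen for illustration, and it writes into `c` directly instead of using a separate accumulator tile.

```python
def tiled_matmul(a, b, block_m, block_n, block_k):
    """Blocked matrix multiply over nested lists, mirroring the kernel's tiling."""
    M, K, N = len(a), len(b), len(b[0])
    c = [[0.0] * N for _ in range(M)]
    # Each (m0, n0) pair plays the role of one program instance (pid_m, pid_n).
    for m0 in range(0, M, block_m):
        for n0 in range(0, N, block_n):
            # Step along K one block at a time, accumulating partial products.
            for k0 in range(0, K, block_k):
                # min() clamps the tile at the matrix edge.
                for i in range(m0, min(m0 + block_m, M)):
                    for j in range(n0, min(n0 + block_n, N)):
                        for k in range(k0, min(k0 + block_k, K)):
                            c[i][j] += a[i][k] * b[k][j]
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]      # 2 x 3
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]  # 3 x 2
c = tiled_matmul(a, b, block_m=2, block_n=1, block_k=2)
print(c)
```

The block sizes here deliberately do not divide the matrix dimensions evenly, so the edge clamping is exercised.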

The Kernelize Platform builds on extensions to PyTorch, Triton, and vLLM as the open foundation for portable model support. It pushes chip-specific optimization into the kernel layer, while preserving standard APIs for the higher layers of the AI software stack.


That lets chip companies support new models through the PyTorch ecosystem instead of building a separate stack for every accelerator, while still exposing the hardware-specific capabilities needed for real performance.
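
The layering idea can be sketched in a few lines of plain Python. Everything below is hypothetical: the registry, the `register_kernel` and `resolve` helpers, and the device names are invented for illustration and do not represent the PyTorch, Triton, or vLLM extension mechanisms. The point is only that model code names an operator, while the choice of kernel happens underneath.

```python
_KERNELS = {}  # maps (operator name, device name) -> kernel function


def register_kernel(op, device):
    """Decorator that records a kernel for one operator on one device."""
    def decorator(fn):
        _KERNELS[(op, device)] = fn
        return fn
    return decorator


def resolve(op, device):
    """Pick the device-specific kernel, else fall back to the portable one."""
    return _KERNELS.get((op, device), _KERNELS[(op, "portable")])


@register_kernel("matmul", "portable")
def matmul_portable(a, b):
    # Straightforward row-by-column product over nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


@register_kernel("matmul", "my_chip")
def matmul_my_chip(a, b):
    # Stand-in for a chip-tuned kernel; a real one would differ in tiling
    # and memory movement, not in its result.
    return matmul_portable(a, b)


def model_forward(x, w, device):
    # Model code names the operator only; it never references a specific chip.
    return resolve("matmul", device)(x, w)


print(model_forward([[1, 2]], [[3], [4]], "my_chip"))
```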

PyTorch-native model support

Bring new hardware into the software stack model developers already use.

Extension-based architecture

Add chip support through PyTorch, Triton, and vLLM extension points instead of maintaining a separate stack.

Stable higher-level APIs

Keep model, serving, and deployment code consistent across accelerators.

Triton kernel foundation

Use Triton for kernel bring-up, exploration, validation, and optimization.

Hardware-specific execution

Expose each device’s compiler, memory, runtime, and execution strategies where they matter.

Kernel layer refinement

Generate and maintain the reusable kernel layer each chip needs for new models.

COMPARISON

New model support without starting over

New AI models introduce new operators, kernel patterns, and performance bottlenecks. Kernelize gives a faster path from model release to production-ready support, with kernel-level optimization built into the bring-up process.

Before Kernelize

New models require custom kernel and compiler work with no functional reference

Kernel gaps surface late in customer evaluations

Scarce experts fight model-by-model issues by hand

With Kernelize

Rapid support for new model releases

Earlier detection of missing and underperforming kernels

Guided kernel generation and refinement

Kernel-level optimization beneath a stable AI software stack

A path from model coverage to workload-specific tuning

Get Started

Talk to the Kernelize team

Talk to Kernelize about refining the kernel layer your chip needs to run more models, improve performance, and support production inference workloads.

Kernelize

Copyright Kernelize 2025. All rights reserved.