Kernelize

Run any model on any chip

Run any model on any hardware

Kernelize builds the inference intelligence layer that makes every chip ready for any model.

In partnership with and trusted by the teams at

Tenstorrent

BENEFITS

Run the latest models on your chips

A repeatable path to bring up new models, meet production requirements, and move more evaluations toward deployment.

Faster model bring-up

Identify and fix missing kernels, unsupported operators, and performance gaps.

Meet production requirements

Break issues into manageable tasks that can be fixed and tracked against customer requirements.

Scale customer evaluations

Move evaluations forward without pulling scarce compiler and kernel experts into every step.
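As a rough illustration of what identifying missing kernels and unsupported operators can look like, here is a minimal sketch in plain Python. The operator names, the `find_unsupported_ops` helper, and the idea of a flat operator list are assumptions made for illustration; they are not Kernelize's actual tooling or data model.

```python
from collections import Counter

def find_unsupported_ops(model_ops, supported_ops):
    """Return (op, use_count) pairs for ops the target has no kernel for."""
    counts = Counter(model_ops)
    missing = {op: n for op, n in counts.items() if op not in supported_ops}
    # Most frequently used gaps first; ties are broken alphabetically.
    return sorted(missing.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical operator trace from a model and a hypothetical backend kernel list.
model_ops = ["matmul", "softmax", "matmul", "rms_norm", "rotary_embedding", "rms_norm"]
supported = {"matmul", "softmax"}

gaps = find_unsupported_ops(model_ops, supported)
for op, uses in gaps:
    print(f"{op}: used {uses}x, no kernel on target")
```

Sorting by use count is one possible way to turn the gap list into prioritized tasks; a real flow would likely also weigh measured performance.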

PLATFORM FOUNDATION

Model to chip portability

Kernelize uses Triton as a portable kernel layer, giving teams an executable starting point even for chips that do not directly support Triton.

Diagram: Matmul, SoftMax, Matmul

SoftMax has been replaced with a Triton-generated operator to improve performance.

The Triton compiler can perform complex operator fusion to further improve performance.

The Triton language allows complex kernels to be written by hand that are portable to any device supported by the Triton compiler.
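To make the idea of operator fusion concrete, here is a small plain-Python sketch, not Triton code and not what the Triton compiler emits. It compares a scale step followed by a softmax against a single fused function that never stores the scaled row as a separate intermediate. The function names and the choice of a scale-plus-softmax pair are assumptions for illustration.

```python
import math

def scale(row, factor):
    # Unfused step 1: materializes a full intermediate row.
    return [x * factor for x in row]

def softmax(row):
    # Unfused step 2: numerically stable softmax (subtract the row max first).
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def fused_scale_softmax(row, factor):
    # Fused version: scaling happens inline, so the scaled row
    # is never stored as its own list.
    m = max(x * factor for x in row)
    exps = [math.exp(x * factor - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

row = [1.0, 2.0, 3.0, 4.0]
unfused = softmax(scale(row, 0.5))
fused = fused_scale_softmax(row, 0.5)
```

On an accelerator, the benefit of fusing like this typically comes from avoiding a round trip to memory for the intermediate result.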

WHY KERNELIZE & TRITON

An open foundation for hardware-specific optimization

Give every team a shared path to support, validate, and optimize new models without locking the platform to a single kernel language or compiler path.

EXAMPLE: MATRIX MULTIPLICATION

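import triton
import triton.language as tl

@triton.jit
def matmul_kernel(
    a_desc, b_desc, c_desc,
    M, N, K,
    BLOCK_SIZE_K: tl.constexpr,
):
    # Each program instance computes one output tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    # Tile origin, derived from the block shapes carried by the descriptors.
    m0 = pid_m * a_desc.block_shape[0]
    n0 = pid_n * b_desc.block_shape[1]

    # Accumulate in float32.
    acc = tl.zeros(
        (a_desc.block_shape[0], b_desc.block_shape[1]),
        dtype=tl.float32
    )

    # Walk the K dimension one block at a time.
    for k0 in range(0, K, BLOCK_SIZE_K):
        a = a_desc.load([m0, k0])
        b = b_desc.load([k0, n0])
        acc += tl.dot(a, b)

    # Write the finished tile back through the output descriptor.
    c_desc.store([m0, n0], acc)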

Complete Triton kernel example showing matrix multiplication with tensor descriptor arguments. The tensor descriptor object makes this Triton code portable to a wide range of hardware.
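For readers without Triton at hand, the tiling in the kernel above can be followed with a plain-Python emulation. This is only a sketch under stated assumptions: descriptors are replaced by list slicing in a hypothetical `load_block` helper, the two outer loops stand in for the launch grid, and the matrix dimensions are assumed to be exact multiples of the block sizes.

```python
def load_block(mat, r0, c0, rows, cols):
    # Stand-in for a descriptor load: slice a rows x cols tile starting at (r0, c0).
    return [row[c0:c0 + cols] for row in mat[r0:r0 + rows]]

def matmul_tiled(a, b, block_m, block_n, block_k):
    M, K, N = len(a), len(a[0]), len(b[0])
    c = [[0.0] * N for _ in range(M)]
    # The two outer loops play the role of the 2D launch grid (pid_m, pid_n).
    for pid_m in range(M // block_m):
        for pid_n in range(N // block_n):
            m0, n0 = pid_m * block_m, pid_n * block_n
            acc = [[0.0] * block_n for _ in range(block_m)]
            # Walk K one block at a time, accumulating partial products.
            for k0 in range(0, K, block_k):
                a_blk = load_block(a, m0, k0, block_m, block_k)
                b_blk = load_block(b, k0, n0, block_k, block_n)
                for i in range(block_m):
                    for j in range(block_n):
                        acc[i][j] += sum(
                            a_blk[i][k] * b_blk[k][j] for k in range(block_k)
                        )
            # Stand-in for the descriptor store of the finished tile.
            for i in range(block_m):
                c[m0 + i][n0:n0 + block_n] = acc[i]
    return c

a = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
b = [[2.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
c = matmul_tiled(a, b, block_m=2, block_n=2, block_k=2)
```

Here `b` is twice the identity matrix, so `c` should come out as `a` with every entry doubled, which makes the result easy to check by eye.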

Starts from the model

Uses the model to identify the compiler, kernel, and runtime work needed for your hardware.

Explores with portable kernels

Triton provides an executable starting point, even when production execution uses a native kernel language.

Targets production software

Connects to chip-specific compilers, runtimes, memory systems, and kernel implementations.

Makes expert work repeatable

Breaks full-model bring-up into guided tasks that customer engineering, go-to-market (GTM), and compiler teams move through together.

Validates before deployment

Tests fixes in isolation, reintegrates into the model, and checks against the requirements needed for production inference.

Leverages open source

Builds on the open PyTorch, Triton, and vLLM stack while supporting the proprietary paths required to make real chips performant.

Generates and maintains the reusable kernel layer each chip needs for new models.
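Testing a fix in isolation, as described under "Validates before deployment", usually means comparing a candidate kernel against a trusted reference within a tolerance. The sketch below shows that pattern in plain Python. The GELU functions, the `validate` helper, and the tolerance values are illustrative assumptions, with the tanh approximation standing in for a chip-specific implementation.

```python
import math

def reference_gelu(x):
    # Exact GELU, used here as the trusted reference.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def candidate_gelu(x):
    # Tanh approximation of GELU, standing in for a chip-specific kernel.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def validate(candidate, reference, inputs, atol):
    """Return (passed, worst_abs_error) over the given inputs."""
    worst = max(abs(candidate(x) - reference(x)) for x in inputs)
    return worst <= atol, worst

# Sample inputs from -5.0 to 5.0 in steps of 0.1.
inputs = [i / 10 for i in range(-50, 51)]
passed, worst = validate(candidate_gelu, reference_gelu, inputs, atol=1e-2)
print(f"passed={passed}, worst absolute error={worst:.2e}")
```

The tolerance is the requirement being checked: the same candidate that passes at a loose `atol` can fail at a strict one, so the threshold should come from the production requirement rather than from the kernel.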

COMPARISON

Move from one-off evaluation work to production deployment

Replace reactive model support with a structured path from first bring-up to production deployment.

Before Kernelize

Reactive model support

Bottlenecks around scarce experts

Unclear path to production

With Kernelize

Repeatable model bring-up

Faster customer evaluations

Actionable tasks without involving experts

Targeted expert involvement

Production-ready validation

Get Started

Talk to the Kernelize team

Learn how to bring the latest models to your chips with a repeatable path from initial bring-up to production deployment.
