Characterizing Real-World Bugs in Tile Programs for Automated Bug Detection [ISSTA 2026]

Abstract

Tile-based programming frameworks are increasingly adopted to write high-performance GPU kernels in domains such as deep learning and scientific computing. While these frameworks enhance productivity and hardware utilization, their multi-stage compilation pipelines introduce distinct code generation bugs that are tightly coupled to input shapes, data types, and backend targets. These bugs often manifest as silent correctness or performance issues, making them difficult to detect with existing compiler testing tools. Additionally, the unique programming conventions of tile domain-specific languages (DSLs) complicate root-cause identification, and fixing such bugs demands specialized knowledge of tile abstractions and compilation pipelines. Despite the growing adoption of tile-based systems, their code generation bugs remain largely unexplored.
This paper presents the first systematic study of tile-program code generation bugs. We analyze 301 real-world bug reports from GitHub and categorize their root causes, their symptoms, the input patterns and test oracles that expose them, and the strategies used to fix them. Our study provides foundational insights for building debugging, testing, and repair tools tailored to tile-based compiler infrastructures.

Bug Dataset

Bug Causes Showcase

We identified 301 codegen bugs and classified them into six primary categories. Below we present each category with representative real-world examples, including code from the actual GitHub issues, symptom descriptions, and root cause analysis.

4.1 Control Flow and Scheduling Bugs

5.31% — 16 bugs
These bugs arise when the compiler derives control logic, such as boundary guards, predicate masks, and synchronization placement, from high-level tile abstractions and generates incorrect control behavior during lowering. They typically manifest only under specific tile shapes, boundary configurations, or warp layouts.

4.1.1 Branch Predication Bugs

Description

Branch predication bugs arise when control flow expressed over logical tiles is incorrectly lowered to operate over physical threads. Incorrect lowering of predicates can cause valid tile instances to be skipped or incorrectly masked.

Example: Triton #5265
# Simplified from the actual issue #5265 reproducer
import triton
import triton.language as tl

@triton.jit
def repro_kernel(q_ref, k_ref, v_ref, output_ptr):
    offsets64  = tl.arange(0, 64)
    offsets128 = tl.arange(0, 128)

    q  = tl.load(q_ref + (offsets64[:, None] * 128 + offsets128[None, :]))   # [64, 128]
    k  = tl.load(k_ref + (offsets128[:, None] * 64 + offsets64[None, :]))    # [128, 64]
    qk = tl.dot(q, k).to(tl.bfloat16)                                       # [64, 64]

    v  = tl.load(v_ref + (offsets64[:, None] * 128 + offsets128[None, :]))   # [64, 128]
    o  = tl.dot(qk, v)                                                       # [64, 128]

    tl.store(output_ptr + (offsets64[:, None] * 128 + offsets128[None, :]), o.to(tl.bfloat16))

# ✅ num_warps=4 — compiles and executes correctly
# ❌ num_warps=8 — triggers assertion failure in LinearLayout.cpp::reshapeOuts(...)
repro_kernel[(1, 1)](q, k, v, output, num_warps=8, num_ctas=1, num_stages=3)
Root Cause

The failure is a warp-configuration-sensitive regression in Triton's Linear Layout pass. The commit that introduced stmatrix support through linear layouts (49266aa) contains assumptions about tile-to-warp mapping that hold for num_warps=4 but break for num_warps=8. Because the Linear Layout pass is responsible for deriving how logical tiles are distributed across physical warps and threads, including the predication and masking decisions that depend on this mapping, the invalid layout causes the compiler to fail before correct control flow can be generated.

4.1.2 Instruction Scheduling Bugs

Description

Instruction scheduling bugs arise when backend reordering violates compiler-synthesized tile pipeline invariants, rather than explicit source-level data dependencies.

Example: Triton #6750
# Schematic illustration (not actual code from the issue)

# AMD backend's createLocalPrefetchSchedule() assumes this IR order:
# ┌──────────────────────────────────────────────────┐
# │  local_stores   ← write prev-iteration data to LDS│
# │  global_loads   ← prefetch next tile from GMEM    │
# │  compute        ← process current tile            │
# │  local_loads    ← read prefetched data from LDS   │
# └──────────────────────────────────────────────────┘

# ❌ After ReorderInstructionsPass::scheduleGlobalLoadLocalStore():
# ┌──────────────────────────────────────────────────┐
# │  global_loads   ← moved ahead of local_stores!    │
# │  local_stores   ← moved behind global_loads!      │
# │  compute / local_loads                            │
# └──────────────────────────────────────────────────┘
# Additionally, sync/barrier before local_stores causes
# sched.group masks to no longer work as intended.
Root Cause

The reordering pass and the prefetch scheduling pass operate under conflicting assumptions about IR ordering. The reorder pass rearranges instructions without preserving the ordering invariant that createLocalPrefetchSchedule depends on.
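
The ordering invariant in the schematic above can be stated as an executable check. The following is a minimal sketch in plain Python, not code from the AMD backend: the operation labels are taken from the schematic, and `first_index` and `satisfies_prefetch_order` are hypothetical helper names introduced for illustration.

```python
def first_index(ops, kind):
    """Return the index of the first op of the given kind, or None if absent."""
    for i, op in enumerate(ops):
        if op == kind:
            return i
    return None


def satisfies_prefetch_order(ops):
    """Check the assumed order: local_store, global_load, compute, local_load."""
    expected = ["local_store", "global_load", "compute", "local_load"]
    positions = [first_index(ops, kind) for kind in expected]
    if any(p is None for p in positions):
        return False
    return positions == sorted(positions)


# Order assumed by the prefetch scheduler (labels from the schematic).
before = ["local_store", "global_load", "compute", "local_load"]
# Order after the reordering pass has hoisted the global load.
after = ["global_load", "local_store", "compute", "local_load"]

print(satisfies_prefetch_order(before))  # prints: True
print(satisfies_prefetch_order(after))   # prints: False
```

A check of this kind, run between the two passes, would turn the silent assumption into an explicit precondition failure.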

4.1.3 Warp Control Bugs

Description

Warp control bugs arise from incorrect synthesis or propagation of warp-level execution metadata across compiler stages.

Example: Triton #2658
# Schematic illustration (specific values are illustrative)

# TritonGPU IR invariant (as described in the issue):
# For any tensor with a #blocked layout where warpsPerCta = [x, y]:
#     x * y  MUST equal  module's triton_gpu.num-warps

# WSMaterialization pass modifies triton_gpu.num-warps
# but does NOT modify tensor layouts within the module.
#
# → warpsPerCta fields in tensor layouts become stale
# → The invariant (x * y == num-warps) is broken
# → The resulting IR is invalid
#
# Note: the issue author points out that simply updating warpsPerCta
# may NOT be the correct fix—a new layout type may be needed.
Root Cause

WSMaterialization performs a partial update, modifying num-warps without synchronizing warpsPerCta fields. This breaks a layout invariant that was previously unchecked by the IR verifier. The issue author suggests the representation may need to be extended to express warp specialization groups.
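
The invariant described in the issue is simple enough to check mechanically. The sketch below is plain Python rather than Triton's IR verifier; `BlockedLayout`, `Module`, and `check_warp_invariant` are hypothetical names used only to illustrate how a partial update leaves layouts stale.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class BlockedLayout:
    """Hypothetical stand-in for a #blocked layout attribute."""
    warps_per_cta: tuple


@dataclass
class Module:
    """Hypothetical stand-in for a TritonGPU module."""
    num_warps: int
    layouts: list = field(default_factory=list)


def check_warp_invariant(module):
    """Return the layouts whose warpsPerCta product differs from num-warps."""
    return [
        layout for layout in module.layouts
        if prod(layout.warps_per_cta) != module.num_warps
    ]


# Before the pass: 2 * 2 == 4, so the invariant holds.
module = Module(num_warps=4, layouts=[BlockedLayout((2, 2))])
violations_before = check_warp_invariant(module)

# Simulated partial update: num-warps changes, layouts are left untouched.
module.num_warps = 8
violations_after = check_warp_invariant(module)

print(len(violations_before), len(violations_after))  # prints: 0 1
```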

4.2 IR Construction and Transformation Bugs

16.28% — 49 bugs
These bugs arise when the compiler incorrectly constructs or transforms tile-aware IRs, violating semantic invariants that govern how logical tiles map to physical memory and execution resources.

4.2.1 IR Construction Bugs

Description

IR construction bugs arise during front-end lowering where non-determinism or inconsistency in the creation of tile-level IR entities produces semantically unstable IRs.

Example: TileLang #313
# TileLang FlashAttention backward kernel (func.script() dump)
# T.handle ordering before T.block_attr appears to be random

# ─── Compilation Run 1 ───
dv_shared = T.handle("float16", "shared.dyn")
K_shared  = T.handle("float16", "shared.dyn")
dk_shared = T.handle("float16", "shared.dyn")
T.block_attr({"layout_map": {
    dQ.data:   metadata["tl.Layout"][0],
    dv_shared: metadata["tl.Layout"][1],
    K_shared:  metadata["tl.Layout"][2],
    dk_shared: metadata["tl.Layout"][3]}})

# ─── Compilation Run 2 (same source code!) ───
K_shared  = T.handle("float16", "shared.dyn")   # ← different order!
dv_shared = T.handle("float16", "shared.dyn")   # ← different order!
dk_shared = T.handle("float16", "shared.dyn")
T.block_attr({"layout_map": {
    K_shared:  metadata["tl.Layout"][0],         # ← bindings shifted!
    dQ.data:   metadata["tl.Layout"][1],
    dv_shared: metadata["tl.Layout"][2],
    dk_shared: metadata["tl.Layout"][3]}})

# → tilelang.compile fails to load cached kernel
# → logically identical kernels treated as distinct artifacts
Root Cause

Non-deterministic creation order of shared-memory tile handles causes layout_map bindings to differ across compilations, defeating kernel caching. The compiler lacks a canonical ordering step for tile handle construction.
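
To make the effect on caching concrete, the sketch below hashes the handle-to-slot binding of two compilation runs, with and without a canonical ordering step. It is a simplified model under stated assumptions: sorting handles by name is only one possible canonicalization, and `layout_map_key` is a hypothetical helper, not part of TileLang.

```python
import hashlib


def layout_map_key(handles, canonical):
    """Hash the handle-to-layout-slot binding of one compilation run.

    handles: handle names in the order the compiler happened to create them.
    canonical: if True, sort the names before assigning layout slots.
    """
    ordered = sorted(handles) if canonical else list(handles)
    binding = ",".join(f"{name}:{slot}" for slot, name in enumerate(ordered))
    return hashlib.sha256(binding.encode()).hexdigest()


# Binding orders taken from the two layout_map dumps above.
run1 = ["dQ", "dv_shared", "K_shared", "dk_shared"]
run2 = ["K_shared", "dQ", "dv_shared", "dk_shared"]

unstable = layout_map_key(run1, canonical=False) != layout_map_key(run2, canonical=False)
stable = layout_map_key(run1, canonical=True) == layout_map_key(run2, canonical=True)

print(unstable, stable)  # prints: True True
```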

4.2.2 IR Transformation Bugs

Description

IR transformation bugs arise when mid-end optimization passes rewrite tile IRs in ways that invalidate tile-specific execution invariants, such as boundary conditions and masked regions.

Example: Halide #8667
// Minimal reproducer from the issue
#include <Halide.h>
using namespace Halide;

int main() {
    Var x{"x"};
    Func f{"f"}, g{"g"};

    f(x) = select(x == 0, x + 1, undef<int>());
    g(x) = select(x == 0, f(x), -f(1 - x));

    g.trace_stores();

    Buffer<int> output = g.realize({4});
    // Expected: g(0)=1, g(1)=-1, ...
    // Actual output:
    //   Store g.0(1) = -1
    //   g(0) is never computed  ← ❌ BUG
}
Root Cause

The optimizer exploits undef values to prune a branch of select, inadvertently eliminating the x == 0 case where the computation is well-defined. This causes the compiler to silently reduce the effective iteration domain, skipping points that the original tiled logic correctly covers.

4.3 Tile Mapping and Launch Bugs

6.31% — 19 bugs
These bugs arise at the stage where logical tiles are concretized into physical execution parameters, including grid dimensions, block sizes, and thread–tile associations.

4.3.1 Launch Configuration Bugs

Description

Launch configuration bugs arise when the compiler miscomputes launch parameters derived from tile-level abstractions.

Example: PyTorch #141121
# Reproducer from the issue
import torch, functools
from torch._inductor.utils import run_and_get_triton_code
from torch._inductor import config

config.triton.max_tiles = 3
config.triton.prefer_nd_tiling = True

full_size, view_size = (5, 5, 5, 5, 5), (3, 3, 5, 3, 5)

def get_input():
    full = torch.randn(full_size).to("cuda")
    return torch.as_strided(full, view_size, full.stride())

a, b = get_input(), get_input()
opt_fn = torch.compile(functools.partial(torch.add))
code = run_and_get_triton_code(opt_fn, a, b)

# ❌ BackendCompilerFailed / AssertionError
# The LoopBody iteration prefix "z" conflicts with the
# z-dimension range tree prefix (banned in the symbol system).
# No corresponding ZBLOCK is defined.
# → Internal symbol-prefix collision during code generation
Root Cause

The codegen infrastructure assumes iteration prefixes for tile groups won't collide with reserved dimension prefixes. Increasing the tile group count to 3 introduces the z prefix which overlaps with the reserved z-dimension namespace.

4.3.2 Thread-Block Mapping Bugs

Description

Thread–block mapping bugs arise when the compiler emits incorrect logic for mapping logical tiles to physical grid and block indices.

Example: PyTorch #157018
# Reproducer from the issue
import torch
from torch.nn.attention.flex_attention import flex_attention

flex_attention = torch.compile(flex_attention, dynamic=True)

DEVICE = 'cuda:0'
dtype = torch.float32

B, H, L, C = (22720, 3, 64, 32)   # ❌ Error
# B, H, L, C = (20480, 3, 64, 32) # ✅ No Error
# B, H, L, C = (20480, 4, 64, 32) # ❌ Error

q = torch.rand(B, H, L, C, dtype=dtype, device=DEVICE)
k = torch.rand(B, H, L, C, dtype=dtype, device=DEVICE)
v = torch.rand(B, H, L, C, dtype=dtype, device=DEVICE)

attn = flex_attention(q, k, v)
# RuntimeError: Triton Error [CUDA]: invalid argument
# at: self.launch(gridX, gridY, gridZ, ...)
Root Cause

When batch and head dimensions are large enough, the compiler maps them onto grid dimensions that exceed CUDA hardware launch limits. The reproducer's configurations are consistent with the 65,535 cap on the y and z grid axes: B × H is 68,160 and 81,920 in the two failing cases and 61,440 in the passing one. The tile-to-grid mapping logic does not check for overflow when flattening multi-dimensional tile iteration spaces into physical grid dimensions, so the kernel launch fails with an invalid argument error.
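
The kind of legality check that is missing can be sketched in a few lines. The grid limits below are the documented CUDA maxima; the mapping in `flattened_grid` (batch × heads flattened onto the y axis) is an assumption made for illustration and is not taken from the Inductor source.

```python
# CUDA limits on grid dimensions (x axis, then y and z axes).
MAX_GRID_X = 2**31 - 1
MAX_GRID_YZ = 65535


def grid_is_launchable(grid):
    """Return True if every axis of (x, y, z) is within the CUDA limits."""
    x, y, z = grid
    return (1 <= x <= MAX_GRID_X
            and 1 <= y <= MAX_GRID_YZ
            and 1 <= z <= MAX_GRID_YZ)


def flattened_grid(batch, heads, q_blocks):
    """Assumed mapping: query blocks on x, batch * heads flattened onto y."""
    return (q_blocks, batch * heads, 1)


# The three (B, H) configurations from the reproducer.
for batch, heads in [(22720, 3), (20480, 3), (20480, 4)]:
    grid = flattened_grid(batch, heads, q_blocks=1)
    print(batch, heads, grid[1], grid_is_launchable(grid))
# prints:
# 22720 3 68160 False
# 20480 3 61440 True
# 20480 4 81920 False
```

Under this assumed mapping, the check separates the failing and passing configurations exactly as the issue reports.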

4.4 Memory Bugs

19.27% — 58 bugs
These bugs arise when the compiler incorrectly derives how memory behavior is structured across tiles and tile execution stages, including layouts, buffer lifetimes, and synchronization boundaries.

4.4.1 Indexing and Strides Bugs

Description

Indexing and strides bugs arise when the compiler incorrectly synthesizes address computations from logical tile indices.

Example: Triton #443
# Issue: "Segfault in dds_matmul"
# Reproducer triggers _dsd_kernel via blocksparse matmul backward pass:

from triton.ops.blocksparse.matmul import _matmul

# ctx contains pre-built LUT tensors for the sparse layout
# dc is the upstream gradient tensor
_matmul.backward(ctx=ctx, dc=dc)

# Under cuda-memcheck (PYTORCH_NO_CUDA_MEMORY_CACHING=1):
#   "Invalid __global__ read of size 4"
#   "    at 0x00000670 in _dsd_kernel"
#   "    by thread (97,0,0) in block (0,17,0)"
#   "    Address 0x7f1306db0400 is out of bounds"
#
# The segfault is rare but reliably reproduced under cuda-memcheck.
# Adding logging or capturing intermediate state perturbs timing
# enough to make the bug disappear (heisenbug behavior).
Root Cause

The _dsd_kernel (dense-sparse-dense matmul) performs out-of-bounds global memory reads at runtime due to incorrect address computation within tile-local index arithmetic for sparse access patterns. The issue author notes heisenbug behavior: perturbations to timing or memory layout suppress the fault.

4.4.2 Resource Allocation Bugs

Description

Resource allocation bugs arise when the compiler incorrectly infers the lifetime or reuse of registers or shared-memory buffers across tile phases.

Example: TileLang #359
# Simplified from the full deepgemm reproducer in the issue
# Config: thread_num=256, warp specialization enabled, Hopper GPU

for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=num_stages):
    T.copy(A[by * block_M, k * block_K], A_shared)
    T.copy(B[bx * block_N, k * block_K], B_shared)

    # Load scale into shared memory via SIMT copy
    for i in T.Parallel(block_M):
        Scale_C_shared[i] = scales_a[by * block_M + i, k] * scales_b[bx, k]

    T.gemm(A_shared, B_shared, C_local, transpose_B=True, policy=policy)

    for i, j in T.Parallel(block_M, block_N):
        C_local_accum[i, j] += C_local[i, j] * Scale_C_shared[i]

# ❌ Deadlocks on NVIDIA Hopper GPU with 8 warps
# pass_configs: tl.disable_warp_specialized = False
Root Cause

The issue reports a deadlock when running a deepgemm kernel with 8 warps and warp specialization on Hopper. Our analysis attributes this to premature register deallocation (warpgroup_reg_dealloc) during tile-parallel copy: the compiler frees staging registers in the global → register → shared path before the SIMT copy completes, corrupting shared memory values and causing the subsequent warp-synchronized operations to hang.

4.4.3 Ordering and Caching Bugs

Description

Ordering and caching bugs arise from incorrect kernel cache reuse or missing synchronization. A common pattern is the cache's specialization key omitting semantically relevant parameters.

Example: NVIDIA Warp #639
# Issue: "Kernels that use the Cholesky solver might fail
#         when block_dim changes"
#
# cuMathDx needs block_dim at compile time for Cholesky solver.
# Compiled result is stored as an LTO file.

# ─── First launch ───
wp.launch(cholesky_kernel, dim=N, block_dim=128)
# → Compiles LTO with block_dim=128, caches it ✅

# ─── Second launch (different block_dim) ───
wp.launch(cholesky_kernel, dim=N, block_dim=256)
# → Kernel hash does NOT include block_dim
# → Cache returns LTO compiled for block_dim=128
# → ❌ Incorrect behavior: wrong block_dim assumptions
#
# Note: the issue does not include reproducer code;
# the above is a schematic illustration of the reported behavior.
Root Cause

The kernel caching mechanism does not include block_dim in its hash key. When block_dim changes between launches, the cache returns an LTO file compiled with incorrect block dimension assumptions, leading to wrong results or failures.
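
The stale-artifact behavior can be reproduced with a toy cache. This is a sketch of the general pattern only: `kernel_cache_key`, `get_or_compile`, and the string used as a compiled artifact are hypothetical and do not reflect Warp's actual caching code.

```python
import hashlib

cache = {}


def kernel_cache_key(source, block_dim=None):
    """Hash the kernel source, optionally folding block_dim into the key."""
    digest = hashlib.sha256(source.encode())
    if block_dim is not None:
        digest.update(f"|block_dim={block_dim}".encode())
    return digest.hexdigest()


def get_or_compile(source, block_dim, key_includes_block_dim):
    """Return a cached artifact, compiling (here: formatting a string) on a miss."""
    key = kernel_cache_key(source, block_dim if key_includes_block_dim else None)
    if key not in cache:
        cache[key] = f"lto(block_dim={block_dim})"  # stand-in for a compiled LTO file
    return cache[key]


source = "cholesky_kernel source text"

# Key omits block_dim: the second launch receives the first launch's artifact.
get_or_compile(source, 128, key_includes_block_dim=False)
stale = get_or_compile(source, 256, key_includes_block_dim=False)

# Key includes block_dim: each configuration gets its own artifact.
cache.clear()
get_or_compile(source, 128, key_includes_block_dim=True)
fresh = get_or_compile(source, 256, key_includes_block_dim=True)

print(stale)  # prints: lto(block_dim=128)
print(fresh)  # prints: lto(block_dim=256)
```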

4.5 Type and Operator Bugs

48.84% — 147 bugs
The largest category. These bugs arise when the compiler synthesizes arithmetic behavior (i.e., value propagation, type promotion and operator logic) based on tile shapes, masks, and fusion strategies, with subtle mismatches silently corrupting results.

4.5.1 Special-Value Handling Bugs

Description

Special-value handling bugs occur when the compiler mishandles floating-point edge cases (NaN, denormals, signed zeros) during tile-level execution.

Example: NVIDIA Warp #688
# Reproducer from the issue
import numpy as np
import warp as wp

def tile_cholesky(M, L, tilesize):
    @wp.kernel
    def cholesky(M: wp.array(dtype=wp.float32, ndim=2),
                 L: wp.array(dtype=wp.float32, ndim=2)):
        M_tile = wp.tile_load(M, shape=(tilesize, tilesize))
        L_tile = wp.tile_cholesky(M_tile)
        wp.tile_store(L, L_tile)
    wp.launch_tiled(cholesky, dim=(1,), inputs=[M, L], block_dim=32)

dim = 5;  tilesize = 4  # tile 4×4, but source array is 5×5
Mwp = wp.array(np.eye(dim) * 2.0, dtype=wp.float32)
Lwp = wp.zeros((dim, dim), dtype=wp.float32)
tile_cholesky(Mwp, Lwp, tilesize)

err = np.linalg.cholesky(np.eye(dim) * 2.0)[:tilesize, :tilesize] \
    - Lwp.numpy()[:tilesize, :tilesize]
# ❌ Unexpected results including NaNs
Root Cause

wp.tile_load does not correctly handle 2D source arrays with strides incompatible with the tile shape. When the source array is 5×5 but the tile is 4×4, the mismatch in memory layout causes some tile lanes to read uninitialized or out-of-bounds values, producing NaNs. The CHANGELOG entry for the fix reads: "Fix 2D tile load when source array and tile have incompatible strides."
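
A worked example shows how a stride mismatch alone can corrupt the loaded tile. The sketch below is plain Python and illustrates one plausible mechanism (using the tile width as the row stride); the issue and changelog do not state that this is the exact defect in Warp, and `load_tile` is a hypothetical helper.

```python
def load_tile(flat, row_stride, tile):
    """Gather a tile x tile block at the origin of a flat row-major array."""
    return [[flat[r * row_stride + c] for c in range(tile)] for r in range(tile)]


dim, tile = 5, 4
# 5x5 matrix 2*I stored row-major, as in the reproducer.
flat = [2.0 if r == c else 0.0 for r in range(dim) for c in range(dim)]

correct = load_tile(flat, row_stride=dim, tile=tile)  # uses the array's row stride
wrong = load_tile(flat, row_stride=tile, tile=tile)   # assumes stride == tile width

print(correct[1])  # prints: [0.0, 2.0, 0.0, 0.0]
print(wrong[1])    # prints: [0.0, 0.0, 2.0, 0.0]
print(wrong[2])    # prints: [0.0, 0.0, 0.0, 0.0]
```

With the wrong stride the diagonal entries move and the third row becomes all zeros, so the tile is no longer positive definite and a Cholesky factorization of it cannot succeed.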

4.5.2 Data-Type Semantics Bugs

Description

Data-type semantics bugs arise when incorrect type propagation breaks the mathematical structure of tile-level layout and index transformations.

Example: Apache TVM #14112
# TVM MetaSchedule — int8 conv2d workload
# apply_trace generated by print(sch.trace) fails.
# Error stack: TransformLayout → transform_block_layout
#            → IndexMap::NonSurjectiveInverse
#
# These transforms support tiled layout algebra:
#   (n, y, x) → n × 64 + y × 8 + x
# Schedule trace contains index maps using T.int64(...).
# Data-type mismatch with int8 workload breaks these transforms.
# → Compilation failure
Root Cause

Tiled layout transformation passes fail when applied to an int8 workload, with errors in TransformLayout and IndexMap::NonSurjectiveInverse. The precise internal mechanism is not diagnosed in the issue, but the pattern is data-type mismatch breaking tiled layout transforms.
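
For readers unfamiliar with the layout algebra involved, the sketch below implements the index map from the schematic and its inverse in plain Python and shows what a non-surjective map looks like. It illustrates the algebra only, not the data-type failure itself; the 8×8 extents are inferred from the coefficients 64 and 8, and the functions are not TVM's IndexMap API.

```python
def index_map(n, y, x):
    """Forward map from the schematic: (n, y, x) -> n * 64 + y * 8 + x."""
    return n * 64 + y * 8 + x


def inverse_index_map(i):
    """Inverse map, valid when 0 <= y < 8 and 0 <= x < 8."""
    n, rem = divmod(i, 64)
    y, x = divmod(rem, 8)
    return n, y, x


domain = [(n, y, x) for n in range(2) for y in range(8) for x in range(8)]
image = sorted(index_map(*p) for p in domain)

# On full 8x8 tiles the map covers 0..127 and round-trips exactly.
assert image == list(range(128))
assert all(inverse_index_map(index_map(*p)) == p for p in domain)

# With a smaller y extent the image has gaps, so the map is not surjective.
partial = {index_map(n, y, x) for n in range(2) for y in range(6) for x in range(8)}
missing = sorted(set(range(128)) - partial)
print(len(missing))  # prints: 32
```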

4.5.3 Operator Implementation Bugs

Description

Operator implementation bugs arise when the compiler synthesizes incorrect logic for operators under tiling, masking, or fusion.

Example: Triton #1846
# Reproducer from the issue (also Figure 3 in our paper)
import torch
import triton
import triton.language as tl

@triton.jit
def matmul_argmax_kernel(A, B, Out,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_d = tl.arange(0, BLOCK_K)

    a_ptrs = A + offs_m[:, None] * BLOCK_K + offs_d[None, :]
    b_ptrs = B + offs_n[:, None] * BLOCK_K + offs_d[None, :]

    a = tl.load(a_ptrs)
    b = tl.load(b_ptrs)

    dist = tl.dot(a, tl.trans(b))
    assignments = tl.argmax(dist, 1)
    tl.store(Out + offs_m, assignments)

M, N, K = 32, 32, 32
a = torch.rand(M, K, dtype=torch.float16, device="cuda")
b = torch.rand(N, K, dtype=torch.float16, device="cuda")
out = torch.zeros(M, dtype=torch.int32, device="cuda")
matmul_argmax_kernel[(1,)](a, b, out, BLOCK_M=M, BLOCK_N=N, BLOCK_K=K)
# ❌ SIGSEGV during compilation
# gdb: crash in mlir::OperationFolder::tryToFold()
#   within TritonGPURemoveLayoutConversionsPass
Root Cause

A semantically valid tl.dot + tl.argmax composition triggers a segfault during compilation. The gdb trace shows the crash occurs in TritonGPURemoveLayoutConversionsPass when the pattern rewriter attempts to fold operations produced by the argmax reduction lowering. In our paper's analysis, the compiler's synthesized reduction algorithm contains incorrect assumptions about the output layout after tl.dot.

4.6 Device-Specific Bugs

3.98% — 12 bugs
These bugs are caused by incompatibilities between compiler-generated tile kernels and target GPU hardware constraints. The tile program is semantically valid, but the lowered code violates device-specific requirements.

Example: TileLang #101
# Simplified from the full reproducer in the issue
@T.prim_func
def main(A: T.Buffer((M, K), "float16"),
         B: T.Buffer((K, N), "float16"),
         C: T.Buffer((M, N), "float16")):
    with T.Kernel(T.ceildiv(N, 192), T.ceildiv(M, 64), threads=128*2) as (bx, by):
        A_shared = T.alloc_shared((64, 32), "float16")     # block_M=64, block_K=32
        B_shared = T.alloc_shared((32, 192), "float16")    # block_K=32, block_N=192
        C_local  = T.alloc_fragment((64, 192), "float")

        T.clear(C_local)
        for k in T.Pipelined(T.ceildiv(K, 32), num_stages=3):
            T.copy(A[by * 64, k * 32], A_shared)
            T.copy(B[k * 32, bx * 192], B_shared)
            T.gemm(A_shared, B_shared, C_local, policy=T.GemmWarpPolicy.FullCol)
        T.copy(C_local, C[by * 64, bx * 192])

# 2 warpgroups → per-warpgroup MMA: M64, N96, K32
# Reporter states this should be supported (multiple of wgmma.m64n16k16)
#
# ❌ Compilation fails on NVIDIA H100 with CUTLASS/CUTE error:
#    "Static shape_div failure"
#    at cute::shape_div(_96, _64): neither 96 nor 64 divides the other
Root Cause

The per-warpgroup MMA shape (M64, N96, K32) cannot be evenly decomposed into the SM90 MMA atom layout by the CUTE backend. Specifically, shape_div(96, 64) fails because 96 is not divisible by 64. The compiler does not validate tile-to-MMA compatibility before committing to the lowering path, causing the failure to surface as a backend static assertion rather than an informative diagnostic.
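
An early validation step of the kind described could look like the following sketch. The divisibility rule in `shape_div_ok` is assumed to mirror the condition behind the reported static assertion, the atom extent of 64 is read off the error message, and `per_warpgroup_n` is a hypothetical helper rather than TileLang code.

```python
def shape_div_ok(a, b):
    """Assumed legality rule: one extent must evenly divide the other."""
    return a % b == 0 or b % a == 0


def per_warpgroup_n(block_n, num_warpgroups, atom_n=64):
    """Split block_N across warpgroups and reject shapes the backend cannot tile."""
    if block_n % num_warpgroups != 0:
        raise ValueError(
            f"block_N={block_n} does not split across {num_warpgroups} warpgroups")
    n = block_n // num_warpgroups
    if not shape_div_ok(n, atom_n):
        raise ValueError(
            f"per-warpgroup N={n} is incompatible with atom extent {atom_n}")
    return n


print(per_warpgroup_n(256, 2))  # prints: 128
try:
    per_warpgroup_n(192, 2)     # the configuration from the issue
except ValueError as err:
    print(err)  # prints: per-warpgroup N=96 is incompatible with atom extent 64
```

Raising at this point would replace the backend static assertion with a diagnostic phrased in terms of the user's tile configuration.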

Root-Cause Failure Spaces

Tile-based DSL compilers begin from abstractions that leave boundary handling, synchronization, layout, and parallel mapping unresolved, and must synthesize these semantics during lowering, often conditioned on runtime tensor shapes. Observable symptoms may overlap across domains, but the underlying failure mechanisms are materially different in tile compilers because they arise from compiler-synthesized tile semantics rather than solely from preserving already-fixed semantics.

| Root-cause dimension | Traditional compiler (source-explicit; e.g., GCC, LLVM) | DL compiler (spec-fixed; e.g., TVM, XLA) | Parallel kernel (programmer-written; e.g., CUDA, OpenCL) | Tile-based DSL compiler (compiler-synthesized; e.g., Triton, TileLang; this study) |
| --- | --- | --- | --- | --- |
| RC1: Control flow & scheduling | Failures mainly come from misoptimizing existing branches, loops, and dependencies in source-explicit control flow. | Failures mainly come from schedule selection, fusion decisions, and lowering choices over fixed operator-graph semantics. | Failures mainly come from programmer-written branches, divergence guards, and barriers. | The compiler synthesizes predicates, masks, synchronization, and warp control from tile iteration space, thread mapping, and shape specialization. Bugs arise when the compiler infers incorrect mask coverage for partial tiles, reorders instructions across implicit sync points, or propagates inconsistent warp metadata during lowering. Share: 5.31%. Trigger: shape-dependent; often latent under full-tile execution, exposed only near tile boundaries (e.g., sizes 31/32/33 or 63/64/65). |
| RC2: IR construction & transformation | Failures arise from malformed IR or unsafe rewrites over explicit control and dataflow. | Failures arise from graph lowering, fusion, and rewrite mistakes over fixed operator semantics. | This is usually not the dominant failure space; most issues appear directly in source-level kernel logic. | Tile IRs materialize per-tile buffers, layouts, masked regions, and mapping metadata. Bugs arise when the compiler constructs or rewrites these synthesized tile semantics incorrectly (e.g., index-space misalignment after fusion, dropping boundary tiles during predicate simplification, or non-deterministic IR construction defeating kernel caching). Share: 16.28%. Trigger: tile-config-dependent; often activated only when specific tile shapes interact with specific IR pass combinations. |
| RC3: Tile mapping & launch | Not usually a central failure space. | Launch and execution parameters are chosen through schedule search and cost-model evaluation over fixed operators. Failures arise when these selected parameters produce invalid or inefficient execution settings. | Failures come from programmer-specified grid/block dimensions and explicit logical-to-physical mapping mistakes. | The compiler derives launch geometry and logical-tile-to-thread/block mapping from tile abstractions, where the programmer never specifies these. Bugs arise when this synthesized mapping breaks spatial coverage or legality under dynamic or non-divisible shapes. While related scheduling or mapping decisions may exist elsewhere, tile compilers make launch-geometry inference from tile structure a first-class part of lowering, creating a distinct and systematic failure mode. Share: 6.31%. Trigger: shape- and config-dependent; e.g., increasing max_tiles from 2 to 3 exposes launch-namespace conflicts, and dynamic batch dims produce invalid grids. |
| RC4: Memory | Failures mainly come from incorrect addressing, alias analysis, buffer transformation, or memory optimization over source-level memory behavior. | Failures mainly come from buffer planning, layout conversion, operator lowering, or backend interaction. Memory decisions are primarily expressed at the tensor/operator and schedule level. | Failures mainly come from programmer-written indexing, shared-memory use, races, or missing synchronization. | The compiler must map logical tiles onto hardware execution and the memory hierarchy: ownership, shared-memory staging, register/shared/global movement, caching, and layout. Bugs arise when this tile-to-hardware mapping is inconsistent with access patterns, reuse assumptions, or resource constraints. Share: 19.27%. Trigger: layout- and shape-dependent; often exposed near tile boundaries, under shared-memory pressure, or through premature deallocation or incorrect tile-configuration caching. |
| RC5: Type & operator handling | Failures involve generic type lowering or operator miscompilation. | Failures mainly arise at the operator/framework level, such as type promotion, shape/type semantics, or backend dispatch mismatches. | Numeric and operator mistakes are mostly source-authored. | The compiler specializes operators and datatypes for tile shapes, masks, mixed precision, and backend intrinsics. Bugs arise when specialization breaks operator semantics or numeric meaning (e.g., incorrect lowering of fused operators at tile boundaries, missing NaN/Inf propagation under partial-tile masks, or ill-defined index-map algebra). This is the largest category, reflecting aggressive specialization without standard fallback paths. Share: 48.84%. Trigger: dtype- and shape-dependent; mixing float16/bfloat16/int8, extreme values (NaN, Inf, signed zeros), boundary-adjacent shapes. |
| RC6: Device specific | Failures arise from backend target assumptions or unsupported codegen paths. | Failures arise from hardware or library incompatibilities below the operator abstraction boundary. | Failures arise when manually written kernels assume unsupported hardware behavior or resource layouts. | The compiler commits to legality-constrained fragment choices, layouts, and resource usage. A semantically valid tile program can still lower to hardware-incompatible code when compiler assumptions do not match device constraints (e.g., tile dimensions incompatible with MMA atom layouts on a specific GPU generation). Share: 3.98%. Trigger: device- and tile-size-dependent; surface only at runtime on specific GPU architectures or driver/toolchain versions. |

Why the failure spaces are structurally distinct: In the left three columns, the core semantic contract is largely established before compilation: by the source program, by operator specs, or by the programmer's kernel code. The compiler's job is primarily to preserve or realize those semantics. In the tile column, the compiler must additionally synthesize key execution semantics, such as control predicates, memory layout decisions, launch geometry, and operator specialization, from high-level tile abstractions, often conditioned on concrete tensor shapes, extents, or specialization-time configurations. This synthesis creates a failure space that is more shape-sensitive, configuration-sensitive, and backend-coupled than in the other three domains.

Percentages from Table 2 (n = 301 confirmed tile-codegen bugs). Observable symptoms may overlap across domains, but the underlying failure mechanisms are structurally distinct because tile compilers centralize the synthesis of control, layout, and mapping decisions from unresolved tile abstractions under concrete shapes and configurations.

Tile Codegen Bug Detector (Tile-CBDetect)

Overview
Figure 1. Overview of detecting tile codegen bugs in Triton programs

Preliminary Tool: Tile-CBDetect

Overview

Tile-CBDetect is a lightweight, Python-based detection framework for automatically identifying code generation bugs in tile-based GPU programs. The tool integrates automated test input generation with differential testing against PyTorch reference implementations, enabling end-to-end validation of tile kernels under diverse configurations.

The design of Tile-CBDetect is guided by our empirical findings on how bugs manifest and on the importance of input diversity and cross-platform output comparison. It systematically constructs tile kernels and paired PyTorch implementations, then executes both to detect correctness and performance anomalies.

Capabilities and Testing Workflow

Tile-CBDetect automatically handles:

  • Dynamic construction of target tile kernels and matching PyTorch reference code.
  • Automated input generation, compilation, and execution of tile programs.
  • Memory allocation, kernel launch, and result collection.
  • Differential comparison between tile and PyTorch outputs within numerical tolerances.

This workflow supports end-to-end testing of tile code generation correctness, and can be extended to test multiple backends or compilers.

Supported Operators

The current implementation targets tile programs written in Triton and covers a representative set of commonly used operators:

  • Unary operators: exp, log, sqrt, rsqrt, tanh, sigmoid, abs, neg, relu.
  • Binary operators: add, sub, mul, div, max, min.
  • Reductions: row-wise reduce_sum over the last dimension of 2D tensors.
  • Matrix multiplication: tiled matmul under various matrix and tile configurations.

Operators are tested across float16, bfloat16, and float32 data types. Although the current implementation is Triton-specific, the framework is modular and can be extended to other DSLs such as TileLang, TVM, or Warp.

Input Generation Strategies

Tile-CBDetect supports both deterministic and fuzzed input suites to exercise diverse code paths and stress tiling boundaries.

Deterministic Differential Suite

The deterministic suite uses structured workloads designed to expose edge cases in tensor shapes, tiling boundaries, and type combinations, for example:

  • Unary operators on 1D tensors of length 1, 33, 1024, 10000, and 65537.
  • Binary operators on aligned pairs of length 33, 1024, and 10000.
  • Reductions on 2D tensors with shapes (33, 64), (1024, 128), and (65537, 64).
  • Matrix multiplications on triplets such as (96, 64, 96), (256, 256, 48), (129, 1024, 96) combined with tile configurations like (64, 32, 16) and (128, 64, 64).

Fuzzed and Tile-Aware Inputs

The fuzzer mode extends deterministic suites by:

  • Randomly selecting operator classes and input shapes (e.g., changing lengths to 257).
  • Sampling tiling and data type configurations from the deterministic pool.
  • Using random data (scaled by 0.5) for numerical stability and casting to target dtypes.
  • Optionally seeding per test for reproducible matmul and mixed-type cases.

While current input generation relies on structured suites and tile-aware randomization, integration of advanced techniques (e.g., constraint solving, symbolic execution, grammar-based fuzzing) is left as future work.
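
The tile-aware randomization described above can be sketched as follows. This is an illustrative generator written for this page, not Tile-CBDetect's implementation: `boundary_lengths`, `sample_case`, and the particular tile sizes and operator names in the pools are assumptions.

```python
import random


def boundary_lengths(extent):
    """Lengths just below, at, and just above a tile-aligned extent."""
    return [extent - 1, extent, extent + 1]


def sample_case(rng, ops, tiles, dtypes):
    """Draw one fuzz case with a length adjacent to a multiple of the tile size."""
    tile = rng.choice(tiles)
    multiple = rng.randint(1, 4)
    length = rng.choice(boundary_lengths(tile * multiple))
    return {
        "op": rng.choice(ops),
        "tile": tile,
        "dtype": rng.choice(dtypes),
        "length": length,
    }


ops = ["exp", "add", "reduce_sum", "matmul"]
tiles = [32, 64, 128]
dtypes = ["float16", "bfloat16", "float32"]

rng = random.Random(0)  # fixed seed so that a failing case can be replayed
cases = [sample_case(rng, ops, tiles, dtypes) for _ in range(5)]

for case in cases:
    print(case)
```

Seeding the generator per test keeps the sampled configurations reproducible, which matters when a reported mismatch has to be rerun under a debugger.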

Minimizing False Positives

To maintain high detection precision, Tile-CBDetect incorporates several safeguards:

  • Domain sanitization: test data is adjusted per operator to avoid invalid values (e.g., using x = x.abs() + ε for log and rsqrt) and prevent NaNs or undefined behavior.
  • Relative error metrics: correctness is checked via relative differences using PyTorch outputs as reference, with configurable tolerances and optional higher-precision references for sensitive operators.
  • Stable performance measurements: multiple warmup and measurement runs (e.g., 3–5) and optional stable-timing modes reduce noise in timing-based oracles.
  • Workload gating: performance checks are applied only above a configurable minimum work threshold to avoid flagging trivial workloads.

Minor floating-point deviations and co-located NaNs are not considered bugs, reducing noise in correctness oracles.
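
A correctness oracle with these properties can be sketched in plain Python. The tolerances below are placeholder values, and `outputs_match` is a hypothetical function written for illustration; it is not the comparison routine used by Tile-CBDetect.

```python
import math


def outputs_match(actual, expected, rtol=1e-3, atol=1e-5):
    """Element-wise comparison against a reference output.

    NaNs at the same position in both outputs are accepted; a NaN on one
    side only, mismatched infinities, or an error above tolerance is not.
    """
    if len(actual) != len(expected):
        return False
    for a, e in zip(actual, expected):
        if math.isnan(a) or math.isnan(e):
            if math.isnan(a) and math.isnan(e):
                continue      # co-located NaN: not reported
            return False      # NaN on one side only: reported
        if math.isinf(a) or math.isinf(e):
            if a != e:
                return False  # infinities must agree in sign
            continue
        if abs(a - e) > atol + rtol * abs(e):
            return False
    return True


nan = float("nan")
print(outputs_match([1.0, nan], [1.0005, nan]))  # prints: True
print(outputs_match([1.0, nan], [1.0, 2.0]))     # prints: False
print(outputs_match([1.0], [1.1]))               # prints: False
```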

Experimental Evaluation

Experiments were conducted on a server with an Intel Xeon w5-2545 CPU and two RTX 4000 Ada GPUs (CUDA 12.8). For each operator, Tile-CBDetect generated 500 input samples and executed three independent runs per configuration to reduce randomness from mutations and timing noise.

Across 17 Triton operators, Tile-CBDetect identified nine previously unknown codegen bugs, covering both correctness violations and performance anomalies. These issues have been reported to the Triton development team, and updates will be released through the project website.

Limitations and Outlook

Tile-CBDetect is designed as a preliminary, lightweight detector to demonstrate the feasibility of systematic codegen testing for tile-based programs. Key limitations include:

  • Operator coverage: support is currently limited to 17 core operators; extending to convolutions, softmax, and domain-specific kernels is future work.
  • Oracle scope: while several differential and metamorphic tests are implemented, more expressive assertions and invariants are needed to capture subtle semantic bugs.
  • Framework generality: the current implementation targets Triton; additional frontends and backends are required for TileLang, Warp, TVM, and other DSLs.
  • Assertion fidelity: correctness checks focus on numerical equivalence within tolerance bounds and do not yet enforce formal specifications or account for hardware nondeterminism.

Despite these constraints, Tile-CBDetect fills a critical gap in testing tile-level code generation. It provides a practical foundation for more comprehensive, architecture-aware testing frameworks for tile programming systems.