This is the largest category. These bugs arise when the compiler synthesizes arithmetic
behavior (value propagation, type promotion, and operator logic) from tile shapes,
masks, and fusion strategies, and subtle mismatches silently corrupt results.
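To make the type-promotion aspect concrete, the NumPy sketch below contrasts a reduction accumulated in the input dtype (float16) with one whose accumulator is promoted to float32. It is illustrative only: it assumes a hypothetical lowering that skips accumulator promotion and is not taken from any of the compilers discussed in this section.

```python
import numpy as np

# 4096 float16 values of 30.0; the exact sum (122880) exceeds the
# float16 maximum of 65504.
vals = np.full(4096, 30.0, dtype=np.float16)

# Hypothetical buggy lowering: accumulate in the input dtype (float16).
acc16 = np.float16(0.0)
with np.errstate(over="ignore"):
    for v in vals:
        acc16 = acc16 + v  # float16 + float16 stays float16 and eventually overflows to inf

# Expected lowering: promote the accumulator to float32 before reducing.
acc32 = vals.astype(np.float32).sum()

print(acc16, acc32)  # inf 122880.0
```

Nothing crashes in the float16 path; the result is simply wrong, which is why this class of bug tends to corrupt results silently.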
4.5.1 Special-Value Handling Bugs
Description
Special-value handling bugs occur when the compiler mishandles floating-point
edge cases (NaN, denormals, signed zeros) during tile-level execution.
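One way special values slip through tile-level execution is when out-of-bounds lanes are neutralized arithmetically instead of by selection. The NumPy sketch below is illustrative and not drawn from a specific compiler; it assumes a padding lane that happens to hold NaN and shows that multiplying by a 0/1 mask does not remove it (0 × NaN is NaN), whereas an element-wise select does.

```python
import numpy as np

# A 4-lane "tile" whose last lane lies beyond the array bound.
# Assumption: the padding lane happens to contain NaN (e.g. stale memory).
tile = np.array([1.0, 2.0, 3.0, np.nan], dtype=np.float32)
mask = np.array([True, True, True, False])

# Masking by multiplication: 0 * NaN is still NaN, so the reduction is poisoned.
sum_mul = (tile * mask.astype(np.float32)).sum()

# Masking by selection: the padding lane is replaced outright.
sum_sel = np.where(mask, tile, np.float32(0.0)).sum()

print(sum_mul, sum_sel)  # nan 6.0
```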
Example: NVIDIA Warp #688
# Reproducer from the issue
import numpy as np
import warp as wp

def tile_cholesky(M, L, tilesize):
    @wp.kernel
    def cholesky(M: wp.array(dtype=wp.float32, ndim=2),
                 L: wp.array(dtype=wp.float32, ndim=2)):
        M_tile = wp.tile_load(M, shape=(tilesize, tilesize))
        L_tile = wp.tile_cholesky(M_tile)
        wp.tile_store(L, L_tile)

    wp.launch_tiled(cholesky, dim=(1,), inputs=[M, L], block_dim=32)

dim = 5; tilesize = 4  # tile is 4×4, but the source array is 5×5
Mwp = wp.array(np.eye(dim) * 2.0, dtype=wp.float32)
Lwp = wp.zeros((dim, dim), dtype=wp.float32)
tile_cholesky(Mwp, Lwp, tilesize)

# Compare the leading 4×4 block against NumPy's Cholesky factor
err = np.linalg.cholesky(np.eye(dim) * 2.0)[:tilesize, :tilesize] \
    - Lwp.numpy()[:tilesize, :tilesize]
print(err)  # ❌ Unexpected results, including NaNs
Root Cause
wp.tile_load does not correctly handle 2D source arrays whose strides are
incompatible with the tile shape. When the source array is 5×5 but the tile is
4×4, the mismatch in memory layout causes some tile lanes to read uninitialized
or out-of-bounds values, producing NaNs in the output. The corresponding
CHANGELOG entry reads: "Fix 2D tile load when source array and tile have
incompatible strides."
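The NumPy sketch below illustrates how a stride mismatch alone can inject NaNs into a tiled Cholesky. It is not Warp's implementation: `load_tile` and `naive_cholesky` are hypothetical helpers, and the sketch assumes a faulty loader that advances rows by the tile width (4) instead of the source row stride (5). This is one concrete way a layout mismatch can go wrong; the actual Warp bug may instead read uninitialized or out-of-bounds lanes. Here the misaligned gather turns the diagonal input into a singular, non-positive-definite tile, and an unguarded factorization then divides zero by zero.

```python
import numpy as np

def load_tile(flat, rows, cols, row_stride):
    """Hypothetical tile loader: gathers a rows x cols tile from a flat
    row-major buffer, assuming consecutive rows are row_stride elements apart."""
    r = np.arange(rows)[:, None]
    c = np.arange(cols)[None, :]
    return flat[r * row_stride + c]

def naive_cholesky(A):
    """Unguarded Cholesky: like a GPU kernel it never raises, so a
    non-positive-definite input shows up as NaN/inf instead of an error."""
    n = A.shape[0]
    L = np.zeros_like(A)
    with np.errstate(divide="ignore", invalid="ignore"):
        for j in range(n):
            L[j, j] = np.sqrt(A[j, j] - np.dot(L[j, :j], L[j, :j]))
            for i in range(j + 1, n):
                L[i, j] = (A[i, j] - np.dot(L[i, :j], L[j, :j])) / L[j, j]
    return L

dim, tilesize = 5, 4
src = (np.eye(dim) * 2.0).astype(np.float32).ravel()  # 5x5 source, row-major

good = load_tile(src, tilesize, tilesize, row_stride=dim)       # stride 5: correct
bad = load_tile(src, tilesize, tilesize, row_stride=tilesize)   # stride 4: mismatched

print(np.isnan(naive_cholesky(good)).any())  # False
print(np.isnan(naive_cholesky(bad)).any())   # True
```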
4.5.2 Data-Type Semantics Bugs
Description
Data-type semantics bugs arise when incorrect type propagation breaks the
mathematical structure of tile-level layout and index transformations.
Example: Apache TVM #14112
# TVM MetaSchedule — int8 conv2d workload
# Replaying the apply_trace function printed by print(sch.trace) fails.
# Error stack: TransformLayout → transform_block_layout
# → IndexMap::NonSurjectiveInverse
#
# These transforms implement tiled layout algebra, e.g.:
# (n, y, x) → n × 64 + y × 8 + x
# Schedule trace contains index maps using T.int64(...).
# Data-type mismatch with int8 workload breaks these transforms.
# → Compilation failure
Root Cause
Tiled layout transformation passes fail on an int8 workload, raising errors in
TransformLayout and IndexMap::NonSurjectiveInverse. The issue does not
diagnose the precise internal mechanism, but the failure pattern is consistent
with a data-type mismatch (index maps written with T.int64(...) applied to an
int8 workload) breaking the tiled layout transforms.
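The layout algebra in the trace comment can be checked mechanically. The NumPy sketch below is illustrative and is not TVM's IndexMap: it builds the forward map (n, y, x) → n × 64 + y × 8 + x and its inverse, and confirms they round-trip when index arithmetic uses int64. As a deliberately exaggerated, hypothetical stand-in for a dtype mismatch, it repeats the check with int8 index arithmetic, where n × 64 overflows and the inverse no longer recovers the original coordinates. In the issue itself, int8 is the workload's data type, not the index type.

```python
import numpy as np

def forward(n, y, x):
    # Layout map from the trace comment: (n, y, x) -> n*64 + y*8 + x
    return n * 64 + y * 8 + x

def inverse(i):
    # Inverse of forward, valid when y and x lie in [0, 8)
    return i // 64, (i // 8) % 8, i % 8

def round_trips(index_dtype, n_extent=4):
    """Check inverse(forward(p)) == p for every point p in the domain,
    with all index arithmetic carried out in index_dtype."""
    n, y, x = (a.astype(index_dtype) for a in np.meshgrid(
        np.arange(n_extent), np.arange(8), np.arange(8), indexing="ij"))
    n2, y2, x2 = inverse(forward(n, y, x))
    return bool((n2 == n).all() and (y2 == y).all() and (x2 == x).all())

print(round_trips(np.int64))  # True
print(round_trips(np.int8))   # False: n*64 wraps past 127 for n >= 2
```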
4.5.3 Operator Implementation Bugs
Description
Operator implementation bugs arise when the compiler synthesizes incorrect
logic for operators under tiling, masking, or fusion.
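A small example of operator logic that depends on masking: when a reduction such as argmax runs over a tile wider than the valid data, the padding lanes must hold the reduction's identity (−inf for max). The NumPy sketch below is illustrative and not taken from Triton; it assumes a hypothetical lowering that pads with 0.0, which silently wins the argmax whenever every valid score is negative.

```python
import numpy as np

# One row of scores, padded from 3 valid columns up to a tile width of 4.
row = np.array([-3.0, -1.0, -2.0], dtype=np.float32)
tile_width = 4
valid = np.arange(tile_width) < row.size

# Hypothetical buggy lowering: padding lanes are filled with 0.0.
padded_zero = np.zeros(tile_width, dtype=np.float32)
padded_zero[:row.size] = row

# Correct lowering: padding lanes use the identity of max, i.e. -inf.
padded_ninf = np.where(valid, padded_zero, np.float32(-np.inf))

print(np.argmax(padded_zero))  # 3 (points at a padding lane)
print(np.argmax(padded_ninf))  # 1 (the true argmax)
```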
Example: Triton #1846
# Reproducer from the issue (also Figure 3 in our paper)
import torch
import triton
import triton.language as tl

@triton.jit
def matmul_argmax_kernel(A, B, Out,
                         BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_d = tl.arange(0, BLOCK_K)
    # Row-major pointers: each row of A and B holds BLOCK_K (= K) elements
    a_ptrs = A + offs_m[:, None] * BLOCK_K + offs_d[None, :]
    b_ptrs = B + offs_n[:, None] * BLOCK_K + offs_d[None, :]
    a = tl.load(a_ptrs)
    b = tl.load(b_ptrs)
    dist = tl.dot(a, tl.trans(b))      # (BLOCK_M, BLOCK_N) scores
    assignments = tl.argmax(dist, 1)   # best column index per row
    tl.store(Out + offs_m, assignments)

M, N, K = 32, 32, 32
a = torch.rand(M, K, dtype=torch.float16, device="cuda")
b = torch.rand(N, K, dtype=torch.float16, device="cuda")
out = torch.zeros(M, dtype=torch.int32, device="cuda")
matmul_argmax_kernel[(1,)](a, b, out, BLOCK_M=M, BLOCK_N=N, BLOCK_K=K)
# ❌ SIGSEGV during compilation
# gdb: crash in mlir::OperationFolder::tryToFold()
# within TritonGPURemoveLayoutConversionsPass
Root Cause
A semantically valid tl.dot + tl.argmax composition triggers a segfault during
compilation. The gdb trace places the crash in
TritonGPURemoveLayoutConversionsPass, where the pattern rewriter attempts to
fold operations produced by lowering the argmax reduction. In our paper's
analysis, the reduction algorithm the compiler synthesizes makes incorrect
assumptions about the layout of the tl.dot output.
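Once such a kernel compiles, a host-side reference is a natural oracle for differential testing. The NumPy sketch below uses a hypothetical helper, `matmul_argmax_reference` (not part of Triton), to compute what `matmul_argmax_kernel` is meant to produce: for each row of A, the index of the row of B with the largest dot product. It assumes float32 accumulation, matching tl.dot's default `out_dtype` of `tl.float32`.

```python
import numpy as np

def matmul_argmax_reference(a, b):
    """Reference for matmul_argmax_kernel: argmax over columns of a @ b.T.
    a: (M, K), b: (N, K); returns int32 indices of shape (M,)."""
    dist = a.astype(np.float32) @ b.astype(np.float32).T  # (M, N) scores
    return np.argmax(dist, axis=1).astype(np.int32)

# Tiny hand-checkable case: row i of a equals row perm[i] of b,
# so the best match for row i is perm[i].
b = np.eye(4, dtype=np.float16)
perm = np.array([2, 0, 3, 1])
a = b[perm]
print(matmul_argmax_reference(a, b))  # [2 0 3 1]
```

On a GPU, the same helper could be applied to `a.cpu().numpy()` and `b.cpu().numpy()` from the reproducer and compared with `out.cpu().numpy()`. Exact ties in the scores may be broken differently by the kernel, so random inputs are preferable to structured ones.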