Characterizing Real-World Bugs in Tile Programs for Automated Bug Detection

Abstract

Tile-based programming frameworks are increasingly adopted to write high-performance GPU kernels in domains such as deep learning and scientific computing. While these frameworks enhance productivity and hardware utilization, their multi-stage compilation pipelines introduce distinct code generation bugs that are tightly coupled to input shapes, data types, and backend targets. These bugs often manifest as silent correctness or performance issues, making them difficult to detect using existing compiler testing tools. Additionally, the unique programming conventions of tile domain-specific languages complicate root cause identification, while fixing such bugs demands specialized knowledge of tile abstractions and compilation pipelines. Despite the growing adoption of tile-based systems, their code generation bugs remain largely unexplored.
This paper presents the first systematic study of tile-program code generation bugs. We analyze 301 real-world bug reports from GitHub and categorize their root causes, their symptoms, the input patterns that trigger them, the test oracles that expose them, and the strategies used to fix them. Our study provides foundational insights for building debugging, testing, and repair tools tailored to tile-based compiler infrastructures.

Root-cause dimension	Traditional compiler (source-explicit; e.g., GCC, LLVM)	DL compiler (spec-fixed; e.g., TVM, XLA)	Parallel kernel (programmer-written; e.g., CUDA, OpenCL)	Tile-based DSL compiler (compiler-synthesized; e.g., Triton, TileLang; this study)
RC1: Control flow & scheduling	Failures mainly come from misoptimizing existing branches, loops, and dependencies in source-explicit control flow.	Failures mainly come from schedule selection, fusion decisions, and lowering choices over fixed operator-graph semantics.	Failures mainly come from programmer-written branches, divergence guards, and barriers.	The compiler synthesizes predicates, masks, synchronization, and warp control from tile iteration space, thread mapping, and shape specialization. Bugs arise when the compiler infers incorrect mask coverage for partial tiles, reorders instructions across implicit sync points, or propagates inconsistent warp metadata during lowering. Share: 5.31%. Trigger: shape-dependent; often latent under full-tile execution, exposed only near tile boundaries (e.g., sizes 31/32/33 or 63/64/65).
RC2: IR construction & transformation	Failures arise from malformed IR or unsafe rewrites over explicit control and dataflow.	Failures arise from graph lowering, fusion, and rewrite mistakes over fixed operator semantics.	This is usually not the dominant failure space; most issues appear directly in source-level kernel logic.	Tile IRs materialize per-tile buffers, layouts, masked regions, and mapping metadata. Bugs arise when the compiler constructs or rewrites these synthesized tile semantics incorrectly (e.g., index-space misalignment after fusion, dropping boundary tiles during predicate simplification, or non-deterministic IR construction defeating kernel caching). Share: 16.28%. Trigger: tile-config-dependent; often activated only when specific tile shapes interact with specific IR pass combinations.
RC3: Tile mapping & launch	Not usually a central failure space.	Launch and execution parameters are chosen through schedule search and cost-model evaluation over fixed operators. Failures arise when these selected parameters produce invalid or inefficient execution settings.	Failures come from programmer-specified grid/block dimensions and explicit logical-to-physical mapping mistakes.	The compiler derives launch geometry and logical-tile-to-thread/block mapping from tile abstractions, where the programmer never specifies these. Bugs arise when this synthesized mapping breaks spatial coverage or legality under dynamic or non-divisible shapes. While related scheduling or mapping decisions may exist elsewhere, tile compilers make launch-geometry inference from tile structure a first-class part of lowering, creating a distinct and systematic failure mode. Share: 6.31%. Trigger: shape- & config-dependent; e.g., increasing max_tiles from 2→3 exposes launch-namespace conflicts, and dynamic batch dims produce invalid grids.
RC4: Memory	Failures mainly come from incorrect addressing, alias analysis, buffer transformation, or memory optimization over source-level memory behavior.	Failures mainly come from buffer planning, layout conversion, operator lowering, or backend interaction. Memory decisions are primarily expressed at the tensor/operator and schedule level.	Failures mainly come from programmer-written indexing, shared-memory use, races, or missing synchronization.	The compiler must map logical tiles onto hardware execution and the memory hierarchy: ownership, shared-memory staging, register/shared/global movement, caching, and layout. Bugs arise when this tile-to-hardware mapping is inconsistent with access patterns, reuse assumptions, or resource constraints. Share: 19.27%. Trigger: layout- & shape-dependent; often exposed near tile boundaries, under shared-memory pressure, or through premature deallocation / incorrect tile-configuration caching.
RC5: Type & operator handling	Failures involve generic type lowering or operator miscompilation.	Failures mainly arise at the operator/framework level, such as type promotion, shape/type semantics, or backend dispatch mismatches.	Numeric and operator mistakes are mostly source-authored.	The compiler specializes operators and datatypes for tile shapes, masks, mixed precision, and backend intrinsics. Bugs arise when specialization breaks operator semantics or numeric meaning (e.g., incorrect lowering of fused operators at tile boundaries, missing NaN/Inf propagation under partial-tile masks, or ill-defined index-map algebra). This is the largest category, reflecting aggressive specialization without standard fallback paths. Share: 48.84%. Trigger: dtype- & shape-dependent; mixing float16/bfloat16/int8, extreme values (NaN, Inf, signed zeros), boundary-adjacent shapes.
RC6: Device specific	Failures arise from backend target assumptions or unsupported codegen paths.	Failures arise from hardware or library incompatibilities below the operator abstraction boundary.	Failures arise when manually written kernels assume unsupported hardware behavior or resource layouts.	The compiler commits to legality-constrained fragment choices, layouts, and resource usage. A semantically valid tile program can still lower to hardware-incompatible code when compiler assumptions do not match device constraints (e.g., tile dimensions incompatible with MMA atom layouts on a specific GPU generation). Share: 3.98%. Trigger: device- & tile-size-dependent; surfaces only at runtime on specific GPU architectures or driver/toolchain versions.
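
The shape-dependent triggers in the table (sizes 31/32/33 or 63/64/65) can be illustrated with a small pure-Python model of tile coverage. This is only a sketch under stated assumptions: the tile size BLOCK = 32 and the helper names ceil_div and covered_elements are hypothetical, and the floor-division launch is an invented stand-in for a dropped boundary tile, not a bug taken from the dataset.

```python
def ceil_div(n: int, d: int) -> int:
    """Number of tiles of size d needed to cover n elements."""
    return (n + d - 1) // d


def covered_elements(n: int, block: int, num_tiles: int) -> int:
    """Count elements touched when every lane is guarded by `offset < n`."""
    written = 0
    for tile in range(num_tiles):
        start = tile * block
        # Per-lane predicate, analogous to a synthesized boundary mask.
        written += sum(1 for lane in range(block) if start + lane < n)
    return written


BLOCK = 32  # hypothetical tile size

for n in (31, 32, 33, 63, 64, 65):
    correct = covered_elements(n, BLOCK, ceil_div(n, BLOCK))
    # Illustrative defect: a floor-division grid drops the partial tile.
    buggy = covered_elements(n, BLOCK, n // BLOCK)
    status = "ok" if buggy == n else "MISSING %d" % (n - buggy)
    print(n, correct, buggy, status)
```

In this model the defective launch agrees with the correct one only when the size is a multiple of the tile size (32 and 64), which is why a test suite restricted to full-tile shapes would never observe the problem.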

Preliminary Tool: Tile-CBDetect

Overview

Tile-CBDetect is a lightweight, Python-based detection framework for automatically identifying code generation bugs in tile-based GPU programs. The tool integrates automated test input generation with differential testing against PyTorch reference implementations, enabling end-to-end validation of tile kernels under diverse configurations.

The design of Tile-CBDetect is guided by the study's empirical findings on how bugs manifest, in particular the importance of input diversity and of comparing outputs across platforms. It systematically constructs tile kernels and paired PyTorch implementations, then executes both to detect correctness and performance anomalies.

Capabilities and Testing Workflow

Tile-CBDetect automatically handles:

Dynamic construction of target tile kernels and matching PyTorch reference code.
Automated input generation, compilation, and execution of tile programs.
Memory allocation, kernel launch, and result collection.
Differential comparison between tile and PyTorch outputs within numerical tolerances.

This workflow supports end-to-end testing of tile code generation correctness, and can be extended to test multiple backends or compilers.
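
The differential step of this workflow can be sketched in plain Python, with ordinary callables standing in for the tile kernel and the PyTorch reference. Everything named here (run_differential, reference_exp, tiled_exp_buggy, the tolerance values) is a hypothetical illustration rather than Tile-CBDetect's actual interface, and the buggy stand-in kernel is invented for the example.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]


def run_differential(
    kernel: Callable[[Vector], List[float]],
    reference: Callable[[Vector], List[float]],
    inputs: Iterable[Vector],
    rel_tol: float = 1e-3,
    abs_tol: float = 1e-6,
) -> List[Tuple[int, int]]:
    """Return (input_index, element_index) pairs where the outputs disagree."""
    mismatches = []
    for i, x in enumerate(inputs):
        got, want = kernel(x), reference(x)
        if len(got) != len(want):
            mismatches.append((i, -1))  # -1 marks a shape mismatch
            continue
        for j, (g, w) in enumerate(zip(got, want)):
            if not math.isclose(g, w, rel_tol=rel_tol, abs_tol=abs_tol):
                mismatches.append((i, j))
    return mismatches


def reference_exp(x: Vector) -> List[float]:
    """Reference implementation (plays the role of the PyTorch operator)."""
    return [math.exp(v) for v in x]


def tiled_exp_buggy(x: Vector, block: int = 4) -> List[float]:
    """Stand-in kernel that processes full tiles only and skips the remainder."""
    out = [0.0] * len(x)
    for start in range(0, len(x) - len(x) % block, block):
        for k in range(start, start + block):
            out[k] = math.exp(x[k])
    return out


# A length-4 input passes; a length-5 input exposes the skipped element.
print(run_differential(tiled_exp_buggy, reference_exp, [[0.0] * 4, [0.5] * 5]))
```

Reporting the index of each disagreeing element, rather than a single pass/fail flag, makes it easier to see whether failures cluster at tile boundaries.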

Supported Operators

The current implementation targets tile programs written in Triton and covers a representative set of commonly used operators:

Unary operators: exp, log, sqrt, rsqrt, tanh, sigmoid, abs, neg, relu.
Binary operators: add, sub, mul, div, max, min.
Reductions: row-wise reduce_sum over the last dimension of 2D tensors.
Matrix multiplication: tiled matmul under various matrix and tile configurations.

Operators are tested across float16, bfloat16, and float32 data types. Although the current implementation is Triton-specific, the framework is modular and can be extended to other DSLs such as TileLang, TVM, or Warp.

Input Generation Strategies

Tile-CBDetect supports both deterministic and fuzzed input suites to exercise diverse code paths and stress tiling boundaries.

Deterministic Differential Suite

The deterministic suite uses structured workloads designed to expose edge cases in tensor shapes, tiling boundaries, and type combinations, for example:

Unary operators on 1D tensors of length 1, 33, 1024, 10000, and 65537.
Binary operators on aligned pairs of length 33, 1024, and 10000.
Reductions on 2D tensors with shapes (33, 64), (1024, 128), and (65537, 64).
Matrix multiplications on triplets such as (96, 64, 96), (256, 256, 48), (129, 1024, 96) combined with tile configurations like (64, 32, 16) and (128, 64, 64).
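
As a rough sketch, the deterministic suite can be written as a small table of cases. The lengths, matmul triplets, and tile configurations below are the ones listed above; the 1D block size of 1024 and the helper names are assumptions made for illustration. The text does not spell out the ordering of each triplet, so the sketch treats them as opaque tuples.

```python
from itertools import product

UNARY_LENGTHS = [1, 33, 1024, 10000, 65537]
MATMUL_SHAPES = [(96, 64, 96), (256, 256, 48), (129, 1024, 96)]
TILE_CONFIGS = [(64, 32, 16), (128, 64, 64)]

BLOCK = 1024  # hypothetical 1D tile size, not taken from the tool


def tile_stats(length: int, block: int) -> dict:
    """Describe how a 1D tensor splits into tiles of size `block`."""
    full, rem = divmod(length, block)
    return {
        "length": length,
        "tiles": full + (1 if rem else 0),
        "partial": rem,  # number of live lanes in the last tile, 0 if none
    }


def matmul_cases() -> list:
    """Pair every matmul shape triplet with every tile configuration."""
    return [{"shape": s, "tile": t} for s, t in product(MATMUL_SHAPES, TILE_CONFIGS)]


for length in UNARY_LENGTHS:
    print(tile_stats(length, BLOCK))
print(len(matmul_cases()), "matmul cases")
```

Under the assumed block size, lengths such as 33, 10000, and 65537 all leave a partial final tile, which is the situation the suite is designed to stress.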

Fuzzed and Tile-Aware Inputs

The fuzzer mode extends deterministic suites by:

Randomly selecting operator classes and input shapes (e.g., changing lengths to 257).
Sampling tiling and data type configurations from the deterministic pool.
Using random data (scaled by 0.5) for numerical stability and casting to target dtypes.
Optionally seeding per test for reproducible matmul and mixed-type cases.
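
The fuzzing steps above can be sketched as a seeded case generator. This is an illustration, not the tool's code: make_fuzz_case and the pools are hypothetical names, the Gaussian distribution is an assumption (the text only says the random data is scaled by 0.5), and the dtype is recorded as a label rather than applied as a cast.

```python
import random

OP_CLASSES = ["unary", "binary", "reduction", "matmul"]
DTYPES = ["float16", "bfloat16", "float32"]
TILE_CONFIGS = [(64, 32, 16), (128, 64, 64)]
LENGTHS = [1, 33, 257, 1024, 10000]  # includes a perturbed length (257)


def make_fuzz_case(seed: int) -> dict:
    """Build one reproducible fuzz case from a per-test seed."""
    rng = random.Random(seed)  # private generator, so cases are independent
    length = rng.choice(LENGTHS)
    return {
        "seed": seed,
        "op_class": rng.choice(OP_CLASSES),
        "length": length,
        "dtype": rng.choice(DTYPES),
        "tile": rng.choice(TILE_CONFIGS),
        # Random data scaled by 0.5; the distribution is an assumption.
        "data": [0.5 * rng.gauss(0.0, 1.0) for _ in range(length)],
    }


case = make_fuzz_case(7)
print(case["op_class"], case["length"], case["dtype"], case["tile"])
```

Because each case owns its generator, rerunning a failing seed reproduces the same operator class, shape, configuration, and data without replaying the whole campaign.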

While current input generation relies on structured suites and tile-aware randomization, integration of advanced techniques (e.g., constraint solving, symbolic execution, grammar-based fuzzing) is left as future work.

Minimizing False Positives

To maintain high detection precision, Tile-CBDetect incorporates several safeguards:

Domain sanitization: test data is adjusted per operator to avoid invalid values (e.g., using x = x.abs() + ε for log and rsqrt) and prevent NaNs or undefined behavior.
Relative error metrics: correctness is checked via relative differences using PyTorch outputs as reference, with configurable tolerances and optional higher-precision references for sensitive operators.
Stable performance measurements: multiple warmup and measurement runs (e.g., 3–5) and optional stable-timing modes reduce noise in timing-based oracles.
Workload gating: performance checks are applied only above a configurable minimum work threshold to avoid flagging trivial workloads.

Minor floating-point deviations and co-located NaNs are not considered bugs, reducing noise in correctness oracles.
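
A minimal sketch of the domain sanitization and tolerance-based comparison described above follows. The function names, the value of EPS, and the default tolerances are assumptions chosen for illustration; the tool's actual thresholds are configurable and are not given in the text.

```python
import math
from typing import List, Sequence

EPS = 1e-6  # hypothetical epsilon for domain sanitization


def sanitize(op: str, xs: Sequence[float]) -> List[float]:
    """Move inputs into the operator's valid domain (x = |x| + eps for log, rsqrt)."""
    if op in ("log", "rsqrt"):
        return [abs(x) + EPS for x in xs]
    return list(xs)


def outputs_match(
    got: Sequence[float],
    want: Sequence[float],
    rel_tol: float = 1e-3,
    abs_tol: float = 1e-6,
) -> bool:
    """Compare against the reference, tolerating small deviations and co-located NaNs."""
    if len(got) != len(want):
        return False
    for g, w in zip(got, want):
        if math.isnan(g) and math.isnan(w):
            continue  # NaN in the same position on both sides is not reported
        if math.isnan(g) or math.isnan(w):
            return False  # NaN on one side only is a real discrepancy
        if g == w:
            continue  # also covers infinities of the same sign
        if math.isinf(g) or math.isinf(w):
            return False  # mismatched or one-sided infinity
        # Relative error measured against the reference value.
        if abs(g - w) > max(rel_tol * abs(w), abs_tol):
            return False
    return True


nan = float("nan")
print(outputs_match([nan, 1.0], [nan, 1.0005]))  # small deviation, shared NaN
print(outputs_match([nan], [1.0]))               # NaN on one side only
```

The explicit infinity check matters: without it, comparing +Inf with -Inf would compute an infinite tolerance and silently pass.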

Experimental Evaluation

Experiments were conducted on a server with an Intel Xeon w5-2545 CPU and two RTX 4000 Ada GPUs (CUDA 12.8). For each operator, Tile-CBDetect generated 500 input samples and executed three independent runs per configuration to reduce randomness from mutations and timing noise.

Across 17 Triton operators, Tile-CBDetect identified nine previously unknown codegen bugs, covering both correctness violations and performance anomalies. These issues have been reported to the Triton development team, and updates will be released through the project website.

Limitations and Outlook

Tile-CBDetect is designed as a preliminary, lightweight detector to demonstrate the feasibility of systematic codegen testing for tile-based programs. Key limitations include:

Operator coverage: support is currently limited to 17 core operators; extending to convolutions, softmax, and domain-specific kernels is future work.
Oracle scope: while several differential and metamorphic tests are implemented, more expressive assertions and invariants are needed to capture subtle semantic bugs.
Framework generality: the current implementation targets Triton; additional frontends and backends are required for TileLang, Warp, TVM, and other DSLs.
Assertion fidelity: correctness checks focus on numerical equivalence within tolerance bounds and do not yet enforce formal specifications or account for hardware nondeterminism.

Despite these constraints, Tile-CBDetect fills a critical gap in testing tile-level code generation. It provides a practical foundation for more comprehensive, architecture-aware testing frameworks for tile programming systems.

Characterizing Real-World Bugs in Tile Programs for Automated Bug Detection [ISSTA 2026]

Abstract

Bug Dataset

Bug Causes Showcase

4.1 Control Flow and Scheduling Bugs

4.1.1 Branch Predication Bugs

4.1.2 Instruction Scheduling Bugs

4.1.3 Warp Control Bugs

4.2 IR Construction and Transformation Bugs

4.2.1 IR Construction Bugs

4.2.2 IR Transformation Bugs

4.3 Tile Mapping and Launch Bugs

4.3.1 Launch Configuration Bugs

4.3.2 Thread-Block Mapping Bugs

4.4 Memory Bugs

4.4.1 Indexing and Strides Bugs

4.4.2 Resource Allocation Bugs

4.4.3 Ordering and Caching Bugs

4.5 Type and Operator Bugs

4.5.1 Special-Value Handling Bugs

4.5.2 Data-Type Semantics Bugs

4.5.3 Operator Implementation Bugs

4.6 Device-Specific Bugs

Root-Cause Failure Spaces

Tile Codegen Bug Detector (Tile-CBDetect)