Performance Tuning Demos¶
In this part, we iteratively optimize three Croqtile GEMM kernels and a fused MoE kernel on an H800 PCIe GPU (SM90a, 114 SMs). Each demo is written as a continuous worklog: start from a correct baseline, measure against hardware limits, change one thing, re-measure, and tell the story of why each optimization works.
Before diving in, skim Setting Up: TimerOption, TFLOPS, and HW Efficiency for how timing and efficiency are computed — every story uses the same harness.
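The accounting behind those numbers is the standard GEMM convention: a dense M × N × K GEMM counts as 2·M·N·K floating-point operations, TFLOPS is that count divided by the measured time, and HW efficiency is achieved TFLOPS over the device's datasheet peak for the dtype. A minimal sketch of that arithmetic (plain host code, not the Croqtile harness; the peak and timing values below are placeholders):

```cuda
// Illustrative sketch, not the Croqtile harness: the standard GEMM accounting.
// A dense M x N x K GEMM is counted as 2*M*N*K floating-point operations
// (one multiply plus one add per inner-product term).
#include <cstdio>

double gemm_tflops(long long M, long long N, long long K, double ms) {
    double flops = 2.0 * M * N * K;        // multiply-accumulate counted as 2 ops
    return flops / (ms * 1e-3) / 1e12;     // ops per second, scaled to tera
}

double hw_efficiency(double achieved_tflops, double peak_tflops) {
    return achieved_tflops / peak_tflops;  // fraction of the device's tensor-core peak
}

int main() {
    // PLACEHOLDER peak: substitute the datasheet tensor-core peak for your
    // device and dtype; the timing below is made up for the example.
    const double peak_tflops = 1000.0;
    double t = gemm_tflops(4096, 8192, 8192, 1.2);
    printf("%.1f TFLOPS, %.1f%% of the placeholder peak\n",
           t, 100.0 * hw_efficiency(t, peak_tflops));
    return 0;
}
```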
Dense GEMM FP16: From Naive to Tuned¶
Five-stage tutorial: naive → shared memory → Hopper TMA+WGMMA → warp specialization → production-tuned. Reaches 471 TFLOPS (105% of cuBLAS) via a 28-iteration parameter sweep. Each stage introduces new Croqtile primitives with side-by-side generated CUDA. Download kernel source files.
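For orientation, this is the kind of kernel the first two stages cover: a minimal shared-memory-tiled FP16 GEMM in plain CUDA (illustrative only; the tile size, FP32 accumulation, and naming are assumptions, not the tutorial's Croqtile output):

```cuda
// Minimal shared-memory-tiled FP16 GEMM (C = A * B, row-major), plain CUDA.
// Illustrates the "naive -> shared memory" step only; tile size, FP32
// accumulation, and names are illustrative, not the tutorial's parameters.
#include <cuda_fp16.h>

#define TILE 32

__global__ void gemm_smem(const half* A, const half* B, float* C,
                          int M, int N, int K) {
    __shared__ half As[TILE][TILE];
    __shared__ half Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Stage one tile of A and one tile of B through shared memory so each
        // global element is read once per block instead of once per thread.
        As[threadIdx.y][threadIdx.x] = (row < M && k0 + threadIdx.x < K)
            ? A[row * K + k0 + threadIdx.x] : __float2half(0.0f);
        Bs[threadIdx.y][threadIdx.x] = (k0 + threadIdx.y < K && col < N)
            ? B[(k0 + threadIdx.y) * N + col] : __float2half(0.0f);
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += __half2float(As[threadIdx.y][k]) * __half2float(Bs[k][threadIdx.x]);
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
// Launch sketch: gemm_smem<<<dim3((N + 31) / 32, (M + 31) / 32), dim3(32, 32)>>>(A, B, C, M, N, K);
```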
Sparse GEMM: FP16 and E4M3¶
Structured 2:4 sparse GEMM at 4096 × 8192 × 8192. FP16: 368 → 655 TFLOPS (+78%). E4M3: 671 → 1127 TFLOPS (+68%). Metadata delivery, the .co vs .cu boundary, and the 3-stage discontinuity.
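As background on the format itself: 2:4 structured sparsity keeps two values out of every group of four along K and records which two positions survive as a pair of 2-bit indices. A simplified host-side sketch of that compression follows (illustrative only; the packing does not match any particular tensor-core metadata encoding or Croqtile's metadata path):

```cuda
// Simplified host-side 2:4 compression: for each group of 4 values along K,
// keep the 2 largest-magnitude entries and record their positions as two
// 2-bit indices (4 bits per group). The packing here is for illustration only
// and does not match any particular tensor-core metadata encoding.
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

void compress_2to4(const std::vector<float>& dense,   // length divisible by 4
                   std::vector<float>& packed,        // 2 kept values per group
                   std::vector<uint8_t>& meta) {      // one 4-bit code per group
    packed.clear();
    meta.clear();
    for (size_t g = 0; g + 4 <= dense.size(); g += 4) {
        // Pick the two largest-magnitude positions in this group of four.
        int i0 = 0, i1 = 1;
        for (int i = 1; i < 4; ++i) {
            float v = std::fabs(dense[g + i]);
            if (v > std::fabs(dense[g + i0]))                 { i1 = i0; i0 = i; }
            else if (i != i1 && v > std::fabs(dense[g + i1])) { i1 = i; }
        }
        if (i0 > i1) std::swap(i0, i1);            // keep positions in order
        packed.push_back(dense[g + i0]);
        packed.push_back(dense[g + i1]);
        meta.push_back(static_cast<uint8_t>(i0 | (i1 << 2)));
    }
}
```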
Block-Scaled GEMM FP8¶
FP8 E4M3 with per-block scaling: 397 → 621 TFLOPS (+56%). TMA overlap with scale accumulation, N256 tiles, L2 promotion, and scale prefetch.
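The per-block scaling arithmetic itself is simple: each K-block of A and B carries one scale factor, and the block's low-precision partial product is folded into the FP32 accumulator after multiplying by both scales. A scalar sketch (illustrative; int8 stands in for E4M3, and the one-scale-per-K-block granularity and names are assumptions):

```cuda
// Scalar sketch of per-block scaled accumulation (illustration only): int8_t
// stands in for E4M3, and "one scale per K-block" plus all names are
// assumptions, not the tutorial's exact layout. Requires K % block_k == 0.
#include <cstdint>
#include <vector>

float scaled_dot(const std::vector<int8_t>& a_q,     // quantized row of A
                 const std::vector<int8_t>& b_q,     // quantized column of B
                 const std::vector<float>& a_scale,  // one scale per K-block of A
                 const std::vector<float>& b_scale,  // one scale per K-block of B
                 int K, int block_k) {
    float acc = 0.0f;
    for (int k0 = 0; k0 < K; k0 += block_k) {
        // Low-precision inner product over one K-block.
        float partial = 0.0f;
        for (int k = k0; k < k0 + block_k; ++k)
            partial += float(a_q[k]) * float(b_q[k]);
        // Fold both block scales into the FP32 accumulator; this is the scale
        // accumulation that gets overlapped with the next TMA load.
        acc += partial * a_scale[k0 / block_k] * b_scale[k0 / block_k];
    }
    return acc;
}
```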
Fused MoE FP8¶
Fused Mixture-of-Experts end-to-end kernel for Qwen3.5-35B-A3B inference: 7.11 → 13.14 TFLOPS (+85%). Kernel fusion (7→4 kernels), parallel.async, CUDA Graphs, L2 persistence, QSG load pipelining, and the __cpp__ escape hatch.
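Two of the host-side pieces named above, CUDA Graphs and L2 persistence, are standard CUDA runtime features rather than Croqtile constructs. A hedged sketch of both, with placeholder buffer names and launch sequence:

```cuda
// Host-side sketches of the two CUDA runtime features named above (standard
// CUDA APIs, not Croqtile constructs); buffer names, sizes, and the launch
// sequence are placeholders.
#include <cuda_runtime.h>

void enable_l2_persistence(cudaStream_t stream, void* expert_weights, size_t bytes) {
    // Reserve part of L2 for persisting accesses, then mark the weight buffer's
    // address window as "persisting" so repeated expert-weight reads tend to
    // stay resident across the fused kernels. Real code also clamps `bytes`
    // against the device's accessPolicyMaxWindowSize.
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, bytes);
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = expert_weights;
    attr.accessPolicyWindow.num_bytes = bytes;
    attr.accessPolicyWindow.hitRatio  = 1.0f;
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}

void capture_and_replay(cudaStream_t stream) {
    // Capture the per-step launch sequence once, then replay it as a single
    // graph launch to amortize per-kernel launch overhead at small batch sizes.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    // launch_moe_kernels(stream);  // hypothetical stand-in for the fused kernel sequence
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);
}
```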