CroqTile


The Next-Gen GPU & DSA Language Achieves 5x Productivity

— by CroqTile Core Team

CroqTile | Overview

What is CroqTile?

The easiest-to-use kernel programming language

Simplicity + Performance

🛡️ Compile-Time Safety

🔷 Heterogeneous Computing

🤖 Born for AI Agents

Chapter 1

Highlight 1: Simplicity + Performance

Best of all competitors: Triton / CuTe / cuTile / Helion

CroqTile | Chapter 1 — GPU MatMul Anatomy

Demo: Tiled MatMul with Tensor Core & TMA in CroqTile

[Figure: data flow of a tiled MatMul, C [M, N] = A [M, K] × B [K, N], on tensor cores (m64n128k16 per warpgroup)]
① TMA load: tiles of A and B move from global memory (HBM) into swizzled shared memory (per SM): smem_A [TILE_M, TILE_K] = [64, 16], smem_B [TILE_K, TILE_N] = [16, 128].
② Fragment load: frag A [MMA_M, MMA_K] = [64, 16] is scattered across threads via ldmatrix (8 fp16 in 4 registers per thread); frag B [MMA_K, MMA_N] = [16, 128] is read by the hardware through a descriptor.
③ WGMMA: acc += frag_A × frag_B, repeated per K-tile; acc D [MMA_M, MMA_N] = [64, 128] holds 64 f32 per thread (16 column groups × 4 f32 = 64 registers).
④ stmatrix: the scattered registers are gathered into smem_c [TILE_M, TILE_N] = [64, 128], a contiguous tile in shared memory.
⑤ TMA store: smem_c is written back to global memory.
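The loop structure behind the five steps can be sketched in plain Python. This is only an illustration of tiling with per-K-tile accumulation, not CroqTile code: the function name `tiled_matmul` and the tiny default tile sizes are assumptions made for the example, and list slicing stands in for the TMA and tensor core steps.

```python
def tiled_matmul(a, b, tile_m=2, tile_n=2, tile_k=2):
    """Multiply a [M, K] by b [K, N] tile by tile, accumulating over K-tiles."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    assert len(b) == k, "inner dimensions must match"
    c = [[0.0] * n for _ in range(m)]
    for m0 in range(0, m, tile_m):              # one output tile per (m0, n0)
        for n0 in range(0, n, tile_n):
            m1, n1 = min(m0 + tile_m, m), min(n0 + tile_n, n)
            # fill: the accumulator tile starts at zero
            acc = [[0.0] * (n1 - n0) for _ in range(m1 - m0)]
            for k0 in range(0, k, tile_k):      # repeat per K-tile
                k1 = min(k0 + tile_k, k)
                # load: stage the A and B tiles (stand-in for the TMA copy)
                tile_a = [row[k0:k1] for row in a[m0:m1]]
                tile_b = [row[n0:n1] for row in b[k0:k1]]
                # compute: acc += tile_a x tile_b (stand-in for the MMA)
                for i in range(m1 - m0):
                    for j in range(n1 - n0):
                        acc[i][j] += sum(
                            tile_a[i][p] * tile_b[p][j] for p in range(k1 - k0)
                        )
            # store: write the finished tile back to C
            for i in range(m1 - m0):
                c[m0 + i][n0:n1] = acc[i]
    return c
```

On the GPU the three inner loops collapse into one tensor core instruction per K-tile; the outer structure (fill, load, compute, store) is the part that carries over.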
CroqTile | Chapter 1 — Simplicity + Performance

Simplicity + Performance

🏆 The most intuitive GPU & DSA programming language among all competitors
📐 Least LoC for the same kernel among all competitors
40% of equivalent CUDA code — higher abstraction than Triton

Lines of code for a persistent warp-specialized GEMM kernel:

CroqTile-Python: 30
CroqTile: 36
TileLang: 70
Triton: 80
CUDA + CuTe: 182
CUTLASS (CuTeDSL): 280

Why CroqTile Is So Simple & Intuitive

It operates on tensors, which keeps programming simple.
It fully supports low-level programming without sacrificing safety.

CroqTile: shape + coord + stride is the complete tensor interface — no raw pointers or manual layout math exposed
Triton: pointer + shape + stride + offset per tensor via make_block_ptr
CuTe: Layout = Shape + Stride — compose hierarchical layouts manually
CUDA: raw pointer + manual index arithmetic at every access
Interface elements compared (exposed to the user) for CroqTile, Triton, TileLang, CuTe, and CUDA: shape, coord, stride, layout, offset, raw pointer.
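To see why shape, coord, and stride are enough to address a tensor, here is a minimal Python sketch of the index arithmetic that a raw-pointer interface leaves to the programmer. The helper names `row_major_strides` and `linear_offset` are hypothetical and are not part of CroqTile.

```python
def row_major_strides(shape):
    """Strides (in elements) of a densely packed row-major tensor."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)


def linear_offset(shape, stride, coord):
    """Element offset of coord: the dot product of coord and stride."""
    for c, extent in zip(coord, shape):
        if not 0 <= c < extent:
            raise IndexError(f"coordinate {coord} is outside shape {shape}")
    return sum(c * s for c, s in zip(coord, stride))
```

For a [64, 16] tile, `row_major_strides((64, 16))` gives `(16, 1)`, so element (3, 5) sits at offset 3 × 16 + 5 = 53. Swapping the strides to `(1, 64)` describes the same tile in column-major order without touching the code that uses the coordinates.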

Data movement is one line — hardware details are hidden

One-line TMA/DMA: tma.copy.swiz<128> src => dst — load or store a tile
Compiler generates TMA descriptor setup (128 bytes, 10+ fields in CUDA)
No manual CUtensorMap creation, no host-side cuTensorMapEncodeTiled
Zero boilerplate — optimization via modifiers, not code rewrites
Swizzle, cache promotion, zfill, async pipeline — all one keyword
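As background on what a swizzle modifier hides, the sketch below shows one commonly described XOR pattern for a 128-byte swizzle: the 16-byte chunk index inside each 128-byte row is XORed with the row index modulo 8. This is an assumption for illustration; the slides do not state the exact pattern that `swiz<128>` emits, and `swizzle_128b` is a hypothetical helper.

```python
def swizzle_128b(byte_offset):
    """Permute 16-byte chunks within each group of eight 128-byte rows."""
    row_in_group = (byte_offset >> 7) & 0b111  # bits 7..9: row index modulo 8
    return byte_offset ^ (row_in_group << 4)   # XOR into bits 4..6: chunk index


def chunk_of(byte_offset):
    """Index (0..7) of the 16-byte chunk within its 128-byte row."""
    return (byte_offset >> 4) & 0b111
```

Without the swizzle, reading chunk 0 of eight consecutive rows touches the same chunk position eight times. With it, the eight reads land on eight different chunk positions, which is the property that avoids shared-memory bank conflicts.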

Lines of code for TMA load + store:

CroqTile: 2
TileLang: 4
Triton: 12
CUTLASS: 20
CUDA: 35

One semantic for all tensor core instructions — compiler picks the best

Complete MMA cycle in 5 lines: fill, load×2, compute, store
Encapsulates WGMMA descriptor — no manual register layout or fence
Compile-time dispatch: HMMA (SM70/80) vs WGMMA (SM90) — zero runtime overhead

Lines of code for MMA operations:

Triton: 3
TileLang: 3
CroqTile: 5
CUTLASS: 15
CUDA: 25

Sparsity is a modifier, not a rewrite — add .sp and you’re done

mma.row.row.sp — single modifier adds 2:4 structured sparsity
Metadata prepack + warp shuffle handled implicitly by compiler
Triton: not supported. CUDA: 30 LOC metadata handling
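For readers unfamiliar with 2:4 structured sparsity, the following Python sketch shows the idea behind the metadata: in every group of four values, two are kept and their positions are recorded. The functions are hypothetical, and the flat index list is a simplification; the packed metadata layout that the hardware expects is different and is what the compiler handles.

```python
def compress_2_4(row):
    """Keep the two largest-magnitude values of every group of four.

    Returns (values, meta), where meta holds the in-group position (0..3)
    of each kept value.
    """
    assert len(row) % 4 == 0, "row length must be a multiple of 4"
    values, meta = [], []
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        by_magnitude = sorted(range(4), key=lambda i: -abs(group[i]))
        keep = sorted(by_magnitude[:2])  # restore original order
        values.extend(group[i] for i in keep)
        meta.extend(keep)
    return values, meta


def decompress_2_4(values, meta):
    """Rebuild the dense row; dropped positions become zero."""
    row = []
    for g in range(0, len(values), 2):
        group = [0.0] * 4
        for v, i in zip(values[g:g + 2], meta[g:g + 2]):
            group[i] = v
        row.extend(group)
    return row
```

The compressed row stores half the values plus two small indices per group, which is why a sparse MMA can skip half of the multiplications.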

Lines of code for metadata handling only:

CroqTile: 1
TileLang: 3
Triton: N/A
CUTLASS: 15
CUDA: 30

One keyword for all parallelism — block, warp-group, thread, multi-device

Unified parallel-by: one keyword for all parallelism levels
Level specifiers (block, group-4, thread) map to GPU & DSA hardware
Same code targets NVIDIA, AMD, and custom accelerators

Parallelism primitives needed:

CroqTile: 2
TileLang: 4
Triton: 5
CUTLASS: 6
CUDA: 8

Every optimization pattern is a first-class construct — zero boilerplate

Each pattern is 1–3 change sites — structural changes localized
Persistent kernel: change parallel binding + add tile loop
Warp spec: wrap in inthreads.async — inner code unchanged
Pipeline: add shared event + wait/trigger
Minimal change sites → easy for both humans and AI to apply
AI can iterate 68 times in one session (see AI-Tuning slides)
Each optimization is additive, not a rewrite
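The "add tile loop" step of the persistent-kernel pattern can be pictured with a short Python sketch: a fixed set of workers stays resident and each one walks through several output tiles. Round-robin assignment is assumed here as one common choice; the slides do not say which schedule CroqTile uses, and `persistent_schedule` is a hypothetical name.

```python
def persistent_schedule(num_tiles, num_workers):
    """Map each resident worker to the tile indices it processes in order."""
    return {
        worker: list(range(worker, num_tiles, num_workers))
        for worker in range(num_workers)
    }
```

With 10 tiles and 4 workers, worker 0 handles tiles 0, 4, and 8, while worker 3 handles tiles 3 and 7. The body that computes one tile is the same as in the non-persistent version; only the outer binding and the loop change, which matches the "1–3 change sites" claim above.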

Zero-Cost Abstraction

CroqTile matches highly-optimized vendor libraries — higher-level syntax, near-zero performance gap

Kernel-Level Performance

Standalone kernel throughput in TFLOPS (higher is better), CroqTile vs. vendor libraries on Hopper:

GEMM FP16, 8192×8192×8192: CroqTile 471.3, PyTorch 447.5 (105.3% of PyTorch)
GEMM FP8 Blockscale, 256×5120×2048: CroqTile 262.7, SGLang CUTLASS 256.9 (102.2% of SGLang CUTLASS)
SPMM FP16, 4096×8192×8192: CroqTile 630.5, cuSPARSELt 628.5 (100.3% of cuSPARSELt)
SPMM FP8 E4M3, 12288×12288×12288: CroqTile 995.6, cuSPARSELt 952.1 (104.6% of cuSPARSELt)
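To relate these throughput figures to wall-clock time, the sketch below assumes the usual convention of 2·M·N·K floating-point operations per GEMM. The slides do not state how TFLOPS was computed, so treat the convention and the helper names as assumptions.

```python
def gemm_flops(m, n, k):
    """Floating-point operations of an M x N x K GEMM (one multiply + one add per term)."""
    return 2 * m * n * k


def runtime_ms(m, n, k, tflops):
    """Kernel time in milliseconds implied by a throughput in TFLOPS."""
    return gemm_flops(m, n, k) / (tflops * 1e12) * 1e3


def percent_of(ours, baseline):
    """Throughput of `ours` as a percentage of `baseline`, to one decimal."""
    return round(100.0 * ours / baseline, 1)
```

Under that convention, the FP16 8192×8192×8192 GEMM at 471.3 TFLOPS corresponds to a kernel time of a little over 2.3 ms, and `percent_of(471.3, 447.5)` reproduces the 105.3% shown above.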

End-to-End: Qwen3.5 on SGLang

CroqTile kernels integrated into SGLang — Hopper GPU

Qwen3.5 27B (BF16, Hopper ×1)
Metric               CroqTile   Native   Ratio
Prefill (ktokens/s)  5,498      6,200    88.7%
Decode (tokens/s)    26         28       92.9%

Qwen3.5 27B (FP8, Hopper ×1)
Metric               CroqTile   Native   Ratio
Prefill (ktokens/s)  6,500      7,400    87.8%
Decode (tokens/s)    37         39       94.9%

Why Zero-Cost?

CroqTile kernels are competitive with highly optimized vendor libraries (cuSPARSELt, CUTLASS, PyTorch)
No runtime overhead: static dispatch, no vtable, no interpreter
End-to-end throughput reaches 88–95% of native serving, with far simpler code

Chapter 2

Highlight 2: Compile-Time Code Safety

Best of all competitors: Triton / CuTe / cuTile / Helion

CroqTile | Chapter 2 — Compile-Time Safety

Why Compile-Time Checking Matters

GPU bugs are silent — wrong shapes or OOB reads produce garbage output, not crashes. No stack trace, no debugger for 10k threads.
CroqTile catches them at compile time — exact file, line, and root cause before any GPU execution.
Why? A standalone compiler (not a Python DSL) owns the full program — shapes, types, and memory hierarchy are statically visible across all passes.

Compile-Time Checks to Prevent Runtime Errors

Shape / type mismatch: incompatible ranks or element types in DMA/MMA
Out-of-bounds access: static offset/index analysis catches overruns
HW constraint violation: MMA config vs. target SM architecture
Un-awaited async: futures that are never consumed are flagged
Parallel nesting rules: illegal constructs at the wrong parallel level

In CUDA — all are silent runtime bugs

No crash, no warning — just wrong numbers or rare hangs.
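As an illustration of one of these checks, the sketch below flags asynchronous operations whose result is never waited on. It works on a toy, straight-line list of operations invented for this example; a real compiler would run the analysis on its own IR and account for control flow.

```python
def find_unawaited(ops):
    """Report futures that are issued but never waited on.

    ops is a list of ("issue", name) or ("wait", name) tuples in program order.
    """
    pending = {}  # future name -> line where it was issued
    errors = []
    for line, (kind, name) in enumerate(ops, start=1):
        if kind == "issue":
            pending[name] = line
        elif kind == "wait":
            if name in pending:
                del pending[name]
            else:
                errors.append(f"line {line}: wait on unknown future '{name}'")
    errors.extend(
        f"line {line}: future '{name}' is never waited on"
        for name, line in pending.items()
    )
    return errors
```

For the program `[("issue", "f0"), ("issue", "f1"), ("wait", "f0")]` the check reports that `f1`, issued on line 2, is never waited on. In CUDA the equivalent mistake would surface only as wrong data or a rare hang.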

Auto-Generated Runtime Assertions

When dynamic dimensions are involved, the compiler auto-generates host-side assertions — catching errors before GPU dispatch

Cross-parameter shape — shared dims (K in lhs[M,K], rhs[K,N]) must match
Static shape constants — actual shapes vs. declared values
Memory budget — cumulative allocation vs. HW limits
Tiling & iteration space — bounds > 0, non-degenerate strides
Dynamic index bounds — element indices within dimension when only known at launch
Launch limits — thread/grid products respect SM-specific maximums

In CUDA — no safety net

These errors produce silent wrong results, hangs, or UB.
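A host-side guard of this kind might look like the following Python sketch, which covers three of the listed categories: cross-parameter shape, non-degenerate bounds, and memory budget. It is not the code CroqTile generates. The function name, the fp16 element size, and the 48 KB shared-memory limit are assumptions for the example, and real limits depend on the target architecture.

```python
def check_launch(lhs_shape, rhs_shape, tile_m, tile_k, tile_n,
                 bytes_per_elem=2, smem_limit=48 * 1024):
    """Return a list of error strings; an empty list means the launch may proceed."""
    errors = []
    m, k = lhs_shape
    k_rhs, n = rhs_shape
    # Cross-parameter shape: K is shared by lhs[M, K] and rhs[K, N].
    if k != k_rhs:
        errors.append(f"shape mismatch: lhs K={k} but rhs K={k_rhs}")
    # Iteration space: every extent and tile size must be positive.
    if min(m, n, k, k_rhs, tile_m, tile_k, tile_n) <= 0:
        errors.append("degenerate shape or tile size (must be > 0)")
    # Memory budget: one A tile plus one B tile must fit in shared memory.
    smem = (tile_m * tile_k + tile_k * tile_n) * bytes_per_elem
    if smem > smem_limit:
        errors.append(f"smem {smem}B exceeds {smem_limit}B limit")
    return errors
```

Running the checks on the host means that a bad shape is rejected with a readable message before any kernel is dispatched.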

Chapter 3

Highlight 3: Support for Heterogeneous Computing

Write once, run everywhere: NVIDIA, AMD, DSA & multi-device

CroqTile | Chapter 3 — Heterogeneous Computing

Same Language, Multiple Targets.

-t cute -arch=sm_90a: NVIDIA H800 / H100, Hopper SM90a (PTX + SASS)
-t cute -arch=sm_80: NVIDIA A100, Ampere SM80 (PTX + SASS)
-t hip -arch=gfx1030: AMD Radeon RX 6900 XT, RDNA2 · gfx1030
-t dsa_x -arch dsa_arch_y: custom DSA, pluggable backend architecture
Note: The compiler lowers CroqTile IR to each backend's native ISA. Minor source changes are required per target.

Multi-Device Programming

matmul [M, N], partitioned by parallel-by mpi:
Node (0,0): M[0:M/2] × N[0:N/2], GPU 0 · block kernel
Node (0,1): M[0:M/2] × N[N/2:N], GPU 1 · block kernel
Node (1,0): M[M/2:M] × N[0:N/2], GPU 2 · block kernel
Node (1,1): M[M/2:M] × N[N/2:N], GPU 3 · block kernel
Little boilerplate for heterogeneous computing:
Kernel launch: the compiler generates host dispatch
Type conversion & alignment: handled automatically
Data partitioning: parallel-by mpi splits work across ranks
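The 2×2 partitioning shown for the four nodes amounts to the slice arithmetic below. This Python sketch only illustrates the index ranges; `partition_2d` is a hypothetical helper, not what `parallel-by mpi` expands to.

```python
def partition_2d(m, n, grid_rows, grid_cols):
    """Split an [M, N] output into a grid of blocks.

    Returns {(row, col): ((m_start, m_end), (n_start, n_end))} with half-open ranges.
    """
    parts = {}
    for r in range(grid_rows):
        for c in range(grid_cols):
            rows = (r * m // grid_rows, (r + 1) * m // grid_rows)
            cols = (c * n // grid_cols, (c + 1) * n // grid_cols)
            parts[(r, c)] = (rows, cols)
    return parts
```

For M = 8, N = 6 on a 2×2 grid, node (1, 0) receives rows 4 to 8 and columns 0 to 3, which corresponds to M[M/2:M] × N[0:N/2] in the diagram.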

Chapter 4

Highlight 4: Born for Agentic AI Programming

Superior context engineering + superior harness engineering

CroqTile | Chapter 4 — AI Programming

AI Tuning Convergence Comparison
NVIDIA H800 PCIe · Template-Free · Same Agent

Same AI agent, same hardware, same budget.
Only the language (DSL) changes.

matmul FP16→FP32 · 16384³

Convergence (running-best TFLOPS): CroqTile 486, Triton 384, TileLang 343, Helion 318, CUDA 162, CuTe-DSL 27
cuBLAS baseline: 420 TFLOPS · CroqTile = 115% of vendor

Blockscale GEMM E4M3→FP32 · 8192³

Convergence (running-best TFLOPS): CroqTile 711, TileLang 408, Triton 298, Helion 167
cuBLAS baseline: 460 TFLOPS · CroqTile = 155% of vendor library · 6 iterations only

matmul FP16→FP32 · 16384×16384×512

Convergence (running-best TFLOPS): CroqTile 360, Triton 355, Helion 288, TileLang 260, CUDA 169
cuBLAS baseline: 403 TFLOPS · thin-K shape (memory-bound)

matmul FP16→FP32 · 16416³ (non-aligned)

Convergence (running-best TFLOPS): Triton 345, Helion 330, TileLang 329, CroqTile 300, CUDA 160
cuBLAS baseline: 413 TFLOPS · non-power-of-2 shape (ongoing tuning)
Y-axis = running-best TFLOPS, dashed line = cuBLAS baseline.
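A running-best curve is the cumulative maximum of the per-iteration measurements. The Python sketch below shows how such a curve, and the first iteration that beats a baseline, can be computed. The function names are hypothetical and the sample numbers in the usage note are made up for illustration; they are not data from the slides.

```python
from itertools import accumulate


def running_best(samples):
    """Cumulative maximum: the best throughput seen up to each iteration."""
    return list(accumulate(samples, max))


def first_iteration_above(samples, baseline):
    """1-based iteration at which the running best first exceeds baseline, or None."""
    for iteration, best in enumerate(running_best(samples), start=1):
        if best > baseline:
            return iteration
    return None
```

With the made-up measurements `[100, 90, 150, 140, 200]`, the running best is `[100, 100, 150, 150, 200]`, and a baseline of 145 is first exceeded at iteration 3. A regression in one iteration (90 after 100) never lowers the curve, which is why the plots are monotone.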
CroqTile | Chapter 4 — AI Tuning Results

CroqTile + AI-Tuning = Production-Level Performance

CroqTile AI-tuned wins 84% of 95 shapes.
Average speedup +16.7% over cuSPARSELt.
Panels: SPMM sweeps over M=N=K, M, N, K, M=N, M=K, and N=K.

Why CroqTile Fits AI Tuning

Same persistent warp-specialized GEMM, measured token count:
CroqTile: 36 LOC / 303 tokens
Triton: 80 LOC / 449 tokens
CUDA + CuTe: 182 LOC / 1530 tokens
CUTLASS: 280 LOC / 2350 tokens
Implication: for the same 100 iterations, CroqTile uses ~70K tokens while CUDA uses ~350K. CroqTile can run 5× more iterations, or use a smaller model.
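The 5× figure can be checked against the measured token counts with a few lines of Python. The per-iteration costs of roughly 700 and 3,500 tokens used in the usage note are derived from the slide's totals (~70K and ~350K over 100 iterations); they are larger than the kernel token counts, presumably because each iteration carries more than the kernel text, which is an assumption since the slides do not break the totals down.

```python
# Measured token counts for the same persistent warp-specialized GEMM.
TOKENS = {"CroqTile": 303, "Triton": 449, "CUDA + CuTe": 1530, "CUTLASS": 2350}


def relative_cost(name, baseline="CroqTile"):
    """How many times more tokens `name` needs than `baseline` for the kernel."""
    return TOKENS[name] / TOKENS[baseline]


def iterations_within(budget_tokens, tokens_per_iteration):
    """Whole tuning iterations that fit into a fixed token budget."""
    return budget_tokens // tokens_per_iteration
```

`relative_cost("CUDA + CuTe")` is about 5.05, in line with the 5× claim. With a budget of 350,000 tokens, `iterations_within(350_000, 3500)` gives 100 iterations and `iterations_within(350_000, 700)` gives 500.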
CroqTile: the same DMA transfer is 1 line of code, and the compiler generates the barrier
Buffer: shared f16 [M, K] lhs_s;
DMA: tma.copy.swiz<128> src => lhs_s;
Barrier / fence: inserted automatically by the compiler
AI tuning actions: number of code-change sites
Action                        CroqTile   CUDA
Change tile size (WARP_M/N)   1 site     5 sites
Change swizzle mode           1 site     3 sites
Change pipeline stages        1 site     4 sites
Change data type (f16→f8)     1 site     7 sites
Add warp specialization       2 sites    6 sites
CUDA: errors appear only at runtime
Tile is not divisible → device hang
Shared memory limit exceeded → silent launch failure
Swizzle mismatch → wrong results
mbarrier error → deadlock
CroqTile: full compile-time diagnostics
error: tile M=96 not divisible by WARP_M=64
error: smem 49408B exceeds 48KB limit
error: swiz<128> requires 128B-aligned
error: mma.row requires M%64==0
353 compile checks
1,319 runtime asserts
3–8 s per iteration with CroqTile
30–90 s per iteration with CUDA

Thank You!!

CroqTile — Fewer Lines. Safer Kernels. AI-Native.

CroqTile GitHub
Main repository: github.com/LancerLab/croqtile (A Next-Gen Kernel Programming DSL for Maximizing Productivity)
🌐 croqtile-website: lancerlab.github.io/croqtile-website
📖 croqtile-tutorial: lancerlab.github.io/croqtile-tutorial
🎛️ croqtile-tuner: lancerlab.github.io/croqtile-tuner
🤖 croqtile-skills: AI agent skills for CroqTile DSL development