CroqTile
— by CroqTile Core Team
Overview
The easiest-to-use kernel programming language
Combines the best of its competitors: Triton / CuTe / cuTile / Helion
Chapter 1 — GPU MatMul Anatomy
Chapter 1 — Simplicity + Performance
Lines of code — persistent warp-specialized GEMM kernel
make_block_ptr

| Exposed to user | CroqTile | Triton | TileLang | CuTe | CUDA |
|---|---|---|---|---|---|
| Shape | ✓ | ✓ | ✓ | ✓ | ✓ |
| Coord | ✓ | ✓ | ✓ | ✓ | ✓ |
| Stride | ✓ | ✓ | — | ✓ | ✓ |
| Layout | — | — | — | ✓ | — |
| Offset | — | ✓ | — | — | ✓ |
| Raw pointer | — | ✓ | — | ✓ | ✓ |
tma.copy.swiz<128> src => dst — load or store a tile
CUtensorMap creation, no host-side cuTensorMapEncodeTiled
Lines of code — TMA load + store
Lines of code — MMA operations
.sp and you’re done
mma.row.row.sp — single modifier adds 2:4 structured sparsity
Lines of code — metadata handling only
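To make the 2:4 pattern and its metadata concrete, here is a minimal pure-Python sketch. It assumes the usual meaning of 2:4 structured sparsity: in every group of four consecutive values, at most two are non-zero, so only two values plus their in-group positions need to be stored. The helper names `compress_2_4` and `decompress_2_4` are hypothetical and are not CroqTile APIs; the metadata layout the hardware actually uses may differ.

```python
# Sketch of 2:4 structured sparsity on a single row (hypothetical helpers,
# not CroqTile APIs). Each group of 4 values keeps exactly 2 stored entries.

def compress_2_4(row):
    """Compress a row that already satisfies the 2:4 pattern.

    Returns (values, metadata): two stored values per group of four, plus
    the in-group positions (0..3) of those values.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values, metadata = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        kept = [i for i, v in enumerate(group) if v != 0]
        if len(kept) > 2:
            raise ValueError(f"group at index {start} has more than 2 non-zeros")
        # Pad with zero-valued positions so every group stores exactly 2 entries.
        for i in range(4):
            if len(kept) == 2:
                break
            if i not in kept:
                kept.append(i)
        kept.sort()
        values.extend(group[i] for i in kept)
        metadata.extend(kept)
    return values, metadata


def decompress_2_4(values, metadata):
    """Rebuild the dense row from stored values and position metadata."""
    dense = [0] * (len(values) * 2)
    for k, (v, pos) in enumerate(zip(values, metadata)):
        dense[(k // 2) * 4 + pos] = v
    return dense


row = [0, 5, 0, 7, 1, 0, 0, 0]
values, metadata = compress_2_4(row)
assert decompress_2_4(values, metadata) == row
```

The stored row is half the length of the dense row; the position list is the "metadata" that a sparse MMA needs alongside the values.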
parallel-by — one keyword for all parallelism levels
block, group-4, thread map to GPU & DSA hardware
Parallelism primitives needed
parallel binding + add tile loop
inthreads.async — inner code unchanged
shared event + wait/trigger
CroqTile matches highly-optimized vendor libraries — higher-level syntax, near-zero performance gap
Standalone kernel throughput (TFLOPS, higher is better) — CroqTile vs vendor libraries (Hopper)
CroqTile kernels integrated into SGLang — Hopper GPU
| Metric | CroqTile | Native | Ratio |
|---|---|---|---|
| Prefill (ktokens/s) | 5,498 | 6,200 | 88.7% |
| Decode (tokens/s) | 26 | 28 | 92.9% |
| Metric | CroqTile | Native | Ratio |
|---|---|---|---|
| Prefill (ktokens/s) | 6,500 | 7,400 | 87.8% |
| Decode (tokens/s) | 37 | 39 | 94.9% |
Why Zero-Cost?
Chapter 2 — Compile-Time Safety
No crash, no warning — just wrong numbers or rare hangs.
When dynamic dimensions are involved, the compiler auto-generates host-side assertions — catching errors before GPU dispatch
K in lhs[M,K] and rhs[K,N] must match
These errors produce silent wrong results, hangs, or UB.
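The kind of host-side check described above can be sketched in plain Python. This is only an illustration of the idea, validating dynamic shapes on the host before anything is dispatched to the GPU. The function name `check_matmul_shapes` and the error text are assumptions and do not show what the CroqTile compiler actually emits.

```python
# Hypothetical host-side guard (illustrative only, not compiler output).

def check_matmul_shapes(lhs_shape, rhs_shape):
    """Validate lhs[M, K] @ rhs[K, N] before any kernel launch.

    Returns the output shape (M, N) or raises ValueError on a mismatch.
    """
    if len(lhs_shape) != 2 or len(rhs_shape) != 2:
        raise ValueError("both operands must be 2-D")
    m, k_lhs = lhs_shape
    k_rhs, n = rhs_shape
    if k_lhs != k_rhs:
        # Failing here is cheap and explicit; the same mismatch inside a
        # kernel would read out of bounds or silently compute wrong numbers.
        raise ValueError(f"K mismatch: lhs has K={k_lhs}, rhs has K={k_rhs}")
    return m, n


out_shape = check_matmul_shapes((128, 64), (64, 256))
```

A mismatched call such as `check_matmul_shapes((128, 64), (32, 256))` raises immediately on the host instead of reaching the device.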
Write once, run everywhere — GPU, AMD, DSA & multi-device
Chapter 3 — Heterogeneous Computing
parallel-by mpi splits work across ranks
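One common way to split work across ranks is a contiguous block partition, sketched below in plain Python. This is an assumption about what such a split could look like, not a description of how CroqTile's `parallel-by mpi` is implemented; `rows_for_rank` is a hypothetical helper.

```python
# Hypothetical block partition of rows across ranks (not CroqTile internals).

def rows_for_rank(total_rows, num_ranks, rank):
    """Return the half-open row range [start, stop) owned by `rank`.

    Rows are split as evenly as possible; the first `total_rows % num_ranks`
    ranks each take one extra row.
    """
    if not 0 <= rank < num_ranks:
        raise ValueError("rank out of range")
    base, extra = divmod(total_rows, num_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# 10 rows over 4 ranks: every row is owned by exactly one rank.
ranges = [rows_for_rank(10, 4, r) for r in range(4)]
```

Each rank would then run the same kernel on its own row range, which is what lets the inner code stay unchanged.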
Superior context engineering + superior harness engineering
Chapter 4 — AI Programming
Chapter 4 — AI Tuning Results
Chapter 4 — AI Programming
shared f16 [M, K] lhs_s;
DMA = tma.copy.swiz<128> src => lhs_s;
Barrier / fence = inserted automatically by the compiler
error: tile M=96 not divisible by WARP_M=64
error: smem 49408B exceeds 48KB limit
error: swiz<128> requires 128B-aligned
error: mma.row requires M%64==0
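The first two diagnostics above can be reproduced with a small Python sketch. It only illustrates the checks implied by the messages (tile divisibility, and a shared-memory budget where 48KB is 49152 bytes). The function `check_tile_config` and its parameters are hypothetical and say nothing about how the CroqTile compiler is built.

```python
# Hypothetical config validator mirroring two of the diagnostics shown above.

SMEM_LIMIT_BYTES = 48 * 1024  # 49152 bytes


def check_tile_config(tile_m, warp_m, smem_bytes, smem_limit=SMEM_LIMIT_BYTES):
    """Return a list of error strings; an empty list means the config passes."""
    errors = []
    if tile_m % warp_m != 0:
        errors.append(f"error: tile M={tile_m} not divisible by WARP_M={warp_m}")
    if smem_bytes > smem_limit:
        errors.append(
            f"error: smem {smem_bytes}B exceeds {smem_limit // 1024}KB limit"
        )
    return errors


bad = check_tile_config(tile_m=96, warp_m=64, smem_bytes=49408)
good = check_tile_config(tile_m=128, warp_m=64, smem_bytes=32768)
```

Plain-text, deterministic messages like these are easy for an AI tuning loop to parse and act on, for example by moving M to the next multiple of 64 or shrinking a shared buffer.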
CroqTile — Fewer Lines. Safer Kernels. AI-Native.