CroqTile

The Next-Gen GPU & DSA Language That Achieves 5× Productivity

— by CroqTile Core Team

CroqTile | Overview

What is CroqTile?

The easiest-to-use kernel programming language

⚡ Simplicity + Performance

🛡️ Compile-Time Safety

🔷 Heterogeneous Computing

🤖 Born for AI Agents

Chapter 1

Highlight 1: Simplicity + Performance

Best of all competitors: Triton / CuTe / Cutile / Helion

CroqTile | Chapter 1 — GPU MatMul Anatomy

Demo: Tiled MatMul with Tensor Core & TMA in CroqTile

CroqTile | Chapter 1 — Simplicity + Performance

Simplicity + Performance

🏆 The most intuitive GPU & DSA programming language among all competitors

📐 Least LoC for the same kernel among all competitors

⚡ 40% of equivalent CUDA code — higher abstraction than Triton

Lines of code — persistent warp-specialized GEMM kernel

Language	Lines
CroqTile-Python	30
CroqTile	36
TileLang	70
Triton	80
CUDA + CuTe	182
CUTLASS (CuTeDSL)	280

Why CroqTile Is So Simple & Intuitive

It operates on tensors, which makes programming easy.
It fully supports low-level programming without sacrificing safety.

CroqTile: shape + coord + stride is the complete tensor interface — no raw pointers or manual layout math exposed

Triton: pointer + shape + stride + offset per tensor via make_block_ptr

CuTe: Layout = Shape + Stride — compose hierarchical layouts manually

CUDA: raw pointer + manual index arithmetic at every access

Exposed to user	CroqTile	Triton	TileLang	CuTe	CUDA
Shape	✓	✓	✓	✓	✓
Coord	✓	✓	✓	✓	✓
Stride	✓	✓	—	✓	✓
Layout	—	—	—	✓	—
Offset	—	✓	—	—	✓
Raw pointer	—	✓	—	✓	✓
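
To make the "manual index arithmetic" in the comparison above concrete, here is a minimal Python sketch of the offset calculation that a shape + coord + stride interface performs on the user's behalf. The function names are hypothetical and are not CroqTile APIs; the sketch only illustrates the arithmetic.

```python
def row_major_stride(shape):
    """Strides (in elements) of a densely packed row-major tensor."""
    stride = [1] * len(shape)
    for dim in range(len(shape) - 2, -1, -1):
        stride[dim] = stride[dim + 1] * shape[dim + 1]
    return tuple(stride)


def linear_offset(coord, stride):
    """Flat element offset of `coord` given per-dimension strides."""
    if len(coord) != len(stride):
        raise ValueError("coord and stride must have the same rank")
    return sum(c * s for c, s in zip(coord, stride))


shape = (4, 8)                            # 4 rows, 8 columns
stride = row_major_stride(shape)          # (8, 1)
offset = linear_offset((2, 3), stride)    # 2*8 + 3*1 = 19
```

In raw CUDA this arithmetic is written out at every access; with a tensor interface only the shape, coordinate, and stride are visible.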

Data movement is one line — hardware details are hidden

One-line TMA/DMA: tma.copy.swiz<128> src => dst — load or store a tile

Compiler generates TMA descriptor setup (128 bytes, 10+ fields in CUDA)

No manual CUtensorMap creation, no host-side cuTensorMapEncodeTiled

Zero boilerplate — optimization via modifiers, not code rewrites

Swizzle, cache promotion, zfill, async pipeline — all one keyword
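
For readers unfamiliar with swizzling, the sketch below shows a generic XOR swizzle over 128-byte rows in Python. It is an illustration only: the source does not state which pattern swiz<128> or the TMA hardware actually applies, and the function name is hypothetical.

```python
def swizzle_128b(byte_offset):
    """XOR the 16-byte chunk index (bits 4-6) with the row index (bits 7-9).

    Assumed layout: rows of 128 bytes, split into eight 16-byte chunks.
    """
    row = (byte_offset >> 7) & 0x7
    return byte_offset ^ (row << 4)


# Chunk 0 of eight consecutive rows lands in eight different chunk slots,
# so a column-wise read no longer hits the same slot in every row.
column_slots = [(swizzle_128b(r * 128) >> 4) & 0x7 for r in range(8)]
```

Because the XOR only permutes chunks within a row, every byte still has a unique home; the layout changes, not the data.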

Lines of code — TMA load + store

Language	Lines
CroqTile	2
TileLang	4
Triton	12
CUTLASS	20
CUDA	35

One semantic for all tensor core instructions — compiler picks the best

Complete MMA cycle in 5 lines: fill, load×2, compute, store

Encapsulates WGMMA descriptor — no manual register layout or fence

Compile-time dispatch: HMMA (SM70/80) vs WGMMA (SM90) — zero runtime overhead
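
The fill, load×2, compute, store cycle can be sketched in plain Python as a tiled matrix multiply. This is a conceptual model of the five steps, not CroqTile syntax and not how tensor cores are programmed; the function name and the tile size are illustrative assumptions.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (M x K) by b (K x N) one output tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            # 1. fill: zero the accumulator for this output tile
            acc = [[0.0] * min(tile, n - j0) for _ in range(min(tile, m - i0))]
            for k0 in range(0, k, tile):
                # 2. load the lhs tile, 3. load the rhs tile
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                # 4. compute: accumulate the partial product
                for i, a_row in enumerate(a_tile):
                    for kk, a_val in enumerate(a_row):
                        for j, b_val in enumerate(b_tile[kk]):
                            acc[i][j] += a_val * b_val
            # 5. store: write the finished tile back to the output
            for i, acc_row in enumerate(acc):
                c[i0 + i][j0:j0 + len(acc_row)] = acc_row
    return c
```

Each numbered comment corresponds to one step of the cycle; on a GPU the compute step is the part a tensor core instruction replaces.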

Lines of code — MMA operations

Language	Lines
Triton	3
TileLang	3
CroqTile	5
CUTLASS	15
CUDA	25

Sparsity is a modifier, not a rewrite — add `.sp` and you’re done

mma.row.row.sp — single modifier adds 2:4 structured sparsity

Metadata prepack + warp shuffle handled implicitly by compiler

Triton: not supported. CUDA: 30 LOC metadata handling
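
As background on what the metadata handling involves, the Python sketch below compresses a row under 2:4 structured sparsity: every group of four values keeps two, and a small index per kept value records its original position. The helper names are hypothetical, and the storage layout is a simplification rather than the format any specific hardware or library expects.

```python
def compress_2_of_4(row):
    """Keep the 2 largest-magnitude values of each group of 4, plus their positions."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values, meta = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        by_magnitude = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = sorted(by_magnitude[:2])          # positions 0..3, in ascending order
        values.extend(group[i] for i in keep)
        meta.extend(keep)                        # each index fits in 2 bits
    return values, meta


def decompress_2_of_4(values, meta):
    """Rebuild the dense row from kept values and their position metadata."""
    row = [0] * (len(values) * 2)
    for slot, (value, index) in enumerate(zip(values, meta)):
        row[(slot // 2) * 4 + index] = value
    return row
```

The compressed row is half the length of the dense one; the metadata is what a hand-written CUDA kernel must pack and route to the right threads.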

Lines of code — metadata handling only

Language	Lines
CroqTile	1
TileLang	3
Triton	N/A (not supported)
CUTLASS	15
CUDA	30

One keyword for all parallelism — block, warp-group, thread, multi-device

Unified parallel-by — one keyword for all parallelism levels

Level specifiers (block, group-4, thread) map to GPU & DSA hardware

Same code targets NVIDIA, AMD, and custom accelerators

Figure: parallelism primitives needed, compared across CroqTile, TileLang, Triton, CUTLASS, and CUDA.

Every optimization pattern is a first-class construct — zero boilerplate

Each pattern is 1–3 change sites — structural changes localized

Persistent kernel: change parallel binding + add tile loop

Warp spec: wrap in inthreads.async — inner code unchanged

Pipeline: add shared event + wait/trigger

Minimal change sites → easy for both humans and AI to apply

AI can iterate 68 times in one session (see AI-Tuning slides)

Each optimization is additive, not a rewrite
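
The "add tile loop" step of a persistent kernel can be illustrated with a short Python sketch: a fixed set of workers stays resident and each one walks over several output tiles. The static strided assignment shown here is one common scheduling choice and an assumption; the source does not say which schedule CroqTile generates.

```python
def persistent_schedule(num_tiles, num_workers):
    """Map each resident worker to the tiles it processes in its tile loop."""
    if num_tiles < 0 or num_workers <= 0:
        raise ValueError("need num_tiles >= 0 and num_workers > 0")
    return {
        worker: list(range(worker, num_tiles, num_workers))
        for worker in range(num_workers)
    }


schedule = persistent_schedule(num_tiles=10, num_workers=4)
```

With 10 tiles and 4 workers, worker 0 handles tiles 0, 4, and 8, while worker 3 handles tiles 3 and 7; no tile is assigned twice.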

Zero-Cost Abstraction

CroqTile matches highly-optimized vendor libraries — higher-level syntax, near-zero performance gap

Kernel-Level Performance

Standalone kernel throughput (TFLOPS, higher is better) — CroqTile vs vendor libraries (Hopper)

Kernel	Shape	CroqTile (TFLOPS)	Baseline	Baseline (TFLOPS)	CroqTile vs. baseline
GEMM FP16	8192×8192×8192	471.3	PyTorch	447.5	105.3%
GEMM FP8 Blockscale	256×5120×2048	262.7	SGLang CUTLASS	256.9	102.2%
SPMM FP16	4096×8192×8192	630.5	cuSPARSELt	628.5	100.3%
SPMM FP8 E4M3	12288×12288×12288	995.6	cuSPARSELt	952.1	104.6%
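
To relate the throughput figures to wall-clock time, the Python sketch below uses the usual convention of 2·M·N·K floating-point operations for a dense GEMM. That convention is an assumption here; the source does not state how its TFLOPS numbers were counted, and the function names are illustrative.

```python
def gemm_flops(m, n, k):
    """Floating-point operations in a dense GEMM: one multiply and one add per term."""
    return 2 * m * n * k


def runtime_ms(m, n, k, tflops):
    """Kernel time in milliseconds implied by a sustained throughput in TFLOPS."""
    return gemm_flops(m, n, k) / (tflops * 1e12) * 1e3


def percent_of(value, baseline):
    """`value` as a percentage of `baseline`, rounded to one decimal place."""
    return round(100.0 * value / baseline, 1)


fp16_ms = runtime_ms(8192, 8192, 8192, 471.3)   # roughly 2.33 ms
fp16_ratio = percent_of(471.3, 447.5)           # 105.3
```

Under this convention, the GEMM FP16 result corresponds to a kernel time of about 2.3 ms for the 8192×8192×8192 problem.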

End-to-End: Qwen3.5 on SGLang

CroqTile kernels integrated into SGLang — Hopper GPU

Qwen3.5 27B (BF16, Hopper ×1)

Metric	CroqTile	Native	Ratio
Prefill (ktokens/s)	5,498	6,200	88.7%
Decode (tokens/s)	26	28	92.9%

Qwen3.5 27B (FP8, Hopper ×1)

Metric	CroqTile	Native	Ratio
Prefill (ktokens/s)	6,500	7,400	87.8%
Decode (tokens/s)	37	39	94.9%

Why Zero-Cost?

✓ CroqTile kernels are competitive with highly optimized vendor libraries (cuSPARSELt, CUTLASS, PyTorch)

✓ No runtime overhead — static dispatch, no vtable, no interpreter

✓ End-to-end throughput at 88–95% of native serving — with far simpler code

Chapter 2

Highlight 2: Compile-Time Code Safety

Best of all competitors: Triton / CuTe / Cutile / Helion

CroqTile | Chapter 2 — Compile-Time Safety

Why Compile-Time Checking Matters

GPU bugs are silent — wrong shapes or OOB reads produce garbage output, not crashes. No stack trace, no debugger for 10k threads.

CroqTile catches them at compile time — exact file, line, and root cause before any GPU execution.

Why? A standalone compiler (not a Python DSL) owns the full program — shapes, types, and memory hierarchy are statically visible across all passes.

Compile-Time Checks to Prevent Runtime Errors

Shape / type mismatch — incompatible ranks or element types in DMA/MMA

Out-of-bounds access — static offset/index analysis catches overruns

HW constraint violation — MMA config vs. target SM architecture

Unawaited async — futures never consumed are flagged

Parallel nesting rules — illegal constructs at wrong parallel level

In CUDA — all are silent runtime bugs

No crash, no warning — just wrong numbers or rare hangs.

Auto-Generated Runtime Assertions

When dynamic dimensions are involved, the compiler auto-generates host-side assertions — catching errors before GPU dispatch

Cross-parameter shape — shared dims (K in lhs[M,K], rhs[K,N]) must match

Static shape constants — actual shapes vs. declared values

Memory budget — cumulative allocation vs. HW limits

Tiling & iteration space — bounds > 0, non-degenerate strides

Dynamic index bounds — element indices within dimension when only known at launch

Launch limits — thread/grid products respect SM-specific maximums

In CUDA — no safety net

These errors produce silent wrong results, hangs, or UB.
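
A rough idea of what such host-side assertions look like, written as a Python sketch rather than generated compiler output. The function name, its parameters, and the error messages are hypothetical; the 1024-thread block limit is the usual CUDA maximum and is used here only as an example bound.

```python
MAX_THREADS_PER_BLOCK = 1024  # example launch limit; real limits depend on the target


def check_matmul_args(lhs_shape, rhs_shape, threads_per_block, expected_k=None):
    """Raise ValueError before dispatch if the arguments cannot form a valid matmul."""
    m, k_lhs = lhs_shape
    k_rhs, n = rhs_shape
    # Tiling & iteration space: every bound must be positive.
    if min(m, k_lhs, k_rhs, n) <= 0:
        raise ValueError("all dimensions must be positive")
    # Cross-parameter shape: K is shared by lhs[M, K] and rhs[K, N].
    if k_lhs != k_rhs:
        raise ValueError(f"shared dim K mismatch: lhs K={k_lhs}, rhs K={k_rhs}")
    # Static shape constant: actual shape vs. declared value.
    if expected_k is not None and k_lhs != expected_k:
        raise ValueError(f"K={k_lhs} differs from declared K={expected_k}")
    # Launch limits: thread count must fit the hardware maximum.
    if not 0 < threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError(f"threads_per_block={threads_per_block} out of range")
    return m, n, k_lhs
```

All checks run on the host, so a bad argument is reported with a message before any kernel is launched.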

Chapter 3

Highlight 3: Support for Heterogeneous Computing

Write once, run everywhere — NVIDIA GPU, AMD GPU, DSA & multi-device

CroqTile | Chapter 3 — Heterogeneous Computing

Same Language, Multiple Targets.

Compiler flags	Target	Architecture
-t cute -arch=sm_90a	NVIDIA H800 / H100	Hopper SM90a (PTX + SASS)
-t cute -arch=sm_80	NVIDIA A100	Ampere SM80 (PTX + SASS)
-t hip -arch=gfx1030	AMD Radeon RX 6900 XT	RDNA2 · gfx1030
-t dsa_x -arch dsa_arch_y	Custom DSA	Pluggable backend architecture

Note: The compiler lowers CroqTile IR to each backend's native ISA. Minor source changes may still be required per target.

Multi-Device Programming

matmul [M, N] → partitioned by parallel-by mpi

Node	M range	N range	Device
(0,0)	M[0:M/2]	N[0:N/2]	GPU 0 · block kernel
(0,1)	M[0:M/2]	N[N/2:N]	GPU 1 · block kernel
(1,0)	M[M/2:M]	N[0:N/2]	GPU 2 · block kernel
(1,1)	M[M/2:M]	N[N/2:N]	GPU 3 · block kernel

Little boilerplate for heterogeneous computing:
• Kernel launch — compiler generates host dispatch
• Type conversion & alignments — handled automatically
• Data partitioning — parallel-by mpi splits work across ranks
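
The 2×2 partition described above can be reproduced with a few lines of Python. This sketch shows the index arithmetic only, under the assumption that ranks are numbered row-major across the node grid; it is not the code the compiler emits for parallel-by mpi, and the function name is hypothetical.

```python
def partition_2d(m, n, grid_rows, grid_cols, rank):
    """Return the (M range, N range) of the output block owned by `rank`."""
    if not 0 <= rank < grid_rows * grid_cols:
        raise ValueError("rank outside the node grid")
    row, col = divmod(rank, grid_cols)       # row-major rank numbering (assumed)
    m_range = (row * m // grid_rows, (row + 1) * m // grid_rows)
    n_range = (col * n // grid_cols, (col + 1) * n // grid_cols)
    return m_range, n_range


blocks = [partition_2d(8, 8, 2, 2, rank) for rank in range(4)]
```

For M = N = 8, rank 1 owns rows 0 to 4 and columns 4 to 8, matching node (0,1) in the partition.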

Chapter 4

Highlight 4: Born for Agentic AI Programming

Superior context engineering + superior harness engineering

CroqTile | Chapter 4 — AI Programming

AI Tuning Convergence Comparison

NVIDIA H800 PCIe · Template-Free · Same Agent

Same AI agent, same hardware, same budget.
Only the language (DSL) changes.

matmul FP16→FP32 · 16384³

DSL	Best TFLOPS
CroqTile	486
Triton	384
TileLang	343
Helion	318
CUDA	162
CuTe-DSL	27

cuBLAS baseline: 420 TFLOPS · CroqTile = 115% of vendor library

Blockscale GEMM E4M3→FP32 · 8192³

DSL	Best TFLOPS
CroqTile	711
TileLang	408
Triton	298
Helion	167

cuBLAS baseline: 460 TFLOPS · CroqTile = 155% of vendor library · 6 iterations only

matmul FP16→FP32 · 16384×16384×512

DSL	Best TFLOPS
CroqTile	360
Triton	355
Helion	288
TileLang	260
CUDA	169

cuBLAS baseline: 403 TFLOPS · thin-K shape (memory-bound)

matmul FP16→FP32 · 16416³ (non-aligned)

DSL	Best TFLOPS
Triton	345
Helion	330
TileLang	329
CroqTile	300
CUDA	160

cuBLAS baseline: 413 TFLOPS · non-power-of-2 shape (ongoing tuning)

Y-axis = running-best TFLOPS, dashed line = cuBLAS baseline.

CroqTile | Chapter 4 — AI Tuning Results

CroqTile + AI-Tuning = Production-Level Performance

CroqTile AI-tuned wins 84% of 95 shapes.

Average speedup +16.7% over cuSPARSELt.

Why CroqTile Fits AI Tuning

Same persistent warp-specialized GEMM — measured token count

Language	LOC	Tokens
CroqTile	36	303
Triton	80	449
CUDA + CuTe	182	1530
CUTLASS	280	2350

Implication: for the same 100 iterations, CroqTile uses ~70K tokens while CUDA uses ~350K. On the same token budget, CroqTile can run 5× as many iterations, or use a smaller model.
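
The arithmetic behind this implication can be checked with a short Python sketch. It uses only the figures quoted above (100 iterations, ~70K and ~350K tokens, 303 and 1530 tokens per kernel); the variable and function names are illustrative.

```python
ITERATIONS = 100
CROQTILE_TOKENS = 70_000    # approximate total for 100 iterations
CUDA_TOKENS = 350_000       # approximate total for 100 iterations


def iterations_within(budget_tokens, tokens_per_iteration):
    """How many whole tuning iterations fit in a token budget."""
    return int(budget_tokens // tokens_per_iteration)


croqtile_per_iter = CROQTILE_TOKENS / ITERATIONS    # 700 tokens
cuda_per_iter = CUDA_TOKENS / ITERATIONS            # 3500 tokens

# Iterations CroqTile gets out of the budget CUDA needs for 100 iterations.
croqtile_iters = iterations_within(CUDA_TOKENS, croqtile_per_iter)   # 500
kernel_token_ratio = round(1530 / 303, 2)                            # 5.05
```

Both views agree: the per-kernel token count and the per-session totals each differ by a factor of about five.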

CroqTile: the same DMA transfer = 1 line of code & the compiler generates the barrier

✔ Buffer = shared f16 [M, K] lhs_s;
✔ DMA = tma.copy.swiz<128> src => lhs_s;
✔ Barrier / fence = inserted automatically by the compiler

AI tuning actions → code-change count comparison

Tuning action	CroqTile	CUDA
Change tile size (WARP_M/N)	1 site	5 sites
Change swizzle mode	1 site	3 sites
Change pipeline stages	1 site	4 sites
Change data type (f16→f8)	1 site	7 sites
Add warp specialization	2 sites	6 sites

CUDA: errors appear only at runtime

⚠ Tile is not divisible → device hang
⚠ Shared memory limit exceeded → silent launch failure
⚠ Swizzle mismatch → wrong results
⚠ mbarrier error → deadlock

CroqTile: full compile-time diagnostics

✔ error: tile M=96 not divisible by WARP_M=64
✔ error: smem 49408B exceeds 48KB limit
✔ error: swiz<128> requires 128B-aligned
✔ error: mma.row requires M%64==0
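
Two of these diagnostics are simple enough to model in Python. The sketch below is a hypothetical stand-in for the compiler's checks, with the message wording copied from the examples above; the function name and its parameters are assumptions.

```python
def diagnose_tile_config(tile_m, warp_m, smem_bytes, smem_limit_bytes=48 * 1024):
    """Return diagnostics for a tile configuration; an empty list means it passes."""
    errors = []
    if tile_m % warp_m != 0:
        errors.append(f"error: tile M={tile_m} not divisible by WARP_M={warp_m}")
    if smem_bytes > smem_limit_bytes:
        limit_kb = smem_limit_bytes // 1024
        errors.append(f"error: smem {smem_bytes}B exceeds {limit_kb}KB limit")
    return errors


bad = diagnose_tile_config(tile_m=96, warp_m=64, smem_bytes=49408)
good = diagnose_tile_config(tile_m=128, warp_m=64, smem_bytes=49152)
```

Here `bad` holds both messages and `good` is empty: 49152 bytes is exactly 48KB, so it fits, while 49408 bytes is 256 bytes over.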

353 compile checks

1,319 runtime asserts

3–8s per iteration (CroqTile)

30–90s per iteration (CUDA)

Thank You!!

CroqTile — Fewer Lines. Safer Kernels. AI-Native.

Main Repository

github.com/LancerLab/croqtile

A Next-Gen Kernel Programming DSL for Maximizing Productivity

🌐

croqtile-website

lancerlab.github.io/croqtile-website

📖

croqtile-tutorial

lancerlab.github.io/croqtile-tutorial

🎛️

croqtile-tuner

lancerlab.github.io/croqtile-tuner

🤖

croqtile-skills

AI agent skills for CroqTile DSL development
