CroqTile
The Next-Gen GPU & DSA Language Achieves 5x Productivity
Write less code. Catch more bugs. Ship faster GPU kernels.
```
__co__ void spmm(global f8_e4m3 [M, PACKED_K] lhs_packed,
                 global u8 [M, META_COLS] lhs_meta,
                 global f8_e4m3 [N, K] rhs,
                 global f16 [M, N] output) {
  parallel {block_m, block_n} by [cdiv(M, SPMM_WARP_M), cdiv(N, SPMM_WARP_N)] : block {
    shared event full[STAGES], empty[STAGES];
    shared f8_e4m3 [STAGES * SPMM_WARP_M, SPMM_PACKED_TILE_K] lhs_s;
    shared f8_e4m3 [STAGES * SPMM_WARP_N, SPMM_TILE_K] rhs_s;
    parallel p1 by 2 : group-4 {
      inthreads.async (p1 == 0) {
        foreach {iv_k} in [cdiv(K, SPMM_TILE_K)] {
          stage = iv_k % STAGES;
          wait empty[stage];
          tma.copy.async<full[stage]>.swiz<32>
            lhs_packed.subspan(SPMM_WARP_M, SPMM_PACKED_TILE_K).at(block_m, iv_k)
            => lhs_s.subspan(SPMM_WARP_M, SPMM_PACKED_TILE_K).at(stage, 0);
          trigger full[stage];
        }
      }
      inthreads.async (p1 == 1) {
        mc = mma.fill.f16 0.0f;
        foreach {iv_k} in [cdiv(K, SPMM_TILE_K)] {
          wait full[iv_k % STAGES];
          ma = mma.load.swiz<32> lhs_s.at(iv_k % STAGES, 0);
          mb = mma.load.swiz<64> rhs_s.at(iv_k % STAGES, 0);
          me = mma.load lhs_meta_s;
          mma.row.row.sp mc, ma, mb, me;
          trigger empty[iv_k % STAGES];
        }
        mma.store mc, output_s;
        tma.copy output_s => output.subspan(SPMM_WARP_M, SPMM_WARP_N).at(block_m, block_n);
      }
    }
  }
}
```

Easy to Use
- Achieve extreme performance with zero-cost abstractions
- The most intuitive GPU & DSA programming language in its class
- 40% the length of equivalent CUDA code, with higher abstraction than Triton
- Work on tensors and tiles, not raw buffers and pointers
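The kernels on this page derive their launch grids with `cdiv`, a ceiling-division helper: one `{block_m, block_n}` pair per output tile. A minimal plain-Python sketch (the `WARP_M`/`WARP_N` values here are hypothetical example tile sizes, not CroqTile defaults):

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division: tiles of size b needed to cover extent a."""
    return (a + b - 1) // b

M, N = 4096, 4096          # problem shape (example values)
WARP_M, WARP_N = 128, 128  # tile shape (example values)

# One {block_m, block_n} pair per output tile, as in the parallel header above.
grid = (cdiv(M, WARP_M), cdiv(N, WARP_N))
print(grid)  # (32, 32)
```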
```
__co__ void matmul(global f16 [M, K] lhs, global f16 [N, K] rhs,
                   global f16 [M, N] output,
                   global s32 [T] schedule_m, global s32 [T] schedule_n) {
  int total = cdiv(M, WARP_M) * cdiv(N, WARP_N);
  parallel block_id by NUM_SMS : block {
    shared f16 [WARP_M, TILE_K] lhs_s;
    shared f16 [WARP_N, TILE_K] rhs_s;
    shared f16 [WARP_M, WARP_N] out_s;
    foreach {tile} in [cdiv(total, NUM_SMS)] {
      tile_id = tile # block_id;
      if (tile_id < total) {
        int bm = schedule_m.at(tile_id);
        int bn = schedule_n.at(tile_id);
        mc = mma.fill.f16 0.0f;
        foreach {iv_k} in [cdiv(K, TILE_K)] {
          tma.copy.swiz<128> lhs.subspan(WARP_M, TILE_K).at(bm, iv_k) => lhs_s;
          tma.copy.swiz<128> rhs.subspan(WARP_N, TILE_K).at(bn, iv_k) => rhs_s;
          parallel p by 1 : group-4 {
            ma = mma.load.swiz<128> lhs_s;
            mb = mma.load.swiz<128> rhs_s;
            mma.row.row mc, ma, mb;
          }
        }
        mma.store mc, out_s;
        tma.copy out_s => output.subspan(WARP_M, WARP_N).at(bm, bn);
      }
    }
  }
}
```

Lines of code: a persistent, warp-specialized GEMM kernel
Zero-Cost Abstraction
- Match or exceed optimized vendor libraries with near-zero performance gap
- No runtime overhead — static dispatch, no vtable, no interpreter
- End-to-end LLM serving within 5% of native on Hopper
- Compiles to the same optimal PTX as hand-tuned CUDA
Provide easy-to-use, high-level abstractions while eliminating runtime cost as much as possible through compile-time optimizations. You write concise, readable code; the compiler resolves abstractions via static dispatch and shape specialization to produce near-optimal code.
Model-level throughput is slightly below native because certain corner-case shapes and attention fusions still require tuning. Kernel-level performance already matches or exceeds vendor libraries; the remaining gap closes as we tune more operator variants.
Compile-Time Safety
- 353 compile-time diagnostic checks across 7 compiler modules
- Catch tiling mismatches, shape errors, and DMA bugs before they run
- 1,319 runtime assertions guard every kernel in production
- Eliminate entire classes of bugs that plague CUDA development
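As a rough illustration of one kind of diagnostic involved, here is a plain-Python model (hypothetical names, not the actual compiler) of rejecting a tiling mismatch before any device code runs: a copy whose source subspan and destination tile disagree is refused up front rather than corrupting memory at runtime.

```python
def check_tile_copy(src_subspan: tuple, dst_tile: tuple) -> None:
    """Reject a copy whose source and destination tile shapes disagree."""
    if src_subspan != dst_tile:
        raise TypeError(
            f"tiling mismatch: copying subspan {src_subspan} "
            f"into shared tile {dst_tile}"
        )

WARP_M, TILE_K = 128, 64
check_tile_copy((WARP_M, TILE_K), (128, 64))    # shapes agree: accepted
# check_tile_copy((WARP_M, TILE_K), (128, 32))  # would be rejected before launch
```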
Dynamic Shapes
- First-ever dynamic shared memory for GPU & DSA kernel programming
- Symbolic dimensions — write once, run on any shape
- Derived dimensions auto-computed: PACKED_K, META_COLS, grid, SMEM
- No template metaprogramming, no recompilation per shape
K iterations: 64. All derived from symbolic M, K, N.
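To see how derived dimensions can fall out of a symbolic K, here is a plain-Python sketch assuming 2:4 structured sparsity (two of every four values kept, with a 2-bit index per kept value packed into u8 metadata). These formulas are our illustrative assumption, not CroqTile's documented definitions of `PACKED_K` and `META_COLS`.

```python
def derived_dims(K: int) -> dict:
    """Derive packed-value and metadata widths from K under assumed 2:4 sparsity."""
    assert K % 8 == 0, "K must cover whole metadata bytes in this sketch"
    return {
        "PACKED_K":  K // 2,  # 2:4 keeps half the values per row
        "META_COLS": K // 8,  # 2 bits per kept value -> (K/2)*2 bits = K/8 bytes
    }

print(derived_dims(4096))  # {'PACKED_K': 2048, 'META_COLS': 512}
```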
In CUDA, by contrast, all dimensions must be template parameters or runtime arguments, with manual size calculations:
```
// Must manually compute every buffer size
__shared__ float smA[BM][BK];        // size fixed at compile time

// Dynamic SMEM requires manual calc + launch arg
int smem = BM * BK * sizeof(float);
kernel<<<grid, block, smem>>>(...);
```

Born for AI Tuning
- ~500 tokens per kernel — highest source density among GPU DSLs, keeping entire kernels inside AI context windows
- Unified primitives (dma.copy, mma.row) hide error-prone boilerplate, reducing token overhead and AI confusion
- 353 compile-time + 1,319 runtime checks serve as AI guardrails — 200+ autonomous iterations without failure
- Structured, AI-readable compiler diagnostics and well-documented CLI for autonomous tuning agents
- AI achieved 1127 TFLOPS from 671 baseline (+67.9%) in a single session of 68 iterations
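A minimal sketch of such a guarded tuning loop, in plain Python with stand-in hooks for the agent, the compile-time checks, and the benchmark (all names and thresholds here are hypothetical): rejected candidates never reach the GPU, so the loop can only move the best-known result upward.

```python
import random

def propose(params: dict) -> dict:
    """Stand-in for the AI agent: perturb one tuning knob."""
    out = dict(params)
    key = random.choice(list(out))
    out[key] *= random.choice([1, 2])
    return out

def compiles(params: dict) -> bool:
    """Stand-in for compile-time checks: reject tile configs over a SMEM budget."""
    return params["TILE_K"] * params["STAGES"] <= 512

def benchmark(params: dict) -> float:
    """Stand-in for a TFLOPS measurement of a compiled candidate."""
    return float(params["TILE_K"] * params["STAGES"])

best = {"TILE_K": 64, "STAGES": 2}
best_score = benchmark(best)
for _ in range(68):                  # one session of iterations
    cand = propose(best)
    if not compiles(cand):           # guardrail: invalid kernels never run
        continue
    score = benchmark(cand)
    if score > best_score:           # keep only measured improvements
        best, best_score = cand, score
```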
User: continue experiment on branch ai-tune/2026-03-24/gemm_sp_f16.
Read two reference experiment branches:
  branch 1: ai-tune/2026-03-21/gemm_sp_f16
  branch 2: ai-tune/2026-03-21/gemm_sp_e4m3
Branch 2 best can achieve over 1100 TFLOPS; branch 1 best 600+ TFLOPS.
Do not directly copy — optimize stage over stage. KICK OFF EXPERIMENTS NOW.
671 → 1127 TFLOPS (+67.9%) in 68 AI-driven iterations

Try CroqTile today.
Write less code. Catch more bugs. Ship faster GPU kernels.