Open Source · Production Ready

CroqTile: The Next-Gen GPU & DSA Language Achieves 5x Productivity

Write less code. Catch more bugs. Ship faster GPU kernels.

__co__ void spmm(global f8_e4m3 [M, PACKED_K] lhs_packed, global u8 [M, META_COLS] lhs_meta, global f8_e4m3 [N, K] rhs, global f16 [M, N] output) {
  parallel {block_m, block_n} by [cdiv(M, SPMM_WARP_M), cdiv(N, SPMM_WARP_N)] : block {
    shared event full[STAGES], empty[STAGES];
    shared f8_e4m3 [STAGES * SPMM_WARP_M, SPMM_PACKED_TILE_K] lhs_s;
    shared f8_e4m3 [STAGES * SPMM_WARP_N, SPMM_TILE_K] rhs_s;
    parallel p1 by 2 : group-4 {
      inthreads.async (p1 == 0) {
        foreach {iv_k} in [cdiv(K, SPMM_TILE_K)] {
          stage = iv_k % STAGES;
          wait empty[stage];
          tma.copy.async<full[stage]>.swiz<32> lhs_packed.subspan(SPMM_WARP_M, SPMM_PACKED_TILE_K).at(block_m, iv_k) => lhs_s.subspan(SPMM_WARP_M, SPMM_PACKED_TILE_K).at(stage, 0);
          trigger full[stage];
        }
      }
      inthreads.async (p1 == 1) {
        mc = mma.fill.f16 0.0f;
        foreach {iv_k} in [cdiv(K, SPMM_TILE_K)] {
          wait full[iv_k % STAGES];
          ma = mma.load.swiz<32> lhs_s.at(iv_k % STAGES, 0);
          mb = mma.load.swiz<64> rhs_s.at(iv_k % STAGES, 0);
          me = mma.load lhs_meta_s;
          mma.row.row.sp mc, ma, mb, me;
          trigger empty[iv_k % STAGES];
        }
        mma.store mc, output_s;
        tma.copy output_s => output.subspan(SPMM_WARP_M, SPMM_WARP_N).at(block_m, block_n);
      }
    }
  }
}
Sparse GEMM (FP8 E4M3) · Real production kernel from the choreo repository
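The producer/consumer split above rests on two pieces of arithmetic: ceiling division (cdiv) to count tiles, and a STAGES-deep ring buffer synchronized by full/empty events, indexed as stage = iv_k % STAGES. A minimal Python sketch of that indexing (the STAGES, K, and TILE_K values here are illustrative assumptions, not taken from the kernel):

```python
def cdiv(a, b):
    """Ceiling division, as used for tile counts, e.g. cdiv(K, SPMM_TILE_K)."""
    return -(-a // b)

# Model of the STAGES-deep ring buffer: the TMA producer waits on
# empty[stage], fills the tile, and signals full[stage]; the MMA
# consumer waits on full[stage] and signals empty[stage] when done.
STAGES, K, TILE_K = 2, 8192, 64

order = []
for iv_k in range(cdiv(K, TILE_K)):
    order.append(iv_k % STAGES)  # stage = iv_k % STAGES cycles 0, 1, 0, 1, ...
```

With two stages, the copy of tile iv_k+1 overlaps the matrix-multiply of tile iv_k, which is the point of the double buffer.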
Start with less, achieve more

Easy to Use

  • Achieve extreme performance with zero-cost abstractions
  • More intuitive than competing GPU & DSA programming languages
  • About 40% of the code of equivalent CUDA — higher abstraction than Triton
  • Work on tensors and tiles, not raw buffers and pointers
Work on tensors rather than raw buffers
Avoid the complexity of HW programming — TMA, DMA, and tensor cores
Make parallel programs easier to write and reason about
Seamless C++ and Python integration — use CroqPy or link any C++ library
matmul_hilbert.co · CroqTile
__co__ void matmul(global f16 [M, K] lhs, global f16 [N, K] rhs, global f16 [M, N] output, global s32 [T] schedule_m, global s32 [T] schedule_n) {
  int total = cdiv(M, WARP_M) * cdiv(N, WARP_N);
  parallel block_id by NUM_SMS : block {
    shared f16 [WARP_M, TILE_K] lhs_s;
    shared f16 [WARP_N, TILE_K] rhs_s;
    shared f16 [WARP_M, WARP_N] out_s;
    foreach {tile} in [cdiv(total, NUM_SMS)] {
      tile_id = tile # block_id;
      if (tile_id < total) {
        int bm = schedule_m.at(tile_id);
        int bn = schedule_n.at(tile_id);
        mc = mma.fill.f16 0.0f;
        foreach {iv_k} in [cdiv(K, TILE_K)] {
          tma.copy.swiz<128> lhs.subspan(WARP_M, TILE_K).at(bm, iv_k) => lhs_s;
          tma.copy.swiz<128> rhs.subspan(WARP_N, TILE_K).at(bn, iv_k) => rhs_s;
          parallel p by 1 : group-4 {
            ma = mma.load.swiz<128> lhs_s;
            mb = mma.load.swiz<128> rhs_s;
            mma.row.row mc, ma, mb;
          }
        }
        mma.store mc, out_s;
        tma.copy out_s => output.subspan(WARP_M, WARP_N).at(bm, bn);
      }
    }
  }
}
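The kernel above is a persistent kernel: one block per SM, each striding over the whole tile space in the order given by the schedule arrays. A Python sketch of the tile distribution, assuming `tile # block_id` composes the loop index and block id as tile * NUM_SMS + block_id (both that reading of `#` and the sizes below are assumptions for illustration):

```python
def cdiv(a, b):
    return -(-a // b)

# Illustrative sizes, not from the kernel.
M = N = 1024
WARP_M = WARP_N = 128
NUM_SMS = 8

total = cdiv(M, WARP_M) * cdiv(N, WARP_N)  # number of output tiles

# Each persistent block picks up every NUM_SMS-th tile, so all
# `total` tiles are covered exactly once with no launch per tile.
assignment = {b: [] for b in range(NUM_SMS)}
for block_id in range(NUM_SMS):
    for tile in range(cdiv(total, NUM_SMS)):
        tile_id = tile * NUM_SMS + block_id  # assumed meaning of `tile # block_id`
        if tile_id < total:
            assignment[block_id].append(tile_id)

covered = sorted(t for ts in assignment.values() for t in ts)
```

The schedule_m/schedule_n arrays then remap each tile_id to a block row and column (here, along a Hilbert-curve order for locality, per the kernel's name).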
Lines of code — persistent warp-specialized GEMM kernel:

  • CroqTile: 25
  • CroqTile-Python: 25
  • Triton: 64
  • CUTLASS (CuTeDSL): 280
  • CUDA + CuTe: 182

High-level code, vendor-matching speed

Zero-Cost Abstraction

  • Match or exceed optimized vendor libraries with near-zero performance gap
  • No runtime overhead — static dispatch, no vtable, no interpreter
  • End-to-end LLM serving within 5% of native on Hopper
  • Compiles to the same optimal PTX as hand-tuned CUDA
Design Philosophy

Provide easy-to-use, high-level abstractions while eliminating runtime cost as much as possible through compile-time optimizations. You write concise, readable code; the compiler resolves abstractions via static dispatch and shape specialization to produce near-optimal code.

Standalone kernel throughput · Hopper · TFLOPS

  • GEMM FP16 (8192×8192×8192): CroqTile 471.3 vs PyTorch 447.5 · 105.3%
  • GEMM FP8 (256×5120×2048): CroqTile 262.7 vs CUTLASS 256.9 · 102.2%
  • SPMM FP16 (4096×8192×8192): CroqTile 630.5 vs cuSparseLt 628.5 · 100.3%
  • SPMM FP8 E4M3 (12288³): CroqTile 995.6 vs cuSparseLt 952.1 · 104.6%
End-to-end LLM serving · Hopper ×1 · within 5% of native

  • Qwen3.5 27B BF16: Prefill 5.5k vs 6.2k (88.7%) · Decode 26 vs 28 tok/s (92.9%)
  • Qwen3.5 27B FP8: Prefill 6.5k vs 7.4k (87.8%) · Decode 37 vs 39 tok/s (94.9%)

Model-level throughput is slightly below native because certain corner-case shapes and attention fusions still require tuning. Kernel-level performance already matches or exceeds vendors — the remaining gap closes as we tune more operator variants.

Catch bugs before they run

Compile-Time Safety

  • 353 compile-time diagnostic checks across 7 compiler modules
  • Catch tiling mismatches, shape errors, and DMA bugs before they run
  • 1,319 runtime assertions guard every kernel in production
  • Eliminate entire classes of bugs that plague CUDA development
Detect tiling mismatches at compile time
Enable aggressive runtime checks for easier debugging
Eliminate entire classes of DMA-related bugs
Compiler diagnostic checks across modules: 353

  • Early Semantic Analysis (earlysema.cpp, 189): type mismatch in SELECT · invalid parallel nesting · undeclared variable in __co__ scope
  • Semantic Validation (semacheck.cpp, 65): inconsistent shapes for spanned-operation · invalid DMA target memory space · MMA: matrix shapes do not match
  • Type Inference (typeinfer.cpp, 38): cannot infer element type · conflicting types in reduction · incompatible accumulator dtype
  • Shape Inference (shapeinfer.cpp, 33): tiling factor exceeds data size · index out of bounds for dimension · span size mismatch in chunkat
  • Loop Vectorization (loop_vectorize.cpp, 16): vectorize factor not divisible · invalid loop bounds for SIMD
  • Static Assessment (assess.cpp, 4): shape incompatibility proven at compile time
  • Code Generation (codegen + others, 8): target architecture constraint violated · DMA size exceeds 2^32 bytes
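To make the flavor of these diagnostics concrete, here is a hypothetical Python sketch of two of the checks named above. It is an illustration only, not CroqTile's actual implementation; the function names and signatures are invented:

```python
# Hypothetical versions of two diagnostics: "tiling factor exceeds
# data size" (shapeinfer) and "MMA: matrix shapes do not match"
# (semacheck). Both fire before any kernel code ever runs.
def check_tiling(dim_size, tile):
    if tile > dim_size:
        raise ValueError(f"tiling factor {tile} exceeds data size {dim_size}")

def check_mma(lhs_shape, rhs_shape):
    # For a row.row GEMM, lhs [M, K] and rhs [N, K] must agree on K.
    if lhs_shape[1] != rhs_shape[1]:
        raise ValueError("MMA: matrix shapes do not match")

check_tiling(4096, 64)           # ok: tile fits in the dimension
check_mma((128, 64), (256, 64))  # ok: K dimensions agree

try:
    check_mma((128, 64), (256, 32))  # mismatched K: rejected
except ValueError as e:
    msg = str(e)
```

The point of doing this in the compiler rather than at runtime is that a shape bug becomes a build failure instead of a silent wrong answer or device fault.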
Symbolic dimensions, zero boilerplate

Dynamic Shapes

  • First-ever dynamic shared memory for GPU & DSA kernel programming
  • Symbolic dimensions — write once, run on any shape
  • Derived dimensions auto-computed: PACKED_K, META_COLS, grid, SMEM
  • No template metaprogramming, no recompilation per shape
First-class symbolic dimension support
Dynamic shared memory without boilerplate
Static and runtime memory support unified
Pick a shape — derived dims update automatically. Example: M = 4096, K = 4096, N = 4096:

  • PACKED_K = K/2 = 2048
  • META_COLS = K/8 = 512
  • Grid = ⌈M/WM⌉ × ⌈N/WN⌉ = 64×16
  • SMEM = 40KB (auto)
  • K iterations: 64

All derived from symbolic M, K, N.
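These derivations are plain arithmetic over the symbolic dimensions. A Python sketch reproducing them (the tile constants WM = 64, WN = 256, TILE_K = 64 are assumptions chosen to match the 64×16 grid and 64 K-iterations shown; CroqTile computes the equivalents at compile time from the symbolic shape):

```python
def cdiv(a, b):
    return -(-a // b)

# Symbolic shape picked by the user.
M = K = N = 4096

# Assumed tile constants for this sketch.
WM, WN, TILE_K = 64, 256, 64

PACKED_K  = K // 2                 # 2:4-sparse packing halves K
META_COLS = K // 8                 # one metadata byte per 8 K-columns
grid      = (cdiv(M, WM), cdiv(N, WN))
k_iters   = cdiv(K, TILE_K)
```

Change M, K, or N and every derived quantity follows, with no template parameter or recompilation per shape.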

Why competitors can't do this

In CUDA, every dimension must be a compile-time template parameter or a runtime argument, and every buffer size must be computed by hand:

// Must manually compute every buffer size
__shared__ float smA[BM][BK];        // compile-time size only
// Dynamic SMEM requires manual calculation + a launch argument
int smem = BM * BK * sizeof(float);
kernel<<<grid, block, smem>>>(...);
Language meets machine intelligence

Born for AI Tuning

  • ~500 tokens per kernel — highest source density among GPU DSLs, keeping entire kernels inside AI context windows
  • Unified primitives (dma.copy, mma.row) hide error-prone boilerplate, reducing token overhead and AI confusion
  • 353 compile-time + 1,319 runtime checks serve as AI guardrails — 200+ autonomous iterations without failure
  • Structured, AI-readable compiler diagnostics and well-documented CLI for autonomous tuning agents
  • AI achieved 1127 TFLOPS from 671 baseline (+67.9%) in a single session of 68 iterations
User:
continue experiment on branch
ai-tune/2026-03-24/gemm_sp_f16.
Read two reference experiment branches:
  branch 1: ai-tune/2026-03-21/gemm_sp_f16
  branch 2: ai-tune/2026-03-21/gemm_sp_e4m3
Branch 2 best can achieve over 1100 TFLOPS,
branch 1 best 600+ TFLOPS. Do not directly
copy — optimize stage over stage.
KICK OFF EXPERIMENTS NOW.
  • baseline: 671
  • iter001: 759
  • iter016: 772
  • iter023: 811
  • iter068: 1127

671 → 1127 TFLOPS (+67.9%) in 68 AI-driven iterations

AI-Tune Convergence · E4M3 4096×8192×8192 · TFLOPS vs Iteration

Try CroqTile today.

Write less code. Catch more bugs. Ship faster GPU kernels.

  • 62% less code than CUDA+CuTe
  • 353 compile-time checks
  • ~500 tokens/iter — 70% less than CUDA tuning