Open Source · Production Ready

CroqTile: The Next-Gen GPU & DSA Language Achieves 5x Productivity

Write less code. Catch more bugs. Ship faster GPU kernels.

__co__ void spmm(global f8_e4m3 [M, PACKED_K] lhs_packed, global u8 [M, META_COLS] lhs_meta, global f8_e4m3 [N, K] rhs, global f16 [M, N] output) {
  parallel {block_m, block_n} by [cdiv(M, SPMM_WARP_M), cdiv(N, SPMM_WARP_N)] : block {
    shared event full[STAGES], empty[STAGES];
    shared f8_e4m3 [STAGES * SPMM_WARP_M, SPMM_PACKED_TILE_K] lhs_s;
    shared f8_e4m3 [STAGES * SPMM_WARP_N, SPMM_TILE_K] rhs_s;
    parallel p1 by 2 : group-4 {
      inthreads.async (p1 == 0) {
        foreach {iv_k} in [cdiv(K, SPMM_TILE_K)] {
          stage = iv_k % STAGES;
          wait empty[stage];
          tma.copy.async<full[stage]>.swiz<32> lhs_packed.subspan(SPMM_WARP_M, SPMM_PACKED_TILE_K).at(block_m, iv_k) => lhs_s.subspan(SPMM_WARP_M, SPMM_PACKED_TILE_K).at(stage, 0);
          trigger full[stage];
        }
      }
      inthreads.async (p1 == 1) {
        mc = mma.fill.f16 0.0f;
        foreach {iv_k} in [cdiv(K, SPMM_TILE_K)] {
          wait full[iv_k % STAGES];
          ma = mma.load.swiz<32> lhs_s.at(iv_k % STAGES, 0);
          mb = mma.load.swiz<64> rhs_s.at(iv_k % STAGES, 0);
          me = mma.load lhs_meta_s;
          mma.row.row.sp mc, ma, mb, me;
          trigger empty[iv_k % STAGES];
        }
        mma.store mc, output_s;
        tma.copy output_s => output.subspan(SPMM_WARP_M, SPMM_WARP_N).at(block_m, block_n);
      }
    }
  }
}
Sparse GEMM (FP8 E4M3) · Real production kernel from the choreo repository
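The producer/consumer split above rests on two pieces of arithmetic: ceiling division (cdiv) to count tiles, and a STAGES-deep ring buffer synchronized by full/empty events, indexed as stage = iv_k % STAGES. A minimal Python sketch of that indexing (the STAGES, K, and TILE_K values here are illustrative assumptions, not taken from the kernel):

```python
def cdiv(a, b):
    """Ceiling division, as used for tile counts, e.g. cdiv(K, SPMM_TILE_K)."""
    return -(-a // b)

# Model of the STAGES-deep ring buffer: the TMA producer waits on
# empty[stage], fills the tile, and signals full[stage]; the MMA
# consumer waits on full[stage] and signals empty[stage] when done.
STAGES, K, TILE_K = 2, 8192, 64

order = []
for iv_k in range(cdiv(K, TILE_K)):
    order.append(iv_k % STAGES)  # stage = iv_k % STAGES cycles 0, 1, 0, 1, ...
```

With two stages, the copy of tile iv_k+1 overlaps the matrix-multiply of tile iv_k, which is the point of the double buffer.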
Start with less, achieve more

Easy to Use

  • Achieve extreme performance with zero-cost abstractions
  • More intuitive than competing GPU & DSA programming languages
  • About 40% of the code of equivalent CUDA — higher abstraction than Triton
  • Work on tensors and tiles, not raw buffers and pointers
Work on tensors rather than raw buffers
Avoid the complexity of HW programming — TMA, DMA, and tensor cores
Make parallel programs easier to write and reason about
Seamless C++ and Python integration — use CroqPy or link any C++ library
matmul_hilbert.co · CroqTile
__co__ void matmul(global f16 [M, K] lhs, global f16 [N, K] rhs, global f16 [M, N] output, global s32 [T] schedule_m, global s32 [T] schedule_n) {
  int total = cdiv(M, WARP_M) * cdiv(N, WARP_N);
  parallel block_id by NUM_SMS : block {
    shared f16 [WARP_M, TILE_K] lhs_s;
    shared f16 [WARP_N, TILE_K] rhs_s;
    shared f16 [WARP_M, WARP_N] out_s;
    foreach {tile} in [cdiv(total, NUM_SMS)] {
      tile_id = tile # block_id;
      if (tile_id < total) {
        int bm = schedule_m.at(tile_id);
        int bn = schedule_n.at(tile_id);
        mc = mma.fill.f16 0.0f;
        foreach {iv_k} in [cdiv(K, TILE_K)] {
          tma.copy.swiz<128> lhs.subspan(WARP_M, TILE_K).at(bm, iv_k) => lhs_s;
          tma.copy.swiz<128> rhs.subspan(WARP_N, TILE_K).at(bn, iv_k) => rhs_s;
          parallel p by 1 : group-4 {
            ma = mma.load.swiz<128> lhs_s;
            mb = mma.load.swiz<128> rhs_s;
            mma.row.row mc, ma, mb;
          }
        }
        mma.store mc, out_s;
        tma.copy out_s => output.subspan(WARP_M, WARP_N).at(bm, bn);
      }
    }
  }
}
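The kernel above is a persistent kernel: one block per SM, each striding over the whole tile space in the order given by the schedule arrays. A Python sketch of the tile distribution, assuming `tile # block_id` composes the loop index and block id as tile * NUM_SMS + block_id (both that reading of `#` and the sizes below are assumptions for illustration):

```python
def cdiv(a, b):
    return -(-a // b)

# Illustrative sizes, not from the kernel.
M = N = 1024
WARP_M = WARP_N = 128
NUM_SMS = 8

total = cdiv(M, WARP_M) * cdiv(N, WARP_N)  # number of output tiles

# Each persistent block picks up every NUM_SMS-th tile, so all
# `total` tiles are covered exactly once with no launch per tile.
assignment = {b: [] for b in range(NUM_SMS)}
for block_id in range(NUM_SMS):
    for tile in range(cdiv(total, NUM_SMS)):
        tile_id = tile * NUM_SMS + block_id  # assumed meaning of `tile # block_id`
        if tile_id < total:
            assignment[block_id].append(tile_id)

covered = sorted(t for ts in assignment.values() for t in ts)
```

The schedule_m/schedule_n arrays then remap each tile_id to a block row and column (here, along a Hilbert-curve order for locality, per the kernel's name).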
Lines of code — persistent warp-specialized GEMM kernel:

  • CroqTile: 25
  • CroqTile-Python: 25
  • Triton: 64
  • CUTLASS (CuTeDSL): 280
  • CUDA + CuTe: 182

High-level code, vendor-matching speed

Zero-Cost Abstraction

  • Match or exceed optimized vendor libraries with near-zero performance gap
  • No runtime overhead — static dispatch, no vtable, no interpreter
  • End-to-end LLM serving within 5% of native on Hopper
  • Compiles to the same optimal PTX as hand-tuned CUDA
Design Philosophy

Provide easy-to-use, high-level abstractions while eliminating runtime cost as much as possible through compile-time optimizations. You write concise, readable code; the compiler resolves abstractions via static dispatch and shape specialization to produce near-optimal code.

Standalone kernel throughput · Hopper · TFLOPS

  • GEMM FP16 (8192×8192×8192): CroqTile 471.3 vs PyTorch 447.5 · 105.3%
  • GEMM FP8 (256×5120×2048): CroqTile 262.7 vs CUTLASS 256.9 · 102.2%
  • SPMM FP16 (4096×8192×8192): CroqTile 630.5 vs cuSparseLt 628.5 · 100.3%
  • SPMM FP8 E4M3 (12288³): CroqTile 995.6 vs cuSparseLt 952.1 · 104.6%
End-to-end LLM serving · Hopper ×1 · within 5% of native

  • Qwen3.5 27B BF16: Prefill 5.5k vs 6.2k (88.7%) · Decode 26 vs 28 tok/s (92.9%)
  • Qwen3.5 27B FP8: Prefill 6.5k vs 7.4k (87.8%) · Decode 37 vs 39 tok/s (94.9%)

Model-level throughput is slightly below native because certain corner-case shapes and attention fusions still require tuning. Kernel-level performance already matches or exceeds vendors — the remaining gap closes as we tune more operator variants.

Catch bugs before they run

Compile-Time Safety

  • 353 compile-time diagnostic checks across 7 compiler modules
  • Catch tiling mismatches, shape errors, and DMA bugs before they run
  • 1,319 runtime assertions guard every kernel in production
  • Eliminate entire classes of bugs that plague CUDA development
Detect tiling mismatches at compile time
Enable aggressive runtime checks for easier debugging
Eliminate entire classes of DMA-related bugs
Compiler diagnostic checks across modules: 353

  • Early Semantic Analysis (earlysema.cpp, 189): type mismatch in SELECT · invalid parallel nesting · undeclared variable in __co__ scope
  • Semantic Validation (semacheck.cpp, 65): inconsistent shapes for spanned-operation · invalid DMA target memory space · MMA: matrix shapes do not match
  • Type Inference (typeinfer.cpp, 38): cannot infer element type · conflicting types in reduction · incompatible accumulator dtype
  • Shape Inference (shapeinfer.cpp, 33): tiling factor exceeds data size · index out of bounds for dimension · span size mismatch in chunkat
  • Loop Vectorization (loop_vectorize.cpp, 16): vectorize factor not divisible · invalid loop bounds for SIMD
  • Static Assessment (assess.cpp, 4): shape incompatibility proven at compile time
  • Code Generation (codegen + others, 8): target architecture constraint violated · DMA size exceeds 2^32 bytes
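To make the flavor of these diagnostics concrete, here is a hypothetical Python sketch of two of the checks named above. It is an illustration only, not CroqTile's actual implementation; the function names and signatures are invented:

```python
# Hypothetical versions of two diagnostics: "tiling factor exceeds
# data size" (shapeinfer) and "MMA: matrix shapes do not match"
# (semacheck). Both fire before any kernel code ever runs.
def check_tiling(dim_size, tile):
    if tile > dim_size:
        raise ValueError(f"tiling factor {tile} exceeds data size {dim_size}")

def check_mma(lhs_shape, rhs_shape):
    # For a row.row GEMM, lhs [M, K] and rhs [N, K] must agree on K.
    if lhs_shape[1] != rhs_shape[1]:
        raise ValueError("MMA: matrix shapes do not match")

check_tiling(4096, 64)           # ok: tile fits in the dimension
check_mma((128, 64), (256, 64))  # ok: K dimensions agree

try:
    check_mma((128, 64), (256, 32))  # mismatched K: rejected
except ValueError as e:
    msg = str(e)
```

The point of doing this in the compiler rather than at runtime is that a shape bug becomes a build failure instead of a silent wrong answer or device fault.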
Symbolic dimensions, zero boilerplate

Dynamic Shapes

  • First-ever dynamic shared memory for GPU & DSA kernel programming
  • Symbolic dimensions — write once, run on any shape
  • Derived dimensions auto-computed: PACKED_K, META_COLS, grid, SMEM
  • No template metaprogramming, no recompilation per shape
First-class symbolic dimension support
Dynamic shared memory without boilerplate
Static and runtime memory support unified
Pick a shape — derived dims update automatically. Example: M = 4096, K = 4096, N = 4096:

  • PACKED_K = K/2 = 2048
  • META_COLS = K/8 = 512
  • Grid = ⌈M/WM⌉ × ⌈N/WN⌉ = 64×16
  • SMEM = 40KB (auto)
  • K iterations: 64

All derived from symbolic M, K, N.
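These derivations are plain arithmetic over the symbolic dimensions. A Python sketch reproducing them (the tile constants WM = 64, WN = 256, TILE_K = 64 are assumptions chosen to match the 64×16 grid and 64 K-iterations shown; CroqTile computes the equivalents at compile time from the symbolic shape):

```python
def cdiv(a, b):
    return -(-a // b)

# Symbolic shape picked by the user.
M = K = N = 4096

# Assumed tile constants for this sketch.
WM, WN, TILE_K = 64, 256, 64

PACKED_K  = K // 2                 # 2:4-sparse packing halves K
META_COLS = K // 8                 # one metadata byte per 8 K-columns
grid      = (cdiv(M, WM), cdiv(N, WN))
k_iters   = cdiv(K, TILE_K)
```

Change M, K, or N and every derived quantity follows, with no template parameter or recompilation per shape.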

Why competitors can't do this

In CUDA, every dimension must be a compile-time template parameter or a runtime argument, and every buffer size must be computed by hand:

// Must manually compute every buffer size
__shared__ float smA[BM][BK];        // compile-time size only
// Dynamic SMEM requires manual calculation + a launch argument
int smem = BM * BK * sizeof(float);
kernel<<<grid, block, smem>>>(...);
Language meets machine intelligence

Born for AI Tuning

  • ~500 tokens per kernel — highest source density among GPU DSLs, keeping entire kernels inside AI context windows
  • Unified primitives (dma.copy, mma.row) hide error-prone boilerplate, reducing token overhead and AI confusion
  • 353 compile-time + 1,319 runtime checks serve as AI guardrails — 200+ autonomous iterations without failure
  • Structured, AI-readable compiler diagnostics and well-documented CLI for autonomous tuning agents
  • AI achieved 1127 TFLOPS from 671 baseline (+67.9%) in a single session of 68 iterations
User:
continue experiment on branch
ai-tune/2026-03-24/gemm_sp_f16.
Read two reference experiment branches:
  branch 1: ai-tune/2026-03-21/gemm_sp_f16
  branch 2: ai-tune/2026-03-21/gemm_sp_e4m3
Branch 2 best can achieve over 1100 TFLOPS,
branch 1 best 600+ TFLOPS. Do not directly
copy — optimize stage over stage.
KICK OFF EXPERIMENTS NOW.
  • baseline: 671
  • iter001: 759
  • iter016: 772
  • iter023: 811
  • iter068: 1127

671 → 1127 TFLOPS (+67.9%) in 68 AI-driven iterations

AI-Tune Convergence · E4M3 4096×8192×8192 · TFLOPS vs Iteration

Try CroqTile today.

Write less code. Catch more bugs. Ship faster GPU kernels.

  • 62% less code than CUDA+CuTe
  • 353 compile-time checks
  • ~500 tokens/iter — 70% less than CUDA tuning