
Control Flow: Divergence, Concurrency, and Coordination

Every programming language that targets a parallel machine must answer one question: what happens when different threads need to do different things?

On a CPU, the answer is trivial — each core has its own program counter, so each thread runs whatever code it likes. On a GPU, the answer is deeply constrained. The SIMT (Single Instruction, Multiple Thread) execution model groups threads into warps of 32 that share a single instruction pointer. When a branch sends some threads one way and other threads another way, the hardware cannot simply fork — it must serialize the divergent paths and mask inactive threads. This is the fundamental tension of GPU programming languages: the hardware wants uniformity, but real programs need heterogeneity.

CUDA's answer to this is the ordinary conditional branch: if. All threads evaluate the condition; threads where it is false are masked (deactivated); when a warp disagrees, both paths execute one after the other within that warp. This is predicated execution: simple, universal, and sometimes ruinously expensive. A warp that diverges on an if/else runs both sides in sequence, and on each side the masked lanes do no useful work.

But predicated execution is not enough. Consider a matmul kernel where one group of threads should continuously fetch tiles from global memory while another group continuously multiplies tiles on the tensor cores. This is not a data-dependent branch — it is two structurally different programs that happen to share an address space. Trying to express this with if means one program pauses while the other runs. There is no overlap, no pipeline.

Croqtile introduces two additional control-flow primitives to fill this gap:

  • inthreads.async (structured concurrent regions): compile-time partitioning of threads into groups that run different programs simultaneously. The compiler generates separate instruction streams; the hardware schedules them independently.
  • shared event / wait / trigger (inter-region signaling): lightweight synchronization tokens that let concurrent regions communicate safely.

Together, if, inthreads.async, and events cover the full spectrum of control flow in a GPU kernel: data-dependent branching, structural program composition, and inter-program coordination.

Predicated execution with if

Croqtile's if behaves like its C counterpart:

if (tile_id < total_tiles) {
  // body executes only when the condition is true
}

All threads in scope evaluate the condition. Threads where it is false skip the body. Within a single warp, if some threads take the branch and others do not, the hardware serializes the two paths — threads on the skipped side sit idle while the taken side runs, then vice versa. This is warp divergence, and it is the price of runtime flexibility.

When to use if: data-dependent decisions that cannot be resolved at compile time. Bounds checks, partial-tile guards, conditional accumulation. The condition can depend on runtime values — loop indices, input data, tile coordinates.

Cost model: divergence within a warp serializes both paths. Divergence across warps (where all threads in each warp agree) costs nothing — the hardware simply skips the not-taken path. The practical rule: keep if conditions warp-uniform (all 32 threads agree) whenever possible.
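The cost model can be made concrete with a toy calculation. The sketch below is plain Python, not Croqtile, and it assumes a deliberately simple model: a warp pays the full cost of every path that at least one of its 32 threads takes. The function names and the cycle counts are made up for illustration.

```python
WARP_SIZE = 32

def warp_branch_cost(predicates, then_cost, else_cost=0):
    """Cycles one warp spends on an if/else under a simple serialization model."""
    assert len(predicates) == WARP_SIZE
    cost = 0
    if any(predicates):        # at least one lane takes the branch
        cost += then_cost
    if not all(predicates):    # at least one lane takes the other side
        cost += else_cost
    return cost

def kernel_branch_cost(predicates, then_cost, else_cost=0):
    """Sum the per-warp cost over consecutive groups of 32 threads."""
    total = 0
    for start in range(0, len(predicates), WARP_SIZE):
        warp = predicates[start:start + WARP_SIZE]
        total += warp_branch_cost(warp, then_cost, else_cost)
    return total

# 64 threads; the condition flips exactly on the warp boundary (warp-uniform).
uniform = [tid < 32 for tid in range(64)]
# 64 threads; even lanes take the branch, odd lanes do not (divergent in every warp).
divergent = [tid % 2 == 0 for tid in range(64)]

uniform_cost = kernel_branch_cost(uniform, then_cost=100, else_cost=100)
divergent_cost = kernel_branch_cost(divergent, then_cost=100, else_cost=100)
print(uniform_cost, divergent_cost)
```

Under this model the warp-aligned split costs 200 warp-cycles and the alternating-lane split costs 400, even though both send half of the threads down each path. That difference is the argument for keeping conditions warp-uniform.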

Structured concurrent regions with inthreads.async

inthreads.async solves a fundamentally different problem than if. Instead of asking "should this thread execute this code?" at runtime, it says "this group of threads runs this program, that group runs that program" at compile time.

parallel p1 by 2 : group-4 {

  inthreads.async (p1 == 0) {
    // program A: only warpgroup 0 compiles and runs this
  }

  inthreads.async (p1 == 1) {
    // program B: only warpgroup 1 compiles and runs this
  }
}

The distinction from if is structural, not just performance:

| | if (predicated execution) | inthreads.async (structured concurrency) |
|---|---|---|
| Resolution | Runtime: every thread evaluates the condition | Compile time: thread assignment is fixed |
| Instruction streams | One program; divergent threads masked | Separate programs per region |
| Execution | Serial within a warp if divergent | Concurrent across warpgroups |
| PL analogy | if/else in any language | Task spawning in structured concurrency (Trio nurseries, Cilk spawn/sync) |
| GPU analogy | SPMD with masking | MPMD within a single kernel launch |

Why "structured"? The regions are lexically scoped — the compiler knows at parse time which threads belong to which region. There is no dynamic spawn, no unbounded concurrency. Each inthreads.async block is a static partition. This is what makes it amenable to compile-time analysis: the compiler can allocate registers differently for each region, emit different instruction schedules, and verify that shared resources are used safely.

The .async modifier. Without .async, inthreads would execute regions sequentially — thread subsets take turns. The .async suffix is the concurrency modifier: it tells the compiler and hardware that the regions may overlap in time. This is analogous to the async keyword in structured concurrency frameworks — it marks a region as independently schedulable.

The figure below shows the effect. The top timeline shows a single warpgroup alternating between DMA and MMA (sequential, no overlap). The bottom shows two warpgroups with inthreads.async — the producer's DMA and the consumer's MMA overlap in time:

[Figure: Uniform vs structured-concurrent execution: sequential alternation vs overlapping regions]

Top: one warpgroup alternates DMA and MMA, so each waits for the other. Bottom: inthreads.async partitions the threads into two concurrent programs, so DMA and MMA overlap. When the two phases take similar time, this roughly halves wall-clock time.
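A back-of-the-envelope model shows where the "roughly half" figure comes from and when it stops holding. The sketch below is plain Python, not Croqtile. It assumes fixed per-tile DMA and MMA times, ideal overlap, and at least two buffer slots so the producer can load the next tile while the consumer works; all names and numbers are illustrative.

```python
def sequential_time(num_tiles, dma_time, mma_time):
    """One warpgroup alternates: load a tile, then multiply it."""
    return num_tiles * (dma_time + mma_time)

def overlapped_time(num_tiles, dma_time, mma_time):
    """Two regions as a two-stage pipeline: fill, steady state, drain."""
    if num_tiles == 0:
        return 0
    steady = (num_tiles - 1) * max(dma_time, mma_time)
    return dma_time + steady + mma_time

balanced_seq = sequential_time(8, 10, 10)
balanced_ovl = overlapped_time(8, 10, 10)
skewed_seq = sequential_time(8, 10, 30)
skewed_ovl = overlapped_time(8, 10, 30)
print(balanced_seq, balanced_ovl, skewed_seq, skewed_ovl)
```

With balanced phases the model gives 160 versus 90 time units, close to a 2x gain. With MMA three times slower than DMA it gives 320 versus 250, because the pipeline can never run faster than its slowest stage.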

The canonical pattern: 1 producer + 1 consumer

The most common use of inthreads.async is the 1P1C (one producer, one consumer) split for matmul:

parallel p1 by 2 : group-4 {

  inthreads.async (p1 == 0) {
    // producer: only warpgroup 0 runs this
    // issue DMA / TMA loads, fill shared memory
  }

  inthreads.async (p1 == 1) {
    // consumer: only warpgroup 1 runs this
    // run MMA on shared memory, accumulate results
  }
}

parallel p1 by 2 : group-4 — Two warpgroups (128 threads each), indexed by p1.

inthreads.async (p1 == 0) — Warpgroup 0 compiles and runs the producer body; warpgroup 1 never sees this code.

inthreads.async (p1 == 1) — Warpgroup 1 runs the consumer body. The two blocks are separate programs sharing an address space.
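The partition is a pure function of the thread index, which is why it can be decided before the kernel runs. The Python sketch below models that mapping. It assumes warpgroups are formed from contiguous thread indices (four warps of 32 threads each, matching group-4), and the role names are illustrative labels rather than Croqtile identifiers.

```python
WARP_SIZE = 32
WARPS_PER_GROUP = 4                       # "group-4": four warps per warpgroup
GROUP_SIZE = WARP_SIZE * WARPS_PER_GROUP  # 128 threads

ROLES = {0: "producer", 1: "consumer"}    # p1 == 0 and p1 == 1

def warpgroup_index(thread_idx):
    """Model of p1: which warpgroup a thread belongs to (assumed contiguous)."""
    return thread_idx // GROUP_SIZE

def role_of(thread_idx):
    return ROLES[warpgroup_index(thread_idx)]

# Count how many of the 256 threads land in each role.
counts = {}
for tid in range(2 * GROUP_SIZE):
    role = role_of(tid)
    counts[role] = counts.get(role, 0) + 1

# Every warp must sit entirely inside one role, so no warp ever diverges.
warp_is_uniform = all(
    len({role_of(tid) for tid in range(w * WARP_SIZE, (w + 1) * WARP_SIZE)}) == 1
    for w in range(2 * WARPS_PER_GROUP)
)
print(counts, warp_is_uniform)
```

Because the role boundary falls on a warpgroup boundary, no warp contains threads from both programs, which is what lets each region have its own instruction stream.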

But sharing an address space is exactly what makes this dangerous. Without coordination, the consumer might read a buffer before the producer has finished writing it. This is where events come in.

Inter-region signaling with events

When inthreads.async creates concurrent regions, those regions need a way to communicate. Croqtile provides events — lightweight synchronization tokens declared in shared memory:

shared event full;
shared event empty;

Events have two operations:

  • trigger name — signal that a condition is met (e.g., "data is ready")
  • wait name — block until the corresponding trigger fires

The producer calls trigger full after writing a tile to signal "data ready." The consumer calls wait full before reading, blocking until the signal arrives. Symmetrically, the consumer triggers empty after finishing its read (the buffer can be reused), and the producer waits on empty before writing the next tile.

This is a credit-based bounded buffer protocol — the same pattern used in operating systems (semaphores), network flow control (TCP window), and hardware (warp barriers). full is the "data available" credit; empty is the "buffer free" credit.
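The handshake can be tried out on a CPU. The sketch below is a Python analogy, not Croqtile: it models each event as a counting semaphore (an assumption about event semantics), with release standing in for trigger and acquire standing in for wait. The single shared slot and the function name are illustrative.

```python
import threading

def run_single_slot(num_tiles):
    full = threading.Semaphore(0)    # "data available" credit
    empty = threading.Semaphore(0)   # "buffer free" credit
    slot = [None]                    # stands in for one shared-memory buffer
    received = []

    def producer():
        for tile in range(num_tiles):
            empty.acquire()          # wait empty
            slot[0] = tile           # write the tile
            full.release()           # trigger full

    def consumer():
        empty.release()              # initial credit: the buffer starts out free
        for _ in range(num_tiles):
            full.acquire()           # wait full
            received.append(slot[0]) # read the tile
            empty.release()          # trigger empty

    workers = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return received

tiles_seen = run_single_slot(16)
print(tiles_seen)
```

The consumer sees every tile exactly once and in order, because the producer cannot overwrite the slot until it holds an empty credit and the consumer cannot read until it holds a full credit.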

Event arrays for multi-stage pipelines

For pipelines with multiple buffered stages, declare event arrays:

shared event full[STAGES], empty[STAGES];

Each physical buffer slot gets its own full/empty pair. The ring index stage = iv_k % STAGES maps the unbounded K iteration to a fixed number of physical slots. With STAGES = 4, the producer can run up to four tiles ahead of the consumer before it blocks on wait empty.

Bootstrap protocol

The consumer must seed the empty credits before the K-loop starts:

foreach {s} in [STAGES] {
  trigger empty[s];
}

Without this bootstrap, the producer's first wait empty[0] blocks forever — a deadlock, not a mysterious MMA bug. This is a common pitfall: every bounded buffer protocol requires initial credits.
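The effect of the bootstrap is easy to reproduce off the GPU. The sketch below is a Python analogy under the same assumption as before (events modeled as counting semaphores); the timeout exists only so the missing-credit case returns instead of hanging, and every name in it is illustrative.

```python
import threading

def run_ring(num_tiles, stages, seed_credits=True, timeout=0.5):
    full = [threading.Semaphore(0) for _ in range(stages)]
    empty = [threading.Semaphore(0) for _ in range(stages)]
    slots = [None] * stages
    received, stalled_at = [], []

    def producer():
        for k in range(num_tiles):
            s = k % stages                        # ring index
            if not empty[s].acquire(timeout=timeout):
                stalled_at.append(k)              # a real kernel would hang here
                return
            slots[s] = k
            full[s].release()                     # trigger full[s]

    def consumer():
        if seed_credits:
            for s in range(stages):
                empty[s].release()                # bootstrap: trigger empty[s]
        for k in range(num_tiles):
            s = k % stages
            if not full[s].acquire(timeout=timeout):
                return                            # nothing ever arrived
            received.append(slots[s])
            empty[s].release()                    # trigger empty[s]

    workers = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return received, stalled_at

ok_received, ok_stalled = run_ring(num_tiles=12, stages=4)
bad_received, bad_stalled = run_ring(num_tiles=12, stages=4, seed_credits=False)
print(ok_received, ok_stalled, bad_received, bad_stalled)
```

With the credits seeded, all twelve tiles arrive in order. Without them, the producer stalls on tile 0 and the consumer never receives anything, which is the deadlock described above.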

Chapter 6 develops the full double-buffered and multi-stage pipeline kernels that put these primitives to work. The examples there compose inthreads.async, events, swap, and mma.commit into complete, runnable matmul pipelines.

A 1P1C matmul skeleton

Here is how inthreads.async structures a Hopper matmul. Event-based synchronization is omitted here; Chapter 6 adds the full pipeline protocol. Focus on the program structure:

__co__ void matmul(global f16 [M, K] lhs, global f16 [N, K] rhs, global f16 [M, N] output) {
  parallel {block_m, block_n} by [cdiv(M, WARP_M), cdiv(N, WARP_N)] : block {
    shared f16 [WARP_M, TILE_K] lhs_load_s;
    shared f16 [WARP_N, TILE_K] rhs_load_s;
    shared f16 [WARP_M, WARP_N] output_s;

    parallel p1 by 2 : group-4 {

      inthreads.async (p1 == 0) {
        foreach {iv_k} in [cdiv(K, TILE_K)] {
          dma.copy lhs.subspan(WARP_M, TILE_K).at(block_m, iv_k) => lhs_load_s;
          dma.copy rhs.chunkat(block_n, iv_k) => rhs_load_s;
        }
      }

      inthreads.async (p1 == 1) {
        mc = mma.fill.f16 0.0f;
        foreach {iv_k} in [cdiv(K, TILE_K)] {
          foreach {iv_warp} in [cdiv(TILE_K, WARP_K)] {
            ma = mma.load lhs_load_s.chunkat(_, iv_warp);
            mb = mma.load rhs_load_s.chunkat(_, iv_warp);
            mma.row.row mc, ma, mb;
          }
        }
        mma.store mc, output_s;
        dma.copy output_s => output.subspan(WARP_M, WARP_N).at(block_m, block_n);
      }
    }
  }
}

Producer foreach — Walks K with cdiv(K, TILE_K) steps; warpgroup 0 issues dma.copy into shared memory.

Consumer mma path — Warpgroup 1 never touches those DMAs; it reads shared memory, accumulates in mc, and writes the result.

Missing coordination — Both sides loop over K independently. The consumer assumes each K-slab is ready when it reads. Making that assumption correct requires events (Chapter 6).

Persistent scheduling and the if guard

In Chapters 3–4, the grid grew with the problem: roughly one block per output tile. For large matrices that means a large grid, and the last wave of blocks often leaves SMs partially idle (tail underutilization).
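The size of the tail follows from simple arithmetic. The Python sketch below assumes one resident block per SM unless stated otherwise and treats every block as taking the same time; the function name and the SM and tile counts are illustrative, not measurements.

```python
def cdiv(a, b):
    """Ceiling division, as used in the kernels in this chapter."""
    return (a + b - 1) // b

def tail_utilization(total_tiles, num_sms, blocks_per_sm=1):
    """Return (number of waves, fraction of block slots busy in the last wave)."""
    slots = num_sms * blocks_per_sm
    waves = cdiv(total_tiles, slots)
    last_wave_blocks = total_tiles - (waves - 1) * slots
    return waves, last_wave_blocks / slots

waves_a, util_a = tail_utilization(total_tiles=110, num_sms=100)
waves_b, util_b = tail_utilization(total_tiles=200, num_sms=100)
print(waves_a, util_a, waves_b, util_b)
```

With 110 tiles on 100 SMs, the second wave keeps only 10% of the SMs busy, yet it takes as long as a full wave. With 200 tiles the last wave is full and there is no tail.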

A persistent kernel fixes the launch size (often near the SM count) and lets each block iterate over multiple tiles:

__co__ void matmul(global f16 [M, K] lhs, global f16 [N, K] rhs, global f16 [M, N] output) {
  int total_tiles = cdiv(M, WARP_M) * cdiv(N, WARP_N);

  parallel block_id by NUM_SMS : block {
    shared f16 [WARP_M, TILE_K] lhs_load_s;
    shared f16 [WARP_N, TILE_K] rhs_load_s;
    shared f16 [WARP_M, WARP_N] output_s;

    foreach {tile_iter} in [cdiv(total_tiles, NUM_SMS)] {
      tile_id = tile_iter # block_id;

      if (tile_id < total_tiles) {
        block_m = tile_id / cdiv(N, WARP_N);
        block_n = tile_id % cdiv(N, WARP_N);

        mc = mma.fill.f16 0.0f;
        foreach {iv_k} in [cdiv(K, TILE_K)] {
          dma.copy lhs.subspan(WARP_M, TILE_K).at(block_m, iv_k) => lhs_load_s;
          dma.copy rhs.subspan(WARP_N, TILE_K).at(block_n, iv_k) => rhs_load_s;

          foreach {iv_warp} in [cdiv(TILE_K, WARP_K)] {
            parallel p by 1 : group-4 {
              ma = mma.load lhs_load_s.chunkat(_, iv_warp);
              mb = mma.load rhs_load_s.chunkat(_, iv_warp);
              mma.row.row mc, ma, mb;
            }
          }
        }
        mma.store mc, output_s;
        dma.copy output_s => output.subspan(WARP_M, WARP_N).at(block_m, block_n);
      }
    }
  }
}

parallel block_id by NUM_SMS : block — Fixed worker count.

tile_id = tile_iter # block_id — Composes iteration with block index to stripe across tiles.

if (tile_id < total_tiles) — The if guard: a runtime predicate that skips the body for padding iterations. This is exactly the use case if is designed for — a data-dependent decision, not a structural partition.

[Figure: Persistent kernel: striped tiles, block colors, and if guard for padding]
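The striping can be checked by enumerating it on the host. The Python sketch below assumes that tile_iter # block_id evaluates to tile_iter * NUM_SMS + block_id; that reading of the # operator is an assumption made for illustration, as are the function name and the small sizes.

```python
def cdiv(a, b):
    return (a + b - 1) // b

def stripe(total_tiles, num_sms):
    """Return ({block_id: [tile_id, ...]}, number of padding iterations)."""
    assignment = {block_id: [] for block_id in range(num_sms)}
    padding = 0
    for tile_iter in range(cdiv(total_tiles, num_sms)):
        for block_id in range(num_sms):
            # Assumed meaning of: tile_id = tile_iter # block_id
            tile_id = tile_iter * num_sms + block_id
            if tile_id < total_tiles:      # the if guard
                assignment[block_id].append(tile_id)
            else:
                padding += 1               # guarded-out iteration
    return assignment, padding

assignment, padding = stripe(total_tiles=10, num_sms=4)
covered = sorted(t for tiles in assignment.values() for t in tiles)
print(assignment, padding)
```

For 10 tiles on 4 blocks, block 0 handles tiles 0, 4, and 8, every tile is covered exactly once, and two iterations are skipped by the guard. The condition has the same value for every thread in a block, so under this model the guard causes no warp divergence.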

Data-dependent vs persistent grids

| Aspect | One block per tile | Persistent (NUM_SMS blocks) |
|---|---|---|
| Grid size | Grows with problem | Fixed |
| Tail utilization | Last wave may leave SMs idle | All SMs stay busy |
| Extra constructs | Minimal | total_tiles, tile_iter # block_id, if |
| Complexity | Lower | Higher |

parallel.async: host-level concurrency

Everything above runs inside a kernel. Sometimes you need concurrency at the host level: launch a grid without blocking the CPU.

parallel.async {px, py} by [grid_m, grid_n] : block {
  // kernel body
}

parallel.async returns control to the host immediately — the kernel is enqueued but the host does not wait for completion. This is the Croqtile equivalent of cudaLaunchKernel with a non-default stream.

This is host orchestration, orthogonal to in-kernel control flow. It does not replace inthreads.async for thread partitioning or if for runtime predicates — it decides when and where a grid runs relative to other grids.
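For readers more familiar with CPU-side concurrency, a loose analogy may help. The Python sketch below is not Croqtile and involves no GPU: it treats a thread pool as the device, submit as a non-blocking launch, and result as the explicit synchronization point. The fake_kernel function and the durations are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_kernel(name, duration):
    """Stand-in for a grid: it just takes some time and reports its name."""
    time.sleep(duration)
    return name

with ThreadPoolExecutor(max_workers=2) as pool:
    start = time.perf_counter()
    grid_a = pool.submit(fake_kernel, "grid_a", 0.2)   # returns immediately
    grid_b = pool.submit(fake_kernel, "grid_b", 0.2)   # returns immediately
    launch_latency = time.perf_counter() - start       # host was not blocked
    results = [grid_a.result(), grid_b.result()]       # host waits only here

print(results, launch_latency)
```

Both submissions return well before either piece of work finishes, and the host blocks only when it asks for the results. That is the shape of a non-blocking launch, independent of how the work is partitioned inside each grid.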

Early return from parallel

Sometimes a thread needs to return early from a parallel block. The yield keyword does this by emitting a plain return; statement in the generated device code.

parallel p by BLK_CNT : block {
  parallel q by THR_CNT : thread {
    if (cond) yield;
    ...
  }
}

Warning

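The yield keyword currently performs no compile-time or runtime checks. If a thread returns early while other threads still wait on a barrier or event it was expected to reach, the kernel can hang, so verify synchronization safety by hand.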

New syntax

| Syntax | Meaning |
|---|---|
| if (expr) { ... } | Predicated execution: runtime conditional, divergent threads masked |
| inthreads.async (condition) | Structured concurrent region: compile-time thread partitioning |
| shared event name | Declare a synchronization token in shared memory |
| shared event name[N] | Declare N synchronization tokens |
| trigger name | Signal that a condition is met |
| wait name | Block until the corresponding trigger fires |
| tile_id = tile_iter # block_id | Compose indices for tile striping |
| int total_tiles = expr | Local integer variable |
| parallel.async ... : block | Non-blocking kernel launch |
| yield | Early return from parallel block |

Chapter summary

| Concept | Primitive | When to use |
|---|---|---|
| Predicated execution | if | Data-dependent decisions (bounds, conditions) |
| Structured concurrency | inthreads.async | Compile-time thread partitioning (producer/consumer, heterogeneous roles) |
| Inter-region signaling | shared event / wait / trigger | Coordination between concurrent regions |
| Host concurrency | parallel.async / stream s | Multi-kernel overlap, non-blocking launch |
| Persistent scheduling | if + foreach + # | Fixed grid size, tile striping with padding guard |

The 1P1C skeleton above is incomplete: without wait / trigger, the consumer can read before the producer has finished writing. Chapter 6 adds the full synchronization protocols — swap for single-schedule double buffering, events for multi-warpgroup pipelines — so the pipeline runs safely and at full throughput.