Skip to content

Synchronization in Practice: Pipelines, Buffers, and Events

Chapter 5 introduced three control-flow primitives: if for predicated execution, inthreads.async for structured concurrent regions, and events (shared event / wait / trigger) for inter-region signaling. This chapter puts them to work. You will see two progressively complex kernel patterns, each demonstrating how the primitives compose to solve real synchronization problems.

The first pattern — double buffering with swap — uses a single thread group that interleaves loading and computing within one program. No inthreads.async, no events: just two buffer handles and a rotation. The second pattern — the full 1P1C event pipeline — splits loading and computing into separate concurrent programs with inthreads.async and coordinates them with event arrays.

Both patterns solve the same underlying problem: you cannot read a buffer while someone is writing to it. They differ in how they structure the solution.

Double buffering with swap

Give the K-loop two logical buffers. While the math drains buffer 0, DMA fills buffer 1 with the next tile. After the math step, swap the handles: what was "next" becomes "current," and the freed slot is ready for the following load.

Croqtile spells this with dma.copy.async (non-blocking copy), dma.any (a placeholder future), swap (exchange future handles), and a three-phase loop:

__co__ auto matmul(s32 [M, K] lhs, s32 [K, N] rhs) {
  s32 [lhs.span(0), rhs.span(1)] output;

  parallel {px, py} by [8, 16] : block
    parallel {qx, qy} by [16, 16] : thread {

    with tile_k in 16 {
      // Prologue: start loading tile 0
      lf0 = dma.copy.async lhs.chunkat(px, tile_k) => shared;
      rf0 = dma.copy.async rhs.chunkat(tile_k, py) => shared;

      // Placeholder futures for buffer 1
      lf1 = dma.any;
      rf1 = dma.any;

      // Steady state: load next tile while computing on current
      foreach tile_k(1:) {
        lf1 = dma.copy.async lhs.chunkat(px, tile_k) => shared;
        rf1 = dma.copy.async rhs.chunkat(tile_k, py) => shared;

        wait lf0, rf0;

        foreach k in [256 / #tile_k]
          output.at(px#qx, py#qy) += lf0.data.at(qx, k) * rf0.data.at(k, qy);

        swap(lf0, lf1);
        swap(rf0, rf1);
      }

      // Epilogue: compute on the last loaded tile
      foreach k in [256 / #tile_k]
        output.at(px#qx, py#qy) += lf0.data.at(qx, k) * rf0.data.at(k, qy);
    }
  }

  return output;
}

The three phases

Prologue. Issue loads for tile 0 into lf0/rf0. No compute yet — the first tile must land before anything can multiply it.

Steady state. For each subsequent tile: start loads into lf1/rf1, compute on lf0/rf0 from the previous iteration, then swap so names track the active buffers. New copies land in lf1/rf1 before the compute reads lf0/rf0, so you never read a buffer being overwritten.

Epilogue. After the last swap, lf0/rf0 hold the final tile; one more compute pass drains them.

swap: names, not bytes

swap(lf0, lf1) exchanges future handles — the Croqtile-level names that refer to buffers. Shared-memory contents stay where the hardware placed them; only the names rotate. In CUDA, the same idiom is often a ^ 1 buffer index or a boolean phase variable. For triple buffering, rotate(f0, f1, f2) cycles three handles in one step.

with tile_k in 16

Opens a scoped region and binds tile_k as a tile axis with extent 16. Inside the block, tile_k is the chunk index for chunkat along K, and #tile_k is 16.

dma.any: placeholder futures

dma.any creates a future that does not yet represent a transfer. It gives the type system something to swap against on the first steady-state iteration. Before any use of lf1.data, a real dma.copy has been assigned.

foreach tile_k(1:): sliced iteration

(1:) means tile indices 1, 2, ... through the end. Tile 0 was loaded in the prologue.

__co__ auto return type

__co__ auto matmul(...) lets the compiler infer the return type from return output.

This example uses s32 with scalar accumulation — a simplified style to isolate the swap mechanism. The same pattern applies to FP16/MMA kernels from Chapter 4.

Why you cannot "just overlap" without events

The swap pattern works because one thread group controls both loading and computing — it knows the order. Warp specialization (Chapter 5) puts them on different warpgroups with different program counters. They cannot share a swap schedule; they need a signaling mechanism.

The picture below contrasts a strict load-then-compute staircase with double-buffered overlap: same logical work, less idle time.

Sequential vs double-buffered K-tile timelines (schematic) Sequential vs double-buffered K-tile timelines (schematic)

The full 1P1C event pipeline

This kernel combines inthreads.async (from Chapter 5) with event arrays to build a complete multi-stage pipeline. The producer and consumer run as separate concurrent programs, coordinated entirely through wait / trigger:

#define TILE_M 128
#define TILE_N 128
#define TILE_K 128
#define WARP_M 64
#define WARP_N 64
#define WARP_K 16
#define STAGES 2

__co__ void matmul(global f16 [M, K] lhs, global f16 [N, K] rhs, global f16 [M, N] output) {
  parallel {block_m, block_n} by [cdiv(M, WARP_M), cdiv(N, WARP_N)] : block {
    shared event full[STAGES], empty[STAGES];
    shared f16 [WARP_M, TILE_K] lhs_load_s[STAGES];
    shared f16 [WARP_N, TILE_K] rhs_load_s[STAGES];
    shared f16 [WARP_M, WARP_N] output_s;

    parallel p1 by 2 : group-4 {
      inthreads.async (p1 == 0) {
        foreach {iv_k} in [cdiv(K, TILE_K)] {
          stage = iv_k % STAGES;
          wait empty[stage];
          dma.copy lhs.subspan(WARP_M, TILE_K).at(block_m, iv_k)
            => lhs_load_s[stage];
          dma.copy rhs.chunkat(block_n, iv_k)
            => rhs_load_s[stage];
          trigger full[stage];
        }
      }

      inthreads.async (p1 == 1) {
        mc = mma.fill.f16 0.0f;
        foreach {s} in [STAGES]
          trigger empty[s];
        foreach {iv_k} in [cdiv(K, TILE_K)] {
          stage = iv_k % STAGES;
          wait full[stage];
          foreach {iv_warp} in [cdiv(TILE_K, WARP_K)] {
            ma = mma.load lhs_load_s[stage].chunkat(_, iv_warp);
            mb = mma.load rhs_load_s[stage].chunkat(_, iv_warp);
            mma.row.row mc, ma, mb;
          }
          mma.commit;
          trigger empty[stage];
        }
        mma.store mc, output_s;
        dma.copy output_s => output.subspan(WARP_M, WARP_N).at(block_m, block_n);
      }
    }
  }
}

Walking through the kernel

Ring index. stage = iv_k % STAGES maps the unbounded K iteration to a fixed number of physical buffer slots — double buffering generalized to N buffers.

Producer path. For each iv_k, wait empty[stage] acquires a free slot. The dma.copy lines fill lhs_load_s / rhs_load_s at that stage. Then trigger full[stage] hands the slot to the consumer.

Consumer bootstrap. The loop foreach {s} in [STAGES] { trigger empty[s]; } runs before the K-loop so every stage starts with an empty credit. Without this, the producer blocks forever on its first wait empty — a deadlock.

Consumer path. Each iv_k: wait full[stage] blocks until the producer has filled that slot, then MMA over the tile, mma.commit, and trigger empty[stage] to release the slot for reuse.

mma.commit. Hopper WGMMA overlaps instruction issue and accumulation. mma.commit is the fence that completes one K-slab's contribution to mc before that stage's shared buffer may be reused. Omitting it risks reading stale data — the MMA might still be consuming operands when the producer overwrites the buffer.

Credit flow for one stage

The diagram matches the code: bootstrap grants empty credits; the producer waits on empty, fills, signals full; the consumer waits on full, computes, signals empty. When iv_k wraps modulo STAGES, the same physical stage re-enters the cycle.

Event credit flow for one pipeline stage Event credit flow for one pipeline stage

Debugging tip

If something looks wrong after editing a pipeline, verify event order and trip counts before chasing MMA layout bugs: producer and consumer must use the same cdiv(K, TILE_K) loop bound, and too few stages shifts pressure to wait full when the consumer outruns the producer.

New syntax

Syntax Meaning
dma.copy.async src => dst Non-blocking copy (returns immediately)
dma.any Placeholder future (no transfer in flight yet)
swap(f0, f1) Exchange two future handles without copying data
rotate(f0, f1, f2) Cycle three future handles
with tile_k in N { ... } Scoped tile axis binding with extent N
foreach tile_k(1:) Iterate starting from index 1
mma.commit Fence between pipeline stages for WGMMA
__co__ auto fn(...) Return type inferred from return statement

Summary

Pattern Primitives used When to use
swap double buffering dma.any, swap Single thread group interleaving load and compute
1P1C event pipeline inthreads.async, shared event[], wait, trigger, mma.commit Separate producer/consumer warpgroups with multi-stage pipeline

Both patterns achieve the same goal — overlapping memory and compute — but at different levels of complexity. Start with swap for simpler kernels; graduate to events when you need warp specialization.

The next chapter moves on to hardware-accelerated TMA, swizzled shared layouts, and view / from for irregular access — the same synchronization patterns from this chapter, with richer data movement primitives underneath.