AI and Engineering Instinct: A GPU Porting Exercise

A short note on something from this week. It is more about the process than about the bug or the porting effort.

Some background: We’re gradually moving parts of the MOM6 ocean model onto GPUs. One track uses OpenMP plus do-concurrent offloading, while another is starting to replace specific kernels with AMReX versions. To trust the GPU versions, we want them to reproduce the CPU results exactly, i.e., bitwise-identical.

The OpenMP plus do-concurrent build reproduces CPU results exactly; the AMReX branch initially didn’t. On our standard, idealized double_gyre case, the GPU run crashed a few steps in: energy and sea level drifted, then the reproducing sum overflowed. The CPU build of the AMReX branch was fine, so it wasn’t the math in the kernels. It was something about how data was moving between the host and the device when we ran on the GPU.

I gave Claude Code access to the two sandboxes, the AMReX one and the do-concurrent one, and let it dig. This part was genuinely useful, and much faster than I would have been. It ran the case, reproduced the crash, read the compiler’s data-movement diagnostics, and traced the problem to a host/device coherence issue under -gpu=mem:separate. (We recently switched from managed memory to separate memory for better performance.)

The short version of the issue: the shim layer that wraps our kernels was producing its results on the host (the copy routines it used are plain CPU loops), while a stale copy of the same arrays sat resident on the GPU and got read by the next step. The two copies had quietly diverged, which previously hadn’t been a problem because we were using managed memory. Claude proposed a fix, rebuilt, reran, and got bitwise-identical results through the whole run.
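
To make the failure mode concrete, here is a toy model in Python. It is not the MOM6 or AMReX code, and it does not touch a GPU: the names SeparateMemoryArray, host_repack and device_kernel_sum are made up for illustration, and "managed" memory is approximated by letting the host and device names point at the same list. Under those assumptions it shows why a host-only copy is harmless with managed memory and leaves stale data behind with separate memory.

```python
class SeparateMemoryArray:
    """Toy stand-in for one array that has a host copy and a device copy."""

    def __init__(self, values, managed=False):
        self.host = list(values)
        # Managed memory (assumption): both sides see the same storage.
        # Separate memory: the device copy is its own buffer.
        self.device = self.host if managed else list(values)


def host_repack(arr, new_values):
    # Plays the role of a plain CPU copy loop: it writes the host copy only.
    arr.host[:] = new_values


def device_kernel_sum(arr):
    # Plays the role of the next GPU step: it reads the device copy only.
    return sum(arr.device)


# Managed memory: the host write is visible to the "device".
managed = SeparateMemoryArray([1.0, 1.0, 1.0], managed=True)
host_repack(managed, [2.0, 2.0, 2.0])
managed_result = device_kernel_sum(managed)

# Separate memory: the "device" still holds the old values.
separate = SeparateMemoryArray([1.0, 1.0, 1.0], managed=False)
host_repack(separate, [2.0, 2.0, 2.0])
stale_result = device_kernel_sum(separate)

# An explicit host-to-device update brings the two copies back in sync.
separate.device[:] = separate.host
fresh_result = device_kernel_sum(separate)

print(managed_result, stale_result, fresh_result)
```

The last step, the explicit host-to-device update, corresponds loosely to the kind of fix that keeps the host loop and adds data movement around it.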

I could have stopped there. It worked, it was verified, the numbers matched. The fix had come from one of the most advanced models available as of this writing (Opus 4.8). But it still didn’t sit right with me. The idea that we were copying scratch arrays back to the CPU and up to the GPU again on every timestep seemed like the wrong thing to be doing in code whose whole point is to stay on the GPU. So I kept asking about it. Doesn’t this add data movement? Can we leave those arrays resident on the device instead? If the copy is the problem, can the copy itself run on the GPU?

A bit of that back-and-forth, lightly edited:

❯ So, we now need these updates for copy2Array calls, right?

● No — not every copy2Array. Only the ones that read an array whose live data may sit
  on the device. In these shims that's just h_in....

❯ Then why don't we have the copy2Array operation on the device too?

● Yes — and that's arguably the cleanest solution. If copy2Array/copy2F are offloaded
  (do concurrent) instead of host loops, the whole repack happens device-side and the
  manual target update directives disappear entirely.

(Here copy2Array is the routine that copies one of our Fortran arrays into the container the GPU kernels work on, and copy2F copies the result back out; h_in is an input thickness field.)

Each question moved it somewhere better, and the model did some reasoning each time. I was mostly refusing to accept the previous stopping point. The first fix was correct but the most expensive of the options; it routed everything through the host. A second version kept the arrays resident on the device but needed extra directives sprinkled around to keep them in sync. The one we ended up with was the simplest of the three: make the two small copy routines themselves run on the GPU. Once they do, they read and write the device-resident copies directly, nothing goes through the host, and the solver code didn’t need to change at all. Two small edits in one utility file, and it was still bitwise-identical.
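
The difference between the first fix and the final one can be sketched with another toy model in Python. Everything here is hypothetical (Field, device_kernel, step_host_copy, step_device_copy), the "kernel" is just a doubling, and transfers are counted by hand rather than measured; the intermediate directive-based version is left out because its details are not modeled. The point is only that both approaches give the same answer, while the device-side copy needs no host/device transfers at all.

```python
class Field:
    """Toy array with separate host and device copies and a transfer counter."""

    def __init__(self, values):
        self.host = list(values)
        self.device = list(values)
        self.transfers = 0

    def update_host(self):
        # Device -> host transfer.
        self.host[:] = self.device
        self.transfers += 1

    def update_device(self):
        # Host -> device transfer.
        self.device[:] = self.host
        self.transfers += 1


def device_kernel(field):
    # Stand-in for a GPU kernel: it works on the device copy only.
    field.device[:] = [2.0 * v for v in field.device]


def step_host_copy(src, dst):
    # First fix (as modeled here): the copy stays a CPU loop,
    # so data is pulled down before it and pushed back up after it.
    device_kernel(src)
    src.update_host()
    dst.host[:] = src.host
    dst.update_device()


def step_device_copy(src, dst):
    # Final fix (as modeled here): the copy itself runs on the device.
    device_kernel(src)
    dst.device[:] = src.device


def run(step, nsteps=3):
    src, dst = Field([1.0, 2.0]), Field([0.0, 0.0])
    for _ in range(nsteps):
        step(src, dst)
    return dst.device, src.transfers + dst.transfers


host_result, host_transfers = run(step_host_copy)
device_result, device_transfers = run(step_device_copy)
print(host_result == device_result, host_transfers, device_transfers)
```

In this model the host-routed version pays two transfers per step and the device-side version pays none, which is the cost that bothered me about the first fix.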

The idea behind the final solution was simple and quite obvious to me, but I still had to spell it out to the model that had done most of the heavy lifting to get there.

What I keep coming back to is that I didn’t get there by knowing more than the model about GPUs. I don’t, and I leaned on it for exactly that: the diagnostics, the memory-model semantics, checking that each version still reproduced. But the model is quick to treat “verified” as “finished,” and doesn’t necessarily keep searching for the cleanest solution.

Even though these models are subject matter experts in their own right, they don’t yet have a human’s instinct for what “feels right” or “optimal” in a codebase. An engineering instinct, if you will. What mattered wasn’t expertise or knowledge alone. It was a vague sense that something could be simpler, and being stubborn and skeptical enough to keep asking until the pieces fell into place.