CUDA Premature Optimization

One of our projects – a system that performs lengthy calculations with CUDA – started exhibiting strange behavior. It is one of those physical simulations that nobody without a master's degree in physics understands. The code was originally developed by PhD candidates in the lab and was transferred to us when they were in the final stages of their PhDs.

Recently we found out that a small portion of the calculations yields wrong results on weak GPUs. The behavior was consistent – these calculations always yielded wrong results on weak GPUs.

Our immediate suspect was float vs. double. The calculations are done in double-precision numbers, which are better supported on stronger NVIDIA GPUs. We suspected that tiny rounding errors on the weaker GPUs were, in some cases, accumulating into a drastic error.
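
For a sense of what we feared, here is a small, self-contained illustration (not from the project's code) of how single-precision rounding errors pile up where double precision stays accurate:

    #include <cstdio>

    int main() {
        // Add 0.1 ten million times. In float, each addition rounds to
        // roughly 7 significant digits and the errors accumulate; in
        // double, the rounding error stays negligible at this scale.
        float  f = 0.0f;
        double d = 0.0;
        for (int i = 0; i < 10000000; ++i) {
            f += 0.1f;
            d += 0.1;
        }
        printf("float:  %f\n", f);  // visibly far from 1000000
        printf("double: %f\n", d);  // very close to 1000000
        return 0;
    }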

Debugging CUDA kernels is really not fun. It is something you do your best to avoid. We wanted to avoid it, too, so instead of trying to debug the five kernels involved in the calculation, we added print-outs of the parameters before and after each kernel invocation. Since the kernels were running on data in GPU memory, we had to copy the data back to the CPU before printing it.
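
The debugging code looked roughly like this (the kernel and buffer names are illustrative, not the real ones):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Illustrative stand-in for one of the five real kernels.
    __global__ void step1_kernel(double* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0;
    }

    int main() {
        const int n = 1024;
        double h_data[n];
        for (int i = 0; i < n; ++i) h_data[i] = i * 0.5;

        double* d_data;
        cudaMalloc(&d_data, n * sizeof(double));
        cudaMemcpy(d_data, h_data, n * sizeof(double), cudaMemcpyHostToDevice);

        step1_kernel<<<(n + 255) / 256, 256>>>(d_data, n);

        // Debug print-out: copy the intermediate result back to the host
        // with the plain, synchronous cudaMemcpy and print a few values.
        cudaMemcpy(h_data, d_data, n * sizeof(double), cudaMemcpyDeviceToHost);
        for (int i = 0; i < 4; ++i)
            printf("after step1: h_data[%d] = %.17g\n", i, h_data[i]);

        cudaFree(d_data);
        return 0;
    }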

After spending a few hours trying to find any difference between the numbers, we realized that the problem had disappeared. Adding the print-outs had fixed the problem.

In C this usually means memory corruption of some sort. However, since the calculations were running on the device (the GPU, in CUDA lingo) and we had only changed code on the host (the CPU), we ruled this out. So it had to be something else our print-outs did.

It was the copy operation. Copying data from the device to the host is a synchronous operation (there is an asynchronous version, but we didn't use it here). Adding the copy operation synchronized everything on the GPU. We investigated further and found this:

The original programmers had decided to use a CUDA feature called streams. A CUDA stream is, well, a stream of operations (memory allocations, copies, kernel invocations, etc.) that execute in order, one after the other. If you have two streams, the two sequences of operations can run in parallel, so one stream can copy something to memory while the other is running a calculation.
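
A minimal sketch of the idea (illustrative names, and assuming pinned host memory, which asynchronous copies require to actually overlap):

    #include <cuda_runtime.h>

    // Illustrative kernel; the real ones belong to the simulation.
    __global__ void compute_kernel(double* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * data[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(double);

        double *d_a, *d_b, *h_b;
        cudaMalloc(&d_a, bytes);
        cudaMalloc(&d_b, bytes);
        cudaMallocHost(&h_b, bytes);  // pinned memory, needed for async copies
        cudaMemset(d_a, 0, bytes);
        cudaMemset(d_b, 0, bytes);

        cudaStream_t compute, memory;
        cudaStreamCreate(&compute);
        cudaStreamCreate(&memory);

        // The kernel and the copy are issued to different streams,
        // so the GPU is free to run them at the same time.
        compute_kernel<<<(n + 255) / 256, 256, 0, compute>>>(d_a, n);
        cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, memory);

        // Each stream is ordered internally; nothing orders the two
        // streams relative to each other unless you synchronize them.
        cudaStreamSynchronize(compute);
        cudaStreamSynchronize(memory);

        cudaStreamDestroy(compute);
        cudaStreamDestroy(memory);
        cudaFreeHost(h_b);
        cudaFree(d_a);
        cudaFree(d_b);
        return 0;
    }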

The first problematic calculation had two streams, one for memory operations and the other for kernel invocations. The flow of the code was like this: copy data to the GPU, run the calculation on the GPU, do some calculations on the CPU, copy the data back to the CPU. They forgot to wait for the GPU calculation to complete before copying the data back to the CPU.

On strong GPUs, the GPU calculation was done before the CPU got around to copying the data back. On weak GPUs, the CPU started copying the data while the calculation was still ongoing. This was a classic race condition. We added the proper synchronization and the problem disappeared.
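
A reconstruction of the bug and the fix (the stream and buffer names are ours, not the original code's):

    #include <cuda_runtime.h>

    // Illustrative kernel standing in for the real calculation.
    __global__ void calc_kernel(const double* in, double* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] + 1.0;
    }

    void do_cpu_work() { /* the CPU-side part of the calculation */ }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(double);

        double *h_in, *h_out, *d_in, *d_out;
        cudaMallocHost(&h_in, bytes);   // pinned host buffers
        cudaMallocHost(&h_out, bytes);
        cudaMalloc(&d_in, bytes);
        cudaMalloc(&d_out, bytes);

        cudaStream_t mem_stream, kernel_stream;
        cudaStreamCreate(&mem_stream);
        cudaStreamCreate(&kernel_stream);

        // Copy the input to the GPU and wait for it to arrive.
        cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, mem_stream);
        cudaStreamSynchronize(mem_stream);

        // Start the GPU calculation and overlap it with CPU work.
        calc_kernel<<<(n + 255) / 256, 256, 0, kernel_stream>>>(d_in, d_out, n);
        do_cpu_work();

        // THE FIX: without the next line, the copy below races the kernel,
        // because the two streams are not ordered against each other. A fast
        // GPU usually finishes first anyway; a slow one loses the race.
        cudaStreamSynchronize(kernel_stream);

        cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, mem_stream);
        cudaStreamSynchronize(mem_stream);

        cudaStreamDestroy(mem_stream);
        cudaStreamDestroy(kernel_stream);
        cudaFreeHost(h_in);
        cudaFreeHost(h_out);
        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }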

All the other calculation errors on weak GPUs that we found were of the same sort. Sometimes two streams weren't synchronized properly, sometimes three, but it was always the same basic problem.

In all the calculations involving multiple streams we saw the same thing – the code works correctly as long as you make sure only one stream is active at any given moment. Which raises the question – why use multiple streams in the first place?

“Because we thought that at some point it would be useful” was the answer.

Even though the people responsible for this have real-world jobs, they agreed to write a program that prints “Premature optimization is the root of all evil” 100 times. I think that shows respect.