OpenCL and All Zero Output Arrays

This is kind of like a PSA for new-to-OpenCL programmers.  If that description doesn’t fit you, then skipping this post won’t be any great loss.

One of the issues I recently ran up against with my OpenCL code was all of my output arrays coming back as solid blocks of zeros. Every single one. It was incredibly frustrating and seemed to fail randomly.  The only thing I was changing was the size of the input and output arrays, but as I neared what seemed to be some undetermined critical threshold, the program would inexplicably work sometimes and fail other times.  All I knew for certain was that small arrays worked and large arrays did not.  No error codes, no nothing.

After digging around the internet, I found the “issue” to be largely undocumented, with the exception of a couple of forum posts that aren’t really related.  So, based on my personal experience, I’m here to provide one source of “documentation.”  None of this is official; keep that in mind.

I’ve found that this error can be caused by a couple of things:

1) Improper array indexing.  For instance, if “A” is an array, reading or writing A[-1] can cause the kernel to fail in this way (a little sketch of what I mean is below).  These failures are usually pretty quick: just about as soon as you try to run your kernel, everything will return.  Unless, of course, it takes a while to get to those negative indices.
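Here’s a hedged little example of the kind of bug I mean (the kernel name and argument are made up purely for illustration):

__kernel void bad_index(__global float *A){
    int i = get_global_id(0) - 1;   /* work-item 0 ends up touching A[-1] */
    A[i] = 1.0f;                    /* out-of-bounds write: outputs come back all zeros */
}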

2) Improper casting of variables.  See https://devtalk.nvidia.com/default/topic/452560/cuda-programming-and-performance/simple-opencl-program-compiles-and-runs-gives-incorrect-output/#entry952629 for an example.  They go into a lot of detail there, so I won’t repeat it here.

3) Here’s the one that nearly made me throw my laptop off of the roof of my building: Nvidia’s built-in watchdog timer for GPUs connected to displays.  Let’s talk about this one in more detail.

The watchdog timer, despite being deeply frustrating in a lot of cases, is a good thing.  Basically, when your host program deploys a kernel, the GPU ceases to compute the information to drive your display and focuses on the calculation(s) that you’ve given it; put another way: your display freezes while the GPU computes.  To prevent you from having to do a hard reset if you write bad code (which is very easy to do) the watchdog timer forces the kernel to return if it takes longer than some amount of time.

Now, from my digging, it sounds like in CUDA when this happens, you’ll get a little error message telling you that’s what happened.  Perhaps in more modern versions of OpenCL (I’m only running 1.0 on my laptop) you get something like this too. But if you DON’T, as in my case, you end up with a situation where you have an irreproducible error that you can spend a week trying to debug.

If you want to verify that timeouts are actually what you’re encountering, write a simple kernel that should produce data output, but is actually an infinite loop.  Something like:

__kernel void timeout_test(__global float *debug){
    debug[0] = 1.0f;
    /* i cycles through 0..1023, so the condition never goes false and the
       kernel spins until the watchdog kills it; the write inside keeps the
       loop from being optimized away */
    for (int i = 0; i >= 0; i = (i + 1) & 1023){
        debug[0] += 0.0f;
    }
}

When you deploy this kernel, you’d expect the first element of the “debug” array to be one when the function returns (obviously it won’t be).  This program should run until the watchdog timer kills it.  debug[0] will come back as zero, and I’d bet it will have taken approximately the same amount of time for the program to “execute” as your other programs took before they failed.  For my machines, it’s about 10-12 seconds.
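If it helps, here’s a rough host-side sketch of how you might time the launch and read back the result.  It assumes you already have a context, command queue, and built program set up elsewhere (all the usual OpenCL boilerplate), so the variable names here are placeholders:

/* assumes <stdio.h>, <time.h>, and the OpenCL header are included,
   plus an existing context, queue, and program */
cl_int err;
cl_float debug_host = 0.0f;
cl_mem debug_buf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                  sizeof(cl_float), NULL, &err);
cl_kernel kernel = clCreateKernel(program, "timeout_test", &err);
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &debug_buf);

size_t global_size = 1;
time_t start = time(NULL);
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                             NULL, 0, NULL, NULL);
/* blocking read: returns once the kernel has finished (or been killed) */
err = clEnqueueReadBuffer(queue, debug_buf, CL_TRUE, 0, sizeof(cl_float),
                          &debug_host, 0, NULL, NULL);
printf("debug[0] = %f after about %ld seconds\n",
       debug_host, (long)(time(NULL) - start));

If the watchdog really is killing the kernel, the printed value should be 0.0 and the elapsed time should be suspiciously consistent from run to run.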

There’s good news and bad news about this.  If you’re a Windows or Linux user, you can supposedly lengthen the watchdog timeout as needed by editing a registry key or driver setting.  We Mac users are out of luck, although I did see one discussion on the Khronos Group (the body that sets the OpenCL standard) forums saying they would look into somehow adapting this feature in the future (given that it’s also an Nvidia thing, though, I’m not sure how likely that is).  Either way, I definitely wouldn’t disable the timer entirely.

There are a couple of different solutions to this, although you’ll have to decide what works best for you.  I’m working on CT image reconstruction, so I’ll put little examples in that context.

(1) Break your calls to clEnqueueNDRangeKernel up into small global work sizes.  

On my machine, each call to clEnqueueNDRangeKernel is independently subject to the watchdog timer.  My global work size is ideally {2304,736,32}, but there’s no way in hell my computer will process that in one shot.  I was able to get around the kernel timeouts by breaking things up into separate calls to the GPU.  Instead of enqueueing one huge batch, I sequentially enqueued several NDRangeKernel calls with global sizes of {288,736,32} each (this is just an example; see the sketch below).  The only downside was that my datasets are so large that the time it takes to copy them over in between kernel executions offsets any gain I get from running them on a GPU.
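For what it’s worth, here’s a rough sketch of one way to do the splitting.  The “x_offset” kernel argument and its index are made up for illustration; the idea is just that each chunk gets its own, smaller launch and the kernel adds the offset to its global ID so it still indexes into the full dataset:

size_t chunk[3] = {288, 736, 32};   /* per-launch global work size */
size_t full_x   = 2304;             /* full problem size along the first dimension */
for (size_t x0 = 0; x0 < full_x; x0 += chunk[0]){
    cl_uint x_offset = (cl_uint)x0;
    err = clSetKernelArg(kernel, 3, sizeof(cl_uint), &x_offset);  /* argument index is made up */
    err = clEnqueueNDRangeKernel(queue, kernel, 3, NULL,
                                 chunk, NULL, 0, NULL, NULL);
    err = clFinish(queue);          /* let each chunk finish before starting the next */
}
/* ...and in the kernel: int gx = get_global_id(0) + x_offset; */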

(2) Code your kernel more efficiently

If your code is:

__kernel void add(__global float *a, __global float *b, __global float *c){
    int i = get_global_id(0);   /* get_global_id takes the dimension index */
    c[i] = a[i] + b[i];
}

there’s probably not a lot you can do to make that run faster.  But if your code looks more like:

__kernel void complex_kernel(__global float *raw, __global float *data, …)
    for (int i=0;i<15/23.0f;i++)
        if (raw[i]<sizeof(float)){
            for (int j=0; …
                ….
                    …
                        …
                            …
            }}}}}
        }
    }
}

you should maybe look into what kinds of computations GPUs are actually good at, as well as some methods for code optimization.  Unrolling loops, using vector operations, and many other tricks can make GPU computing much more efficient.  I’d recommend picking up a copy of OpenCL in Action by Matthew Scarpino (http://www.manning.com/scarpino2/).
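To give one small flavor of what “using vector operations” can look like, here’s a hedged sketch of an element-wise add done with float4, so each work-item handles four values at once (the global work size has to shrink by a factor of four to match, and this assumes the array length is a multiple of four):

__kernel void add_vec4(__global const float4 *a,
                       __global const float4 *b,
                       __global float4 *c){
    int i = get_global_id(0);
    c[i] = a[i] + b[i];   /* one vector operation covers four scalar adds */
}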

An example from my own coding: swapping out a linear array search for a binary search allowed me to double my input/output array size before I ran into the watchdog timer.  It still didn’t get me to where I wanted to be, but it definitely made for a more efficient kernel.
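In case it’s useful, here’s a generic sketch of what a kernel-side binary search over a sorted array can look like.  This is purely illustrative and not my actual reconstruction code:

int binary_search(__global const float *arr, int n, float target){
    int lo = 0;
    int hi = n - 1;
    while (lo <= hi){
        int mid = lo + (hi - lo) / 2;   /* written this way to avoid overflowing (lo + hi) */
        if (arr[mid] < target)
            lo = mid + 1;
        else if (arr[mid] > target)
            hi = mid - 1;
        else
            return mid;                 /* found it */
    }
    return -1;                          /* not found */
}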

(3) Add a second GPU to your setup

Probably not what most of us want to hear, but a second GPU in compute mode will just churn away at these calculations for as long as you want with no timeouts.  If serious GPU computing is what you want, this is where you’ve gotta go.

I hope this saves you from heaving your laptop off a bridge… or maybe even just a little bit of hair pulling.

<3,

John