Ever since reading Zen and the Art of Motorcyle maintenance, I've realized there are lots of the stories I would like to tell about what I do on a daily basis that fall under the category of "fishing stories" or "blow-by-blow" recountings of how I solved problems. I've gotten pretty keenly tuned into the fact that most people don't really care about the details (i.e. how you caught the fish), just whether or not you figured things out. It's definitely disappointing when people zone this part out, because that's where the triumphs really lie and the learning really happens. The recent TV show Mr. Robot has an episode where they use debugging code as a theme and metaphor throughout ultimately arriving at the "debugging is all about finding the bug and understanding why the bug was there to begin with."
This post is a fishing story. It is the story of perhaps the most challenging-to-find bug I've encountered thus far in my CT recon software and marks the point at which we're pretty much ready for actual beta testing in prep for its release. To "commemorate" that, I want to share it. It was a bear to find, but I also wanted to share it to give a glimpse into a side of programming that folks may not often see and show my logical processes when I'm in the thick of it. In many senses, this part of programming is not all that different than working on a car or plumbing or running electrical wires: following the logical paths of how things should work, trying to isolate what could and couldn't affect the final results, narrowing, and ultimately identifying the source of the issue. Anyways, this is a little different than the stuff I typically do, so I hope that at least a few people enjoy it!
Philosophical implications of Mr. Robot notwithstanding, most of the debugging process truly is about finding bugs rather than fixing them, at least insofar as time is concerned. One could develop a million different sub-classes of bugs, but I think of there really only being three major categories: (1) syntax errors, (2) run-time errors (3) Results errors.
Syntax errors mean the compiler or interpreter of your code doesn't know what to do with something that you've written. In a language like C, this could mean you forgot to end a line with a semicolon, or used a variable without declaring it, or any number of things. The thing about syntax errors is that they are easy to fix: your compiler or interpreter can tell you damn near exactly where they stopped being able to understand what you were trying to do, and even sometime suggest fixes. These are the things you have to fix before your code will even begin to run.
Run-time errors occur after you've gotten your code to "build" or run. Let's say you have two input datasets, and on one of the sets your code runs fine, but on the second set you get a error about something called a SEGFAULT (short for segmentation fault). This is an example of a runtime error. You clearly didn't plan for something in the second dataset, and the segfault indicates that it was likely that the second dataset was a different size than the first, and you tried to store or read something outside of the bounds of what you told the program you would. Runtime errors for the most part are pretty straightforward to debug with the right tools. With a good debugger you can ID the line of code causing something like a segfault, or a divide-by-zero, but they can also be tough if say the segfault only occurs when running in non-debug mode (has happened to me). At least though the code will typically crash on you with a run-time error so you know something is going wrong. If it crashes, the crash source can usually be identified.
The final set of errors are what I would call results errors. Not all programs are built to generate static output data that would be called "results," but every programmer has encountered a situation where the program compiled just fine, ran fine, but just didn't do what it was supposed to do and didn't provide many clues as to why. These errors are the real troublemakers. The incorrect results are often the product of many many steps, and the error(s) can be present at just about any point. This is the point at which we really have to put on our analysis hats and get to work.
I discovered my particular error when I was benchmarking and comparing the CPU and GPU versions of my code. This code in particular is CT reconstruction code that I've been working on for a little over a year. The actual code is pretty basic in that it's all just number crunching, but the algorithms are fairly complex, prone to error, and then very sensitive to those errors. Tiny offset issues can cause wild fluctuations in your final reconstructed image that essentially make it worthless. Add to that the fact that we're hoping to use this software as a stand-in for FDA approved reconstruction software on clinical hardware and suddenly 95% correct means very very little; only 100% is acceptable (well okay, maybe 99% is also acceptable).
The CT reconstruction algorithm I'm implementing consist of three major steps (1) rebinning (2) filtering and (3) backprojection. Filtering and backprojection are fairly straightforward, but rebinning has four sub-categories that I label n_ffs, p_ffs, z_ffs, and a_ffs, and only one of those categories is used per reconstruction. It's not important that you the reader understand exactly what those are, but briefly, they correspond to different "flying focal spot" configurations. Flying focal spots are nothing more than a very clever trick to improve detector sampling in CT and if you'd like to read more about flying focal spots there are some solid resources on the web, my code actually being one of the better ones out there these days!
When I test my code, I typically need to test all four of these rebinning conditions, and that's exactly what I was doing a couple of weeks ago. Since the code should be scientifically accurate and we provide essentially two different programs (one for CPU, one for GPU), it's important that given the same input data, the produce the same output. What I found was that while the n_ffs, p_ffs, and z_ffs cases produced results that were almost identical (some differences are ok due the the GPU hardware), the a_ffs case was significantly different. After checking that I had configured the reconstructions correctly (something very very important to always do), I concluded there was an error somewhere in the code, and it clearly lay in the a_ffs case when reconstructed using the GPU.
I spent roughly half a day at first going over the two implemenations line-by-line looking for obvious discrepancies such as minus signs in the wrong places, missing offsets, improper indexing over my arrays, but didn't find anything. I had a lot of other stuff to work on, so I shelved it for the time being and just knew in the back of my head that I'd have to deal with it at some point. Sometimes taking a break from something can help you come back and see something obvious too (that was, unfortunately, not the case here).
Fastforward to yesterday.
Having finished the bulk of my other work and leaving for vacation in two days, I decided I would spend a couple of days trying to diagnose whatever error I was experiencing. From my previous experience, I know that the rebinning code is cantankerous, difficult, and where pretty much all of my prior problems had previously originated from, but I was also able to narrow it down because of how I was running the program. I had temporarily configured one executable to do rebinning and filtering on the CPU with backprojection on the GPU and another to do everything the GPU. Thus, I could rule out the backprojection being the culprit of this error. My gut told me that, while technically a possibility, filtering was most likely not the source of my troubles.
I got out my analytic knife (if you haven't read Zen and the Art of Motorcycle Maintenance you really should...) and I disabled the filtering in both codes to see what would happen. For those not diagnostically inclined, if the problem were in the filtering, by removing the filtering process and looking at output images similar to the images above, we should see the problem disappear (i.e. the two reconstructions will be the exact same, and the difference image on the far right will be solid gray). Now, you'll notice the images now look like they have halos which is what happens when you remove the filtering step, but alas, we're still seeing differences between the two images. This means that whatever we're up against has to lie exclusively in the GPU a_ffs rebinning code.
Ok, so emacs tells me that the a_ffs GPU code is 450 lines of code split between two files. How long could it possibly take to find the issue? Hah!
After my previous attempt to nail down an obvious issue, I figured it was something buried a little more deeply. For posterity however, I did one more check of each line to make sure there wasn't anything glaring. There wasn't. Now it was time to start looking at the incremental outputs of the rebinning algorithms. This means that we want to check in on what our variables look like at different points throughout the execution and make sure they're behaving how we'd expect. The a_ffs rebinning itself has three steps: (1) extract the raw data and reshape it (2) rebin in one direction (3 and kinda 4) rebin in the other direction and then filter the data. After each of these steps, I have an option to write the current version of our raw data to disk. Looking at these data dumps and looking for differences between the CPU and GPU should help narrow down further where the trouble lies. I like to visualize the raw data as something called a "sinogram" and that's what the images I'm going to show are of. Again, it's not critical that you understand exactly what a sinogram is, but understand that at every step of the rebinning process we want to compare the CPU and GPU implementations by looking at difference images, where we do a pixel-by-pixel subtraction and note any significant differences. In this work, we have a sinogram for every detector row present, so 64 to compare and help us ID and discrepancies.
(NOTE: I'm only going to post a limited subset of some of these images because they're large and somewhat painful to generate. They also all look VERY similar to someone not getting paid to stare at them all day...)
The first set to look at is the finished rebin to see if anything glaringly obvious stands out. Quickly scrolling through each set individually didn't yield any great insight into the problem, so I took the difference and compared that. Lo and behold, every other row had a sinogram that was wildly different between the GPU and CPU. The other important thing to notice is that the error is very systematic. Ok, now we're getting somewhere.
I didn't realize it at the time, but this really narrowed it down to a very small block of code that could be causing my trouble. In order for it to effect only every other row, the data could only be operated on by two of the 7 or 8 kernels (GPU functions) comprising all together ~ 20 lines of code. Then there was the host code that configured these which maybe added about 20-30 more lines of code. While this now seems retroactively obvious, I was getting a little desperate and perhaps not thinking 100% analytically. I had already pored over those exact lines of code again and again looking for errors. I tried just randomly flipping things around to see if I could produce any output that even remotely effected the error I was getting. Nada. Just nothing.
This led me to think that maybe I had the wrong data in the arrays that contributed to the projections that were going back, so I compared the raw data I was getting in the CPU and GPU implementations. Nothing. They were exactly the same. Finally I compared the sinograms after the first rebin (step 2 as listed a little earlier) and the error had already shown up there so that settled it. Whatever was happening was happening after the raw data had been loaded and reshaped, but before the rebinning in the second direction.
Now I was completely despondent. "But I checked those lines of code already!" I kept repeating to myself (along with some other choice words...). About two hours went by of staring, fiddling, compiling, running, swearing, more fiddling, more compiling, more swearing... etc. At about 4 PM I finally decided to do other work before I went home.
With about 30 minutes left in the day, I couldn't just let such a glaring error sit there and went back to it. It's luck I did.
One of the things I was investigating originally when making sure the results matched between the CPU and GPU was how the GPU's texturing hardware was impacting the finished product. Rebinning is essentially just doing millions, maybe billions of interpolations. GPU's have built-in hardware dedicated to interpolations that make them extremely fast at doing them, however they use 9 bit fixed-point arithmetic instead of a full-on floating point linear interpolation. While there is a slight difference (see figures below), this is diagnostically insignificant (<1HU) and the speed gains are more important to the reconstruction process. Texturing hardware takes some configuration, but once they're configured they really don't take too much more effort. On a whim, I decided that the textures were configured properly and yep, they were... wait:
See if you can spot it.
Anything? On the left, we have our texture configurations and on the right, we have our texture declarations. In the a_ffs rebinning we use all four of those textures. Still don't see it?
I had forgotten to configure textures tex_c and tex_d meaning they were left to do whatever type of interpolations their hearts desired (most likely nearest neighbor, or were interpreting my floating point arguments I was passing them as normalized coordinates). I had just completely taken it for granted that they were set up properly and hadn't though to check them in, literally, months! Arrrghhhh. The following fix takes care of it:
And we rebuild our code and run our GPU/CPU compare script....
And voila! Two identical images (well, up to floating vs fixed point differences).
For me there are a lot of interesting aspects to this experience. For one, I found and fixed to other fairly significant errors in the process that I would not have found if I hadn't had to go through all of this. Second, it really took eliminating every single other possible cause of this error before I finally happened upon the actual cause. I really had not even considered texture configuration since I initially wrote the code months ago. Just goes to show that when debugging you really can't take anything for granted.
Second, this is the second instance of a software interface doing something unexpected without throwing warnings I've experienced in the last couple of weeks. The first was in a MATLAB toolbox that was accepting case insensitive options without throwing errors, but actually requiring case-sensitive parameters (error handling is important guys!). The second was this texture issue. I an appreciate that it's nice to have a default configuration, but it might alleviate this issue in the future if the compiler threw a (toggle-able) warning about declared yet unconfigured textures. That would probably piss a bunch of CUDA programmers off ('I never USED to have to check-in my textures prior to running them and the default configuration ALWAYS worked perfectly for me!'), but hey, if it could save someone the experience I just went through I'd put up with compiler warnings for the rest of my life.
Anyways, I hope someone gets something out of this long LONG ramble-y fishing story about debugging. It was cathartic to write. It's also nice to talk about reconstruction, even just to the ether, since it's what I spend ~40 hours of my week working on with only a handful of people in the world familiar enough with the topic to really get into this nitty gritty stuff.
I'm off to pack for NC.