The Most Infuriating Computer Programming Bug I Ever Fixed

This post was published on the now-closed HuffPost Contributor platform.

What is the most interesting bug you have ever solved in a computer program? originally appeared on Quora: the place to gain and share knowledge, empowering people to learn from others and better understand the world.

Answer by Udayan Banerji, Software Engineer, on Quora:

I spent two weeks investigating a bug, and the fix was a one-line change.

While working as a compiler engineer at Intel, I was once assigned a weird bug. It was in an Android app, basically a Java benchmark, that would randomly crash. The app had one button, and clicking that button started a long-running execution of the whole suite of benchmarks.

  1. I did not have the source code of the app, but I could see the bytecode. I first tried running it through the debugger. I tried at least thirty times and it never crashed.
  2. I ran the app normally again, and it randomly crashed. Eventually I figured out that it crashed every twentieth time I ran the benchmark.
  3. I scoured through the bytecodes for anything that had twenty in it. Any loops of twenty, any recursions. Nothing. The program kept crashing.
  4. This was getting serious now. It seemed easier to just smash the computer keyboard on this Android phone and make the pain go away.
  5. After a weekend, I came back to the issue. I went back to the crash in Java. The core issue was an assertion failure - a large floating point number was not equal to NaN (“Not a Number”).
  6. I went back to the bytecodes and looked for floating point divisions. One by one, I isolated about a dozen of the bytecode sequences, converted them to x86 assembly, put each in a long-running loop, and executed them. Finally, one of them crashed every twentieth time. I could see the light at the end of my carpal tunnel.
  7. I analyzed the assembly code and saw eight divide by zero operations. Aha! Divide by zeros produce NaN! So our compiler's divide by 0 is broken … umm, somehow?
  8. Except no, a handwritten assembly divide by zero worked fine. Frustrated, I ran a loop of twenty divide by zeros, and it passed as well. I then wrote a bunch of random assembly instructions after those, and the first one gave a wrong result.
  9. Wait what?
  10. Finally, I went to gdb and dumped the values of all CPU registers for these operations.
  11. It was then that I noticed a trend. The x87 register stack was filling up slowly, and then staying put at capacity (8 items).
  12. Turns out, the problem involved the ancient x87 unit in the chip, the one responsible for doing floating point operations. We were using it in the compiler for all floating point operations, and every path except the divide by zero path was emptying its register stack after use.
  13. It seems that on a stack overflow it did not throw an error, but returned NaN no matter what you ran through it. Which is also the value you get when you divide by zero. (Basically, the stack overflow error, called a stack fault, is sticky. Once it happens, you have to clear it manually in the compiler or it keeps happening.)
  14. So after eight divide by zeros, the stack fills up, and from then on it treats any operation as a divide by zero and returns NaN.
  15. The fix was a one-line code change to clear the stack on the divide by zero path.
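
The mechanism in steps 11 to 15 can be sketched with a toy model. This is not the actual compiler code or the actual fix, only a rough illustration under stated assumptions: `ToyX87Stack`, `divide`, `run`, and the `clear_on_zero` flag are hypothetical names, and the model only loosely imitates x87 behavior (eight slots, a load onto a full stack yields NaN, and a fault flag that nothing clears automatically). One detail worth noting: under IEEE 754 a nonzero value divided by zero gives infinity, and only 0.0 / 0.0 gives NaN, so in this sketch the NaN comes from the overflowing load rather than from the division itself.

```python
import math


def ieee_divide(a, b):
    """IEEE 754 style division (Python itself raises ZeroDivisionError)."""
    if b == 0.0:
        if a == 0.0 or math.isnan(a):
            return math.nan
        return math.copysign(math.inf, a) * math.copysign(1.0, b)
    return a / b


class ToyX87Stack:
    """Toy 8-slot register stack; an illustration, not an x87 emulator."""

    CAPACITY = 8

    def __init__(self):
        self.slots = []
        self.stack_fault = False  # assumed sticky: nothing here resets it

    def load(self, value):
        if len(self.slots) == self.CAPACITY:
            # Overflow: no error is raised, the new top of stack is NaN.
            self.stack_fault = True
            self.slots.pop(0)
            self.slots.append(math.nan)
        else:
            self.slots.append(value)

    def divide_top(self, divisor):
        self.slots[-1] = ieee_divide(self.slots[-1], divisor)

    def store_and_pop(self):
        return self.slots.pop()


def divide(stack, a, b, clear_on_zero):
    """Load a, divide by b, and hand the result back."""
    stack.load(a)
    stack.divide_top(b)
    if b == 0.0 and not clear_on_zero:
        # Buggy path: the result is read, but its slot is never popped.
        return stack.slots[-1]
    return stack.store_and_pop()


def run(clear_on_zero):
    stack = ToyX87Stack()
    for _ in range(8):  # eight divide by zeros, as in step 14
        divide(stack, 1.0, 0.0, clear_on_zero)
    # An unrelated, perfectly ordinary division afterwards.
    return divide(stack, 6.0, 3.0, clear_on_zero)


buggy = run(clear_on_zero=False)
fixed = run(clear_on_zero=True)
print(buggy, fixed)  # prints "nan 2.0"
```

In the leaky run, each divide by zero leaves one slot occupied, so the ninth load overflows and the ordinary 6.0 / 3.0 comes back as NaN. Popping on the divide by zero path keeps the stack balanced, and the same division returns 2.0.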

If you want to see the actual change, here it is: Gerrit Code Review. Note that most changes are comments. There are four lines of code changes, but three are identical and one is loading a value.

