Debugging is part of every programmer's job. There is a popular question on Quora: "What's the hardest bug you've ever debugged?" Among the many responses, Dave Baggett's story was the most breathtaking, with more than 5,500 upvotes.
Recalling this bug is still a little painful. As a programmer, you learn to look for the problem in your own code first. Then, perhaps after 10,000 tests, you blame the compiler. Only when all of that fails do you blame the hardware.
This is the story of a hardware bug I encountered.
Among other things, I wrote the memory card (load/save) code for "Crash Bandicoot". To a cocky game programmer this looked like a walk in the park, and I thought it would be finished in a few days. Debugging it ended up taking six weeks. I did other things during that time, but I kept coming back to this bug for a few hours every few days. It was agonizing.
The symptom was this: when you saved your progress, the code accessed the memory card, and most of the time nothing went wrong... but occasionally a read or write would time out for no obvious reason. A short write would often corrupt the memory card. A player who tried to save would not only fail to save but also wipe everything on the memory card. Oh, my God.
After a while, our Sony producer, Connie Booth, began to panic. We obviously couldn't ship the game with this bug, and after six weeks I still had no clue what the problem was. Through Connie we asked other PS1 developers: had anyone seen anything like this? Absolutely no one had any problems with the memory card system.
Once you have racked your brains and run out of ideas, the only thing left is divide and conquer: remove more and more of the program's code until you are left with something very small that still shows the problem. It is like wood carving: you cut away the code that is fine, and what remains is your bug.
The challenge here is that it is hard to remove just one part of a video game. How do you run the game after deleting the code that simulates gravity or draws the characters?
What you have to do is replace an entire module with a stub that pretends to do the real thing but actually does something very simple that cannot contain the bug. You have to write new scaffolding code just to keep everything working. It is a slow and painful process.
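The stubbing idea can be sketched roughly as follows. This is an illustration in Python, not the game's actual code: `RealRenderer`, `StubRenderer`, and `save_progress` are hypothetical names, and the "memory card" is just a list.

```python
class RealRenderer:
    """Stands in for a large, complicated module (hypothetical)."""

    def draw_menu(self, items):
        # In this sketch the real module is simply unavailable.
        raise RuntimeError("real graphics code is not part of this sketch")


class StubRenderer:
    """Same interface as RealRenderer, but trivially simple.

    It only records what it was asked to draw, so it is very unlikely
    to be the source of a timing or hardware bug.
    """

    def __init__(self):
        self.calls = []

    def draw_menu(self, items):
        self.calls.append(tuple(items))


def save_progress(renderer, card, data):
    """Code under test: shows a menu, then writes to the (fake) card."""
    renderer.draw_menu(["Save", "Cancel"])
    card.append(bytes(data))
    return len(card)


# Swap the stub in for the real module and exercise only the save path.
stub = StubRenderer()
card = []
save_progress(stub, card, b"level-3")
```

Because the stub keeps the same interface, the code that calls it does not change. If the bug still shows up with the stub in place, the replaced module is probably not the culprit.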
To make a long story short: I did it. I removed larger and larger chunks of code until only the initialization code was left: the code that sets up the system to run the game, initializes the underlying hardware, and so on. Of course, I couldn't display the load/save menu at that point, because I had stubbed out all the graphics code. But I could pretend that the user was using the (invisible) load/save screen and asking to save, and then write to the card.
I ended up with a very small amount of code that still had the bug, but the problem was still random! Most of the time nothing went wrong, but occasionally it failed. Almost all of the actual Crash code had been removed, and it still happened. This was really puzzling: the code that was left basically didn't do anything.
At some point, probably around 3 o'clock in the morning, an idea popped into my head. Reading and writing (I/O) involves precise timing. Whether it's a hard drive, a memory card, or a Bluetooth transmitter, the low-level code that does the reading and writing relies on a clock.
The clock lets a hardware device that is not directly connected to the CPU stay in sync with the code running on the CPU. The clock determines the baud rate, the rate at which data is sent from one end to the other. If the timing goes wrong, the hardware, the software, or both get confused. This is really, really bad, and it usually results in data corruption.
What if something in our initialization code was messing up the timing? I looked again at the timing-related code in the test program and noticed that we set the programmable timer on the PS1 to 1 kHz (1,000 ticks per second). That is relatively fast; when the PS1 boots, the default is probably 100 Hz. As a result, most games run with the timer at 100 Hz.
Andy, the game's other developer (the only one besides me), had set the timer to 1 kHz so that Crash's movement calculations would be more accurate. Andy tends to overdo things, and if we were going to simulate gravity, we were going to do it as accurately as possible!
But what if speeding up the timer somehow interfered with the overall timing of the program, and therefore with the clock that sets the baud rate for the memory card?
I commented out the timer code. After that I could no longer reproduce the bug. But that did not mean the bug was fixed, because the problem was random. What if I had just been lucky?
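The worry about luck can be put into numbers. The sketch below assumes that each save attempt fails independently with the same probability; the 5% failure rate is a made-up figure for illustration, not a measurement from the story.

```python
import math


def chance_of_clean_streak(p_fail, runs):
    """Probability that `runs` independent attempts all succeed
    even though each one still fails with probability `p_fail`."""
    return (1.0 - p_fail) ** runs


def runs_needed(p_fail, max_luck):
    """Smallest number of clean runs after which pure luck is less
    likely than `max_luck` (for example 0.01 for 1%)."""
    return math.ceil(math.log(max_luck) / math.log(1.0 - p_fail))


# Hypothetical numbers: the bug shows up in 5% of saves, and we want
# less than a 1% chance that a clean streak is just luck.
p_fail = 0.05
needed = runs_needed(p_fail, 0.01)
luck = chance_of_clean_streak(p_fail, needed)
```

The rarer the failure, the longer the clean streak has to be before it means anything. That is why a few successful runs after a change say very little about an intermittent bug.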
A few days later I was still running my test program, and the bug had not come back. I went back to the full Crash code base and modified the load/save code to reset the programmable timer to its default setting (100 Hz) before accessing the memory card and then set it back to 1 kHz afterward. The problem never occurred again.
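The shape of that workaround (drop to the default rate, do the card I/O, restore the faster rate) can be sketched as below. This is Python for illustration only: `FakeTimer`, `timer_at`, and `save_to_card` are hypothetical names, and nothing here is the PS1 API.

```python
from contextlib import contextmanager

DEFAULT_HZ = 100   # default timer rate mentioned in the text
GAME_HZ = 1000     # rate the game ran at


class FakeTimer:
    """Hypothetical stand-in for the programmable timer; it only stores a rate."""

    def __init__(self, hz):
        self.hz = hz

    def set_rate(self, hz):
        self.hz = hz


@contextmanager
def timer_at(timer, hz):
    """Temporarily run the timer at `hz`, then restore the previous rate,
    even if the code inside the block raises."""
    previous = timer.hz
    timer.set_rate(hz)
    try:
        yield timer
    finally:
        timer.set_rate(previous)


def save_to_card(timer, card, data):
    """Write to the (fake) card only while the timer is at the default rate."""
    with timer_at(timer, DEFAULT_HZ):
        card.append((timer.hz, bytes(data)))


timer = FakeTimer(GAME_HZ)
card = []
save_to_card(timer, card, b"progress")
```

Restoring the rate in a `finally` clause means a failed write cannot leave the game running at the wrong timer rate.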
But... Why?
I went back to the test program and tried to find a pattern in the errors that occurred with the timer set to 1 kHz. Eventually I noticed that the errors happened when someone was using the PS1 controller. I rarely did that myself (why would I touch the controller while testing load/save code?), so I hadn't noticed. But one day one of our artists was waiting for me to finish testing (I'm sure I was cursing at the time), and he was nervously fiddling with the controller. The card got corrupted. "Wait, what's going on? Hey, do that again!"
Once I realized the two things were connected, the bug was easy to reproduce: start writing to the memory card, wiggle the controller, and the memory card is corrupted. It looked to me like a genuine hardware bug.
I went to Connie and told her what I had found. She relayed it to a hardware engineer who had designed the PS1. She was told: "No, it can't be a hardware problem." I asked her whether I could talk to him directly.
The engineer called me. He used his broken English, I used my even worse Japanese, and we argued for a while. Finally I said: "I'll send you a 30-line test program. Run it and move the controller, and you will see the problem." He agreed. He assured me that it was a waste of time and that he was busy with a new project, but because we were a very important developer for Sony, he would try it.
The next evening (we were in Los Angeles and he was in Tokyo, so my evening was his next day), he called me and apologized. It was a hardware problem.
I still don't know exactly what the problem was, but my impression from Sony HQ's feedback is that if you set the programmable timer to a high enough clock rate, it interferes with things near the timer crystal on the motherboard. One of those things is the baud rate controller for the memory card, which also sets the baud rate for the controllers. I'm not a hardware person, so I'm pretty vague about the details.
The gist, though, is that crosstalk between two separate parts of the motherboard, combined with data being sent over both the controller port and the memory card port while the timer ran at 1 kHz, caused bits to be dropped. The data was lost, and the card was corrupted.
This is the only time in my entire programming career that I've debugged a problem caused by quantum mechanics.
Well, now that you've read Dave's story, it's your turn.
Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.