The date is September 9, 1947.
U.S. Navy officer Grace Hopper is working on a Harvard Mark II computer when she notices that it is consistently delivering errors. After checking for obvious problems, she decides to open the computer with the help of some coworkers.
Much to their surprise, the errors are caused by an actual bug. As it turns out a moth had flown into the computer and got stuck between the relays of the machine, causing it to malfunction.
Much like Grace and her coworkers, you are most likely dealing with bugs in code on a regular basis as a software engineer. From time to time, you too will run into some real head-scratchers. And to make matters worse, management could be breathing down your neck to get it fixed asap if users are running into problems on a larger scale.
In these cases, it can happen to anyone that panic sets in and they just start throwing possible fixes at the problem until something sticks. However, it is important to take a deep breath, keep calm, and follow the 4 steps outlined below instead. For cooler heads will prevail.
Before you scramble and try to fix the bug as soon as possible, make sure you can actually reproduce it first.
If possible, reproduce it on the environment where it was originally reported to make sure it wasn’t a fluke or user error.
If it was originally discovered in the production environment, you might not want to try to reproduce it there so you don’t add test data to the actual user data. If so, make sure you get as much information as possible about the circumstances that cause the bug to happen.
Next, try to reproduce it on a local copy of the app. If this is not possible on a clean installation, it might be because of one or more differences in:
- configuration (config files, env variables, …)
- infrastructure (caching, load balancer, software versions, …)
- persisted data
If you can only reproduce the bug by tweaking your local copy in one of the three ways above, that’s usually a good indicator of the cause!
Otherwise, start using whatever tools you have at your disposal to track down the cause of the bug. Some examples include:
- Log files from your local copy and/or the environment where the bug was spotted. Consider adding more logging if the bug is hard to trace.
- Stack traces in errors that show up in the UI or API responses.
- Debug statements like var_dump() and console.log() if all else fails.
A complementary approach to these tools is a diagnosis by exclusion: Verify that specific layers work as intended so you can narrow down the search to specific layers of the application.
If you have a timeframe of when the bug has started occurring, you can also look at recent changes to the codebase (assuming you’re using version control).
If the bug only happens on one or more specific branches, you’re in luck. In that case, you can probably find it by carefully going over the diff with the main branch.
3. Fix & Verify
Congratulations, you found the cause of the bug and fixed it!
Now verify that it is actually fixed and, more importantly, that your fix didn’t break anything else. Also, make sure that your fix not only works on clean installations but with previously persisted data as well.
4. Consider automated tests
At madewithlove, one of the first suggestions we make to every team we help out is to start using automated tests.
Automated tests can help you every step of the way.
- Tests can reproduce the bug in an isolated and repeatable way.
- Other tests can make sure specific layers do work as intended so you can exclude them from the possible causes.
- And once you’ve fixed the bug, tests can help verify that nothing else is broken by the change.
Of course the more tests you have the better they will be able to perform this last check. So if there’s one piece of advice we’d love you to take away from this, it’s this.