From Failing to Fixing: Time-Saving Approaches to Debugging

Something is wrong with your software. Your customers start to see a blank screen where they should see your app. Not exactly what you wished for. You jump to GitHub, revert your last commit, and create a new release. It doesn't help. What now?

I have seen people freeze in such situations. They are usually unprepared for problems they have not experienced yet. But with the right approach, it is possible to confront the unknown effectively.

Based on the infinite monkey theorem, we can assume that anything is fixable in an unlimited amount of time. If a poor animal can write "Macbeth" by randomly typing on a typewriter keyboard, you can randomly select places in the code and tweak them until the problem is solved. I hope no one forces an animal to type for that long, but I also don't want you to spend more time than needed to find a bug.

By following the optimal thinking process, you can save lots of time.

It's most probably not random.

You have to believe that there is a cause. Debugging is all about gathering data and prioritizing ideas. There has to be something causing the problem, and your job is to find it. Do you know where to start? Do you have any clues about what it might be? It may be dangerous to follow the first idea if you don't evaluate different possibilities. Simply comparing potential ideas gives you an opportunity to think more creatively. If you have just one idea, it is by default the most promising one. If you have more, you can compare the effort needed to validate each hypothesis about what the culprit may be.

How do you generate those ideas if you have no clue? Narrow the possibilities.

Ask yourself a couple of questions.

Was the bug always there?

Do you know when the problem started to happen? Can you figure out the timeline of changes introduced around that time? Can any of your monitoring tools help you determine that? During my time at Typeform, it was Rollbar.js that provided the data on when a JS error started to occur. But most of the time, it was not enough. How about your test suite then? Unit, e2e? Is it still green, passing everywhere? If it's hard to learn when someone first discovered the bug, maybe it was a fault in the initial design. The bug was always with you. You may simply not have been aware of the problem until somebody finally noticed it.

You can strongly narrow the search once you know when exactly the application broke: search through the change history around that time to identify suspicious changes.
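
When you know one version that worked and one that is broken, the search through the change history can be done by halving, which is what `git bisect` automates for commits. Below is a minimal sketch of the idea in Python. The release names and the `is_broken` check are hypothetical stand-ins; in practice the check would be a manual test or a script that reproduces the bug. It assumes that once the bug appears, every later change is broken too.

```python
from typing import Callable, Sequence


def first_bad_index(changes: Sequence[str], is_broken: Callable[[str], bool]) -> int:
    """Return the index of the first broken change.

    Assumes changes[0] is known to be good, changes[-1] is known to be
    broken, and the app stays broken after the faulty change.
    """
    low, high = 0, len(changes) - 1  # low: known good, high: known broken
    while high - low > 1:
        mid = (low + high) // 2
        if is_broken(changes[mid]):
            high = mid  # the fault was introduced at mid or earlier
        else:
            low = mid  # the fault was introduced after mid
    return high


# Hypothetical release list; the lambda stands in for a real reproduction step.
releases = ["v1.0", "v1.1", "v1.2", "v1.3", "v1.4", "v1.5"]
culprit = releases[first_bad_index(releases, lambda release: release >= "v1.3")]
print(culprit)
```

With six releases, the check runs only twice instead of once per release, and the saving grows quickly with a longer history.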

Where can an error happen?

What services are involved in the flow where the error happens? What does the infrastructure serving the app look like? Do you know how the data flows in your application? Find someone who knows to learn more about it. Maybe it's a good time to solve the problem together?

Do you have external dependencies that may have generated a bug? In one case, the bug appeared due to a change not in our codebase, not in our infrastructure, but in Google Tag Manager. It was a door for multiple other third-party scripts. Each piece of the code came from a different domain. Each was a potential place where something could break.

What is the area of impact?

Another way to form a stronger hypothesis is to ask about the area of impact. If the issue happens everywhere, first look at the elements that can affect your app on such a scale. If only a single area is experiencing the problem, then your goal is to answer the question: what makes it so different?

If you have a state where everything is fine, you have the opportunity to compare it with the state where it's broken. You can start analyzing particular elements of your app. Think about how likely it is that the shared logic is faulty when you experience the issue in just one place. Understanding where NOT to look is as essential as knowing where to look. Sometimes you will get to the bottom of things by relentlessly filtering out options that do not make sense.
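
Comparing a working state with a broken one can be as simple as listing what differs between them. The sketch below assumes you can describe each state as a flat dictionary; the keys and values are made up for illustration.

```python
def diff_states(working: dict, broken: dict) -> dict:
    """Return {key: (working_value, broken_value)} for every key that differs."""
    keys = working.keys() | broken.keys()
    return {
        key: (working.get(key), broken.get(key))
        for key in sorted(keys)
        if working.get(key) != broken.get(key)
    }


# Hypothetical descriptions of one session that works and one that fails.
working = {"browser": "Chrome", "locale": "en", "bundle": "1.4.2"}
broken = {"browser": "Safari", "locale": "en", "bundle": "1.4.2"}

differences = diff_states(working, broken)
print(differences)
```

Everything that is identical in both states drops out of the picture, which is exactly the "where NOT to look" list.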

Once upon a time, our team received a ticket saying that opening one page crashed the browser. I started removing big chunks of code until the browser stopped crashing. I found one string that contained a large text. How could a string crash the browser? I didn't know at the time. I removed half of the text, and with each halving, I got closer to learning which half contained the problem. I repeated the same steps with what was left. I went from a page to a paragraph, to a sentence, to a symbol. Safari broke because it tried to display an em dash (—) from a Microsoft Word document.
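
The halving from this story can be written down in a few lines. This is only a sketch of the technique, not the code used at the time: the `crashes` function is a hypothetical stand-in for "load this text in the browser and see if it crashes", and it assumes a single character is responsible.

```python
from typing import Callable


def find_offending_char(text: str, crashes: Callable[[str], bool]) -> str:
    """Halve the text until only the character that triggers the crash is left.

    Assumes crashes(text) is True and exactly one character causes it.
    """
    while len(text) > 1:
        middle = len(text) // 2
        first_half, second_half = text[:middle], text[middle:]
        # Keep whichever half still reproduces the problem.
        text = first_half if crashes(first_half) else second_half
    return text


# Hypothetical page content; "\u2014" is the em dash.
page = "Quarterly results \u2014 draft for review"
offender = find_offending_char(page, lambda chunk: "\u2014" in chunk)
print(repr(offender), hex(ord(offender)))
```

Printing the code point next to the character helps when the culprit is a symbol that looks harmless on screen.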

What do you know so far?

Write down what you learned about the bug. It's easy to forget the details and start running in circles. Sharing what you know with other people also helps them avoid duplicating work. After analyzing all the data, you should have a list of ideas on where to debug next.

Now it's time to prioritize them.

Prioritizing ideas based on effort and probability

Estimating effort and probability enables you to order the ideas and concentrate on the most promising one at a time. Your best chance is to start with the most likely hypothesis that requires the least effort. If it does not provide you with the answer, proceed to the next one. With every step, you should either add another piece to the puzzle or eliminate something that does not fit.

Be cautious before committing to a long-running effort. Analyzing the situation again from the beginning might reveal information that you missed before.
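
One simple way to order such a list is to score each idea by its estimated probability divided by its estimated effort. The ratio is my assumption here, one of many possible scoring rules, and the hypotheses and numbers below are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    description: str
    probability: float   # rough guess (0 to 1) that this is the culprit
    effort_hours: float  # rough guess of the time needed to confirm or rule it out


def prioritize(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    """Order hypotheses so that likely and cheap-to-check ideas come first."""
    return sorted(
        hypotheses,
        key=lambda h: h.probability / h.effort_hours,
        reverse=True,
    )


# Hypothetical list of ideas with gut-feeling estimates.
ideas = [
    Hypothesis("A third-party script changed", probability=0.5, effort_hours=2.0),
    Hypothesis("The recent deploy broke routing", probability=0.3, effort_hours=0.5),
    Hypothesis("The CDN serves stale assets", probability=0.1, effort_hours=4.0),
]

for hypothesis in prioritize(ideas):
    print(hypothesis.description)
```

The estimates do not need to be precise. Writing them down is mostly a way to force a comparison before you commit to one path.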

What if you don't know much about the bug?

If your current situation does not provide enough data to figure out what is happening, it's best to implement something that can reveal it. Can you add some additional logging, tracing, or metrics? Can you capture more detail? If the issue persists, can you provide an easy way for the customers to report it?

Then follow the same steps from the beginning, but this time better equipped to solve the problem.

Collect data, prioritize ideas, execute, and repeat
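
As an illustration of additional logging, here is a minimal sketch using Python's standard `logging` module. The `load_profile` function, the `fetch` callable, and the logger name are hypothetical; the point is to record the input before the risky step and the full traceback when it fails.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("profile")


def load_profile(user_id, fetch):
    """Call fetch(user_id) and log enough context to reconstruct a failure later."""
    logger.info("loading profile user_id=%s", user_id)
    try:
        profile = fetch(user_id)
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("profile fetch failed user_id=%s", user_id)
        raise
    logger.info("profile loaded user_id=%s fields=%s", user_id, sorted(profile))
    return profile


# Hypothetical data source standing in for a real service call.
profile = load_profile(7, lambda user_id: {"id": user_id, "name": "Ada"})
```

Re-raising the exception keeps the behavior of the application unchanged; only the amount of evidence you collect grows.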

That's it!

Giving your thinking some structure will immensely improve how you solve problems. I have noticed the same pattern over and over, for many years and for every kind of bug. For sure, I have not had the chance to encounter all of them, but I hope the described flow can help to improve how you debug. In the long run, how we think about the error impacts our ability to find and solve it.