Quality has a mythical status among founders, product people, and engineers alike. Whatever we ship has to be of the highest quality.

Clean code, adherence to product requirements documents (PRDs), an immaculate design, and a smooth user experience (UX)... Quality has many faces, but we all seem to agree that we can't risk shipping anything but the best. When a release introduces a bug or breaks existing functionality, that's generally considered a product sin.

But that's just one way to frame quality.

Fitness for purpose

Another way to look at it is "fitness for purpose". The quality of a product is the extent to which it solves a problem: does it do what it claims to do, reliably?

eBay used to be a great tool for finding and buying second-hand stuff in your neighbourhood. But it has developed a reputation for scams and discount retail instead of being a peer-to-peer marketplace. Even if eBay is well-designed and pretty bug-free, it's no longer fit for purpose: it has a quality problem.

Facebook Marketplace, with its clunky interface and terrible seller experience, solves that problem better. It is therefore of higher quality.

That's a very different way of looking at product quality and one we might apply to our SaaS products.

Is testing everything the right approach?

Common sense tells us that all bugs should be caught before they reach production. But, when quality is really bad, that mentality often paralyses a team's ability to change things.

The typical scenario is an old product with lots of technical debt that hasn't been maintained for years. It's full of bugs, and every little change breaks something else.

The team wants to refactor to get the product in better shape, but they know that such sweeping changes will introduce new bugs. So, the kneejerk reaction will be to "test everything" after the big refactor. That means getting to 100% code coverage or involving a QA team to go through the entire application. Since the developers don't know what will break, they want to cover everything.

That's an expensive, wasteful approach. You can't prove a negative: no amount of testing will guarantee that there are no bugs.

Deep down, the team and their leadership know this. They just want to test everything for peace of mind or, in a lot of cases, to cover their asses. Nobody wants to be the one who breaks production!

Reframing quality to avoid a stalemate

The result is a stalemate: nothing changes. Paying back the technical debt was already a huge endeavour, and testing everything makes it prohibitively expensive. So we do nothing instead.

An alternative, more pragmatic way of unblocking this is to reframe the question around quality. Instead of thinking about quality as catching all bugs before they reach production, we can think of it as "fit for purpose".

We're solving two problems at the same time: paying back tech debt and keeping the system stable. Those problems aren't necessarily at odds. It's not the refactoring that introduces this instability. The system is already brittle.

So, the better question is: How fast can we fix bugs when they occur?

All of a sudden, it's no longer about keeping the perfect quality score. It's about setting up a system to tackle instability as soon as it arises.

This unlocks a few pragmatic steps we need to take:

  • Communicate from the top that we will take a calculated risk with quality. We tell the team that it's OK to introduce the occasional bug as long as we can fix it quickly. While we expect the engineers to try their best to avoid outages, we won’t blame them should the service go down. 
  • Remove all gates that prevent the refactoring team from going live. Each team member should be able to push to production without friction, so bugs can be fixed as soon as they're found. Developers get the mandate to ship without sign-off from QA or Product.
  • Set up monitoring and alerting tools to detect these bugs before, or as soon as, users experience them. Tools like Sentry and centralized logs are a must-have (see the first sketch after this list).
  • Dedicate time for responding to instability right after the release. Since we can anticipate issues, we can plan for them. The weeks after the release of the big refactor aren't business as usual. Let's keep the team on standby rather than throwing feature work at them. That way, they can spend their days monitoring and intervening when necessary. We really want them to twiddle their thumbs for weeks after the go-live, as that means the changes didn't cause major bugs.
  • Communicate to your users that there will be a big engine upgrade under the hood and ask them to report any issues they find directly. While nobody likes outages, at least these are planned and announced. That feels a lot more professional.
  • Finally, after patching a bug, write a unit test that proves the feature works as expected (a minimal example follows below). This covers the most brittle parts of the code base, making it more robust over time.
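
To make the monitoring point concrete, here's a minimal sketch of error reporting with the Sentry Python SDK. The DSN, environment, release tag, and the import job are placeholders invented for illustration; swap in your own project's settings.

```python
import sentry_sdk

# Minimal Sentry setup. The DSN, environment and release below are placeholders;
# use your own project's values.
sentry_sdk.init(
    dsn="https://publicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    environment="production",
    release="big-refactor-1.0.0",   # tag errors to the refactor release
    traces_sample_rate=0.1,         # sample a fraction of transactions for performance data
)

def import_legacy_orders(path: str) -> None:
    """Hypothetical job touched by the refactor."""
    try:
        with open(path) as f:
            for line in f:
                ...  # parse and persist each order
    except Exception as exc:
        # Unhandled exceptions are reported automatically; explicit capture lets
        # us flag the error before re-raising so the job still fails loudly.
        sentry_sdk.capture_exception(exc)
        raise
```

Paired with an alert rule on new issues, this puts a bug in front of the team minutes after a user hits it, rather than days later via support tickets.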
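
And for the last point, here's what such a regression test might look like with pytest. The parse_price helper and the comma-separator bug are hypothetical, and the function is inlined to keep the sketch self-contained; the pattern is what matters: reproduce the reported failure first, then assert the fixed behaviour.

```python
# test_parse_price.py: hypothetical regression test added right after a bug fix.
import pytest

def parse_price(raw: str) -> float:
    """Hypothetical helper that used to crash on comma decimal separators."""
    return float(raw.replace(",", "."))

def test_parse_price_accepts_comma_separator():
    # The exact input from the user's bug report; this failed before the patch.
    assert parse_price("12,50") == 12.5

def test_parse_price_still_rejects_garbage():
    # Guard against the fix being too permissive.
    with pytest.raises(ValueError):
        parse_price("not a price")
```

Each patched bug adds one more test like this, so coverage grows exactly where the system has proven to be fragile.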

This approach isn't possible in every environment. Software for aeroplanes and medical devices has zero room for this kind of instability. But most SaaS products can tolerate some.

Thinking about quality in a black-or-white way can cripple your ability to move forward. Thinking in that grey area of "fit for purpose" might be a lot more productive.