After a highly intense year at work, Sarah has gone on a much-needed two-week holiday. Although Louise, her CTO, had assured her she should take the opportunity to recharge, Sarah had told her team she would be reachable by phone in an emergency.
Sarah lands in Vietnam on Monday evening; she is jetlagged. She goes straight to her hotel, and when she goes to bed, she realises she must have left her phone in the taxi. She knows she probably can't get it back, so she decides to sleep and deal with it tomorrow.
Sarah works for CalUp, an early-stage startup in the Netherlands. They plan to gear up for a seed round later in the year, and stability will be important for their investors.
In Amsterdam, Louise is making a coffee in the office when her phone buzzes. She sees several emails coming through to the company's support email. At the same time, she notices monitoring alerts coming through on Slack. She panics. The payment integration solution used within the application is failing.
She speaks with the other developers who realise an environment variable needs updating after the payment provider released a new API version. Nobody knows which of the 25 variables are safe to modify or what the new value should be. They have got this incorrect in the past, resulting in data loss.
After a tense discussion with the team, they decide to place an alert notifying users that transactions may fail. Louise emails their customers to apologise for a temporary issue with payment processing. Eight hours later, after many complaints from customers, Louise receives an email from Sarah explaining that she has lost her phone. Sarah is then able to walk the team through the required update.
This is the world of SPOFs (single points of failure), where crucial information for your company lives in the head of one individual.
What is a SPOF?
"A SPOF is a component within a system which, if fails, causes the entire system to fail." (Li, 2018).
A SPOF (single point of failure) can manifest in many different ways, but it is when a crucial aspect of your business is dependent on either an individual, a specific process or a system. If it is an individual, they are often the ones who have been with the company since its inception; they have been on the whole journey. Consequently, they have a complete understanding of the what and the why.
SPOFs can be an issue for startups because teams are usually small, and individuals often wear many hats. One thing that exponentially worsens a SPOF and the scale of the risk attached? A lack of good documentation.
Add documentation to the equation
"We just have too much on our plates right now. Once the release is out next month, we will focus on documenting everything."
That’s what the CTO told Sarah a month before she left for Vietnam. SPOFs are not isolated to a product's technical aspects, either. Let's explore other ways they could manifest.
The technical aspects
When only one developer understands a critical system or process, you are in a vulnerable position. This can take the form of custom scripts, undocumented APIs, or legacy systems that nobody else on the team understands. Let's fast-forward for a moment. If Sarah creates some custom functionality and then leaves the company, how is anyone on the team going to understand it fully? If documentation exists from the start, whoever replaces Sarah will be able to understand how decisions were made.
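One way to picture this for the environment variables in CalUp's story is a small registry that records what each variable is for and whether it is safe to change. This is a minimal sketch, not CalUp's actual setup: the variable names, fields, and the `undocumented` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvVarDoc:
    """Documentation for one environment variable (all fields are illustrative)."""
    name: str
    purpose: str
    owner: str
    safe_to_change: bool
    how_to_update: str


# Hypothetical registry; a real one would cover every variable the app reads.
REGISTRY = [
    EnvVarDoc(
        name="PAYMENT_API_VERSION",
        purpose="Version of the payment provider's API that the app calls.",
        owner="backend team",
        safe_to_change=True,
        how_to_update="Set to the version in the provider's release notes, then redeploy.",
    ),
    EnvVarDoc(
        name="DATABASE_URL",
        purpose="Connection string for the primary database.",
        owner="backend team",
        safe_to_change=False,
        how_to_update="Do not change without a reviewed migration plan.",
    ),
]


def undocumented(env_names, registry=REGISTRY):
    """Return the variable names in use that have no registry entry, sorted."""
    documented = {doc.name for doc in registry}
    return sorted(set(env_names) - documented)
```

Running a check like `undocumented(...)` against the list of variables the application reads, for example as part of a build, would surface any variable that still lives only in someone's head.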
The operational aspects
As with code, important product knowledge, business processes, or customer relationships can depend heavily on an individual. This creates risk if these individuals become unavailable. CalUp has Jessica as its CFO. How will the company process the end-of-month financials if she has a family bereavement? What would happen if Fred, their Head of Product, had an accident? He is the only one who has the roadmap stored. The problem is that it is stored in his head.
What is the impact of a SPOF?
For CalUp, the impact was a day's worth of lost transactions, damaged customer relationships and a lot of stress.
Some of those relationships could be unrecoverable. Other impacts could include an investor pulling out or reducing the company's valuation. It could lead to delays in releases, overtime, resignations from your team members, and emergency consulting costs. It also means handing opportunities to the competition.
When SPOFs exist, critical moments for a company simply become more stressful by default. For the team, it is no fun being dependent on an individual, and for the individual, it is unnecessarily stressful to be so relied on.
Mitigating the issue with a documentation culture
The key for the business is reducing risk. If vital information is documented, the overall risk of the issues becomes lower. This means that if a key component does fail, the organisation can quickly and elegantly recover.
You may be wondering where to start and what to document. Split up the documentation into technical, operational and product.
Technical documentation should include more than simple setup instructions. Architectural diagrams, a glossary of domain terminology, dependencies, and recovery procedures should be included. Failure scenarios can be written up along with the escalation pathways and who is responsible for each. API specifications are shared whenever engineers collaborate, so they should be documented too; a tool such as Swagger can help here.
Internal product documentation should record decisions, troubleshooting guides, and workflows. The product vision should be clearly articulated for the team to refer to. Roadmaps, the context around them, and known technical debt are additional aspects to include.
Operational documentation should have enough detail for an individual to onboard themselves without relying too much on others. This should include step-by-step guides for key processes, role-specific responsibilities, access to the necessary tools and systems, troubleshooting procedures, and company policies.
The best time to address a SPOF is before a critical issue arises, such as the one at CalUp. Documenting early means knowledge is preserved. When the documentation is there, everyone, current or future, can fully understand something, which positively impacts knowledge sharing and continuous learning. You also reduce the number of questions the team needs to ask and, particularly for juniors, speed up the process of becoming more independent and valuable to the company.
When Sarah tells Dave something and Dave then tells Lauren, his junior colleague, the message probably changes in the retelling. As we work to prevent knowledge silos, we safeguard essential information from being lost or miscommunicated. When knowledge held by an individual is documented, it becomes an asset to the company. When the information is centralised, it is available to everyone twenty-four hours a day, which reduces the time lost searching for it. That becomes critical in a moment of crisis.
It is unrealistic to expect perfection from documentation, especially in an early-stage company, but if the mindset is one of growth and evolution rather than one-off completion, the company gradually becomes more robust. Ensure that documentation itself does not become a SPOF by undertaking regular review cycles. Don't make it an afterthought.
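A regular review cycle can be as simple as recording when each document was last reviewed and flagging the ones that are overdue. The sketch below assumes a hypothetical mapping of document titles to last-review dates and an arbitrary 90-day threshold; neither comes from CalUp's process.

```python
from datetime import date, timedelta


def stale_documents(last_reviewed, today, max_age_days=90):
    """Return titles whose last review is older than max_age_days, oldest first."""
    cutoff = today - timedelta(days=max_age_days)
    overdue = [
        (reviewed, title)
        for title, reviewed in last_reviewed.items()
        if reviewed < cutoff
    ]
    return [title for reviewed, title in sorted(overdue)]


# Illustrative data only: titles and dates are made up.
reviews = {
    "Payment integration": date(2025, 1, 10),
    "Deploy runbook": date(2024, 11, 1),
    "Onboarding guide": date(2025, 5, 1),
}
overdue_titles = stale_documents(reviews, today=date(2025, 6, 1))
```

The threshold is a judgment call: a fast-changing integration may need a shorter cycle than a company policy document.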
How did CalUp learn from their experiences?
The company referred to the incident as 'The phone loss incident'. Although it was a bad experience for the company, they focused on turning it into something positive. They instigated documentation Fridays, an initiative where the team spent Friday afternoons creating documentation. Roadmaps and the product vision were shared throughout the company, and everyone knew where to find them. The developers added videos for critical aspects of the code. Jessica, the CFO, added step-by-step guides for processes, and Sarah added ADRs (architecture decision records) to ensure the context behind decisions was documented.
Although the documentation grew gradually, it quickly had an impact on the company. Soon, templates were ready for different requirements, and with everyone on the team contributing, collaboration improved. The company was sure to review the documentation regularly, too. After all, out-of-date documentation is almost as bad as no documentation. You are essentially sending someone on a chase for false knowledge. Lastly, create guidelines on what should be documented and by whom. As the documentation grows, analytics are helpful to see which pages are used the most or least.
The role of leadership
When you are ready to create a documentation culture, you need to ensure the team doesn't view it as a lower priority than other work. Explore who the team usually depends on for answers to questions and what could break if they left. This should expose what knowledge exists only in the heads of individuals. Create policies around the documentation. If you view it as creating the foundation, it can be built upon as you go. Prevent gatekeepers and push.
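The exercise of asking who the team depends on can be captured in a simple map of knowledge areas to the people who understand them. The following is a minimal sketch with a hypothetical `single_owner_areas` helper and made-up data based on the people in the CalUp story.

```python
def single_owner_areas(knowledge_map):
    """Map each area understood by exactly one person to that person."""
    return {
        area: next(iter(people))
        for area, people in knowledge_map.items()
        if len(people) == 1
    }


# Illustrative data only: areas and names are assumptions for the example.
knowledge = {
    "payment integration": {"Sarah"},
    "month-end financials": {"Jessica"},
    "deployments": {"Sarah", "Dave"},
}
at_risk = single_owner_areas(knowledge)
```

Every entry in `at_risk` is a candidate for the next round of documentation.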
When we perform our due diligence audits, we see many companies with poor documentation. This can decrease a valuation.
When CalUp reached their investment round, the investor explained that to invest in early-stage companies, they have to evaluate the operational efficiency, the strength of the team, and their potential for scaling. They went on to say that their experience with CalUp was not like that of the companies they often see. They felt comfortable that what CalUp was creating was being built to scale, and the documentation culture and maturity de-risked their investment.
Conclusion
Documentation means the team can remain aligned yet work individually, making the company more robust. If the investment round for CalUp had occurred eight months prior to 'The phone loss incident', the company's valuation could have been reduced. These SPOFs are by no means isolated to startups; in all cases, due diligence processes explore these risks. If it had been discovered that the entire codebase was handled by such a small team with no documentation available, a startup that sounded promising would quickly have become a risk.