I’ve just read an interesting chapter by Marc Alvidrez on Google’s Site Reliability Engineering website, all about Embracing Risk. I’m going to summarise the main points I learnt from it - and there are a few - but I would encourage you to read the original and the rest of the book; it’s definitely worth it.
TLDR: Sometimes it’s better to not have too much reliability because it will hamper innovation and could cost too much.
Here are my notes:
- Unreliable systems erode confidence, so system failure should be avoided.
- An incremental improvement in reliability does not bring a linear increase in cost; the next increment could cost as much as 100x more than the last.
- Costs are one of two things:
  - Redundant hardware.
  - The opportunity cost of not working on new features.
- The benefit of extra reliability may be outweighed by the cost of not being able to work on new features.
- Services should be reliable enough, but no more than they need to be.
- If a reliability target of 99.99% is desired, Google want to exceed it, but not by much.
- A system with a target of 99.99% can be down for up to 52.56 minutes in a year and still meet its target.
- Google doesn’t measure availability in terms of downtime, but in terms of request success rate, because its services are used across many timezones.
- However, not all requests are equal. A service ping does not carry the same weight as a user sign-up request!
- Targets for availability depend on a service’s positioning and the function it provides. E.g. Google Calendar must have a higher availability because enterprises depend on it, whereas YouTube in its early days could accept a lower one, since it was more consumer-oriented.
- Other availability target factors include:
  - Types of Failure: “Which is worse for the service: a constant low rate of failures, or an occasional full-site outage?” If user data exposure is a possibility, a full-site outage might be more prudent.
  - Costs: Operating a service at one more 9 of availability can be compared against the increase in revenue gleaned from that improvement. If the revenue is less than the cost, it may be worth remaining with a lower level of availability.
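To make the arithmetic in these notes concrete, here is a minimal Python sketch of three of the ideas above: the yearly downtime a target allows, availability measured as request success rate, and the value of one more 9. The function names and all the example figures (request counts, revenue, cost) are my own assumptions for illustration, not taken from the chapter. The last function also assumes revenue scales linearly with availability, which is a simplification.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; ignores leap years


def allowed_downtime_minutes(target: float) -> float:
    """Minutes per year a service can be down and still meet `target`."""
    return (1.0 - target) * MINUTES_PER_YEAR


def request_availability(successful: int, total: int) -> float:
    """Availability measured as the fraction of requests that succeeded."""
    if total <= 0:
        raise ValueError("total must be a positive number of requests")
    return successful / total


def availability_gain_value(revenue: float, current: float, proposed: float) -> float:
    """Extra revenue from moving between availability levels.

    Assumption: revenue is directly proportional to availability.
    """
    return revenue * (proposed - current)


# A 99.99% target leaves roughly 52.56 minutes of downtime per year.
print(round(allowed_downtime_minutes(0.9999), 2))

# Hypothetical traffic: 50 failures in a million requests still meets 99.99%.
print(request_availability(999_950, 1_000_000) >= 0.9999)

# Hypothetical figures: 1,000,000 of revenue, moving from 99.9% to 99.99%,
# with an assumed cost of 5,000 for the extra 9.
gain = availability_gain_value(1_000_000, 0.999, 0.9999)
print(gain > 5_000)  # False here: the extra 9 costs more than it earns
```

With those made-up numbers the extra 9 is worth about 900, so it would not justify a 5,000 spend. Real services would also need to weight requests differently, since a ping and a sign-up are not equal.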