Is High Reliability Always Good?

I’ve just read an interesting chapter by Marc Alvidrez called Embracing Risk, which is available on Google’s Site Reliability website. I’m going to summarise the main points I learnt from it - and there are a few - but I would encourage you to read the original and the remainder of the book; it’s definitely worth it.

TLDR: Sometimes it’s better not to have too much reliability, because it can hamper innovation and cost too much.

Here are my notes:

  • Unreliable systems erode confidence, so system failure should be avoided.
  • An incremental improvement in reliability does not necessarily mean a linear increase in cost; the next increment could cost as much as 100x more than the previous one.
  • Costs are one of two things:
    • Redundant hardware.
    • The opportunity cost of not working on new features.
  • The benefit of developing extra reliability may be outweighed by the cost of not being able to work on new features.
  • Services should be reliable enough, but no more than they need to be.
  • If a reliability target of 99.99% is desired, Google wants to exceed it, but not by much.
  • A system with a target of 99.99% can be down for up to 52.56 minutes in a year and still meet its target.
  • Google doesn’t measure availability in terms of downtime, but in terms of request success rate, because its services need to work across many time zones.
  • However, not all requests are equal. A failed service ping does not matter as much as a failed user sign-up request!
  • Targets for availability depend on a service’s positioning and the function it provides. For example, Google Calendar needs a higher availability target because enterprises depend on it, whereas YouTube in its early days could accept a lower one, since it was more consumer-oriented.
  • Other availability target factors include:
    • Types of Failure: “Which is worse for the service: a constant low rate of failures, or an occasional full-site outage?” If a failure could expose user data, a full-site outage might be the more prudent option.
    • Costs: The cost of operating a service at one more 9 of availability can be compared against the extra revenue gained from that improvement. If the extra revenue is less than the cost, it may be worth staying at the lower level of availability.
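
To make the arithmetic in these notes a bit more concrete, here is a minimal Python sketch of three of the ideas: the yearly downtime a target allows, availability measured as a request success rate, and the comparison of an extra 9 against its cost. The function names are my own, the revenue and cost figures are invented for illustration, and the revenue model (extra revenue proportional to the availability gained) is a simplifying assumption rather than anything Google prescribes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(target: float) -> float:
    """Minutes per year a service can be down and still meet `target`.

    `target` is a fraction, e.g. 0.9999 for 99.99%.
    """
    return (1 - target) * MINUTES_PER_YEAR


def request_availability(successful: int, total: int) -> float:
    """Availability measured as the fraction of requests that succeeded."""
    if total <= 0:
        raise ValueError("total must be a positive number of requests")
    return successful / total


def extra_nine_is_worth_it(current: float, proposed: float,
                           annual_revenue: float, annual_cost: float) -> bool:
    """Compare the revenue gained from higher availability with its cost.

    Assumption: revenue scales in proportion to availability, so the gain
    is (proposed - current) * annual_revenue.
    """
    revenue_gain = (proposed - current) * annual_revenue
    return revenue_gain > annual_cost


if __name__ == "__main__":
    # A 99.99% target leaves roughly 52.56 minutes of downtime per year.
    print(f"{allowed_downtime_minutes(0.9999):.2f} minutes per year")

    # Hypothetical traffic: 9,999 successes out of 10,000 requests.
    print(f"{request_availability(9_999, 10_000):.4%} request success rate")

    # Hypothetical figures: moving from 99.9% to 99.99% on 1,000,000 of
    # annual revenue gains about 900, which does not cover a cost of 5,000.
    print(extra_nine_is_worth_it(0.999, 0.9999, 1_000_000, 5_000))
```

This treats every request as equal, which the notes above point out is not really true; weighting requests by importance would be a natural next step.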

Hi! Did you find this useful or interesting? I have an email list coming soon, but in the meantime, if you read anything you fancy chatting about, I would love to hear from you. You can contact me here or at stephen ‘at’ logicalmoon.com