Post Mortem

Remember when the Cloud was down earlier this month? Google published a post-mortem and root cause analysis on their Cloud Status Blog.

Outages are usually perceived as failures, and corporate management in particular often looks for somebody to blame. There are certainly situations where tracking responsibility is necessary, especially when a product failure damages another party's property or even harms people.

However, an important aspect of any failure is understanding why it happened and using that insight to avoid future occurrences.

The cloud (and cloud products) often has little or no direct impact on the physical world, which allows for a more positive culture when it comes to failure and blame. Google has demonstrated such a positive approach in this situation.

In summary, Google’s OAuth service was down for 47 minutes on December 14th. Authentication and authorization are central to any kind of service, so all products were affected. The service itself failed because of misconfigured quotas for a database service, which resulted in read failures.
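
To make that failure chain concrete, here is a minimal toy sketch of how a misconfigured quota on a backing store can surface as an outage in everything that depends on authentication. It is not Google's actual architecture: the names `QuotaLimitedStore`, `AuthService`, and `open_product` are hypothetical and exist only for illustration.

```python
class QuotaExceeded(Exception):
    """Raised when a read would exceed the configured quota."""


class QuotaLimitedStore:
    """Toy key-value store (hypothetical) that charges each read against a quota."""

    def __init__(self, data, read_quota):
        self._data = dict(data)
        self._read_quota = read_quota
        self._reads_used = 0

    def read(self, key):
        # A quota misconfigured to zero rejects every read, even for valid data.
        if self._reads_used >= self._read_quota:
            raise QuotaExceeded(f"read quota of {self._read_quota} exhausted")
        self._reads_used += 1
        return self._data[key]


class AuthService:
    """Toy authentication service (hypothetical) backed by the store above."""

    def __init__(self, store):
        self._store = store

    def authenticate(self, user, token):
        try:
            expected = self._store.read(user)
        except QuotaExceeded:
            # The credentials may be fine, but they cannot be verified.
            return "unavailable"
        except KeyError:
            return "denied"
        return "ok" if token == expected else "denied"


def open_product(auth, user, token):
    """Any product that requires a login fails whenever authentication fails."""
    return auth.authenticate(user, token) == "ok"


healthy = AuthService(QuotaLimitedStore({"alice": "t1"}, read_quota=100))
broken = AuthService(QuotaLimitedStore({"alice": "t1"}, read_quota=0))

print(open_product(healthy, "alice", "t1"))
print(open_product(broken, "alice", "t1"))
```

In this sketch the stored data is intact in both cases; only the quota differs. That is why a single configuration mistake in a shared dependency can take down every product at once.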