I have recently been educating myself about the cloud and infrastructure as a service (IaaS). One issue that came up in my reading is the outage that Amazon EC2 suffered on April 21. There seem to be two schools of thought about this outage:
- This goes to show that you can’t trust the cloud to be 100% reliable.
- Of course the cloud is not 100% reliable; the prudent engineer assumes that it will fail and architects the system to be resilient to that failure.
I have been convinced by the second argument, mainly because I have been reading the Netflix tech blog. In one post, they describe why Netflix customers were unaffected by the Amazon outage even though Netflix has moved almost all of its operations onto EC2. It comes down to, in the words of one commentator, whether 99.95% availability should be read as saying that an outage is an extremely rare occurrence or as a guarantee that EC2 will be unavailable 0.05% of the time. Netflix took the second interpretation and architected its cloud usage to survive the outage.
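To make the second reading concrete, here is a minimal sketch of the arithmetic behind a 99.95% availability figure. The function names are my own, and the redundancy calculation assumes that replicas fail independently, which is a simplification that a real outage may not respect.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def expected_downtime_hours(availability: float,
                            period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of unavailability implied by an availability fraction."""
    return (1.0 - availability) * period_hours


def combined_availability(availability: float, replicas: int) -> float:
    """Availability of a system that is up if at least one replica is up.

    Assumption (illustrative only): replicas fail independently.
    """
    return 1.0 - (1.0 - availability) ** replicas


if __name__ == "__main__":
    yearly = expected_downtime_hours(0.9995)
    print(f"Expected downtime per year: {yearly:.2f} hours")
    print(f"Two independent replicas: {combined_availability(0.9995, 2):.8f}")
```

Read this way, 99.95% availability means planning for roughly 4.4 hours of unavailability per year, which is the kind of budget a system has to be designed around rather than hoped away.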
– Len Bass, SEI