What Wake-up Call?
With physical security applications rapidly following the rest of IT into the cloud, you should understand what did (and did not) happen during last week’s outage of certain Amazon Web Services (AWS) cloud-based services. Needless to say, it stirred up a predictable serving of hand-wringing and I-told-you-so’s from technology and media critics. But does it really say anything about the future of cloud computing or its application to security?
The problems began on April 21, affecting companies that run their products on services hosted in AWS server farms in the data center Mecca of Northern Virginia. According to the New York Times, sporadic outages lasted for as long as two days, and nailed companies such as Foursquare, Reddit, Quora, and BigDoor, not to mention the federal Energy Department. Problems reported by customers included “being unable to access data, service interruptions and sites being shut down.”
This is pretty serious stuff in terms of lost revenue, let alone your ability to run a business at all. For physical security applications, service disruptions can be even more serious if they affect life and safety, or simply the ability to let your employees and customers into your place of business. Fortunately, most physical security systems using the cloud are, technically speaking, “hybrid” architectures where some parts of the system (usually embedded devices) run locally, on premises, and continue to do their jobs even when Internet connectivity is lost. At least the better ones are built that way.
But even for “pure cloud” companies, the AWS service issues did not have to be the livelihood-threatening experience that they were for many of them. Why? Because the companies that went down all committed the same basic engineering sin: ignoring the need for redundancy and running on only one system. Amazon itself parcels infrastructure into what it calls Availability Zones, which are “distinct locations that are engineered to be insulated from failures”. If you only want to pay to be hosted in one zone, then that’s all you get. If you want the higher resiliency that comes from hosting in multiple zones, then you just pay more. It’s simple: you get what you pay for. The same is true, by the way, for hosted physical security providers: some have only one data center, while others (like Brivo) have many.
Amidst all the chicken-little squawking about the Amazon outage, what usually got lost were the success stories. Among the publications that got it wrong, in my book, are Government Computer News, which reported that this is “one example of what can go wrong with cloud computing,” and CNN, with its alarmist headline, “Why Amazon’s cloud Titanic went down.” Fortunately, though, some got it right. Roger Strukhoff of Cloud Computing Journal says it all with: “It sounds like fake cloud computing was subbed in for the real thing.” And then there’s Information Week’s sober assessment that the “Amazon cloud outage proves importance of failover planning.”
Of all the stories, though, my favorite example of doing things the right way is the one about the federal Recovery.gov website. Under the leadership of executive director Mike Wood, the Recovery Accountability and Transparency Board (RATB) managed to avoid any downtime whatsoever, even though its principal infrastructure used the same AWS services that caused problems for others. How did they do it? The oldest system design principle in the book: redundancy. They had a copy of their services running in another Availability Zone elsewhere in the US. Other sophisticated Amazon cloud users like Netflix were spared downtime through the same technique.
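For readers who like to see the idea in code, here is a minimal Python sketch of what "a copy running in another zone" buys you. It is only an illustration, not how Recovery.gov or Netflix actually built their systems: the zone names, the handler functions, and `call_with_failover` are all hypothetical, and the "zones" are plain functions standing in for real services.

```python
class AllReplicasFailed(Exception):
    """Raised when no replica in any zone could serve the request."""


def call_with_failover(replicas, request):
    """Try each (zone_name, handler) pair in order; return the first success.

    `replicas` is a list of (zone_name, handler) tuples. Both the zone names
    and the handlers are hypothetical stand-ins for real deployments.
    """
    errors = []
    for zone, handler in replicas:
        try:
            return zone, handler(request)
        except ConnectionError as exc:
            # This zone is down; remember why and move on to the next copy.
            errors.append(f"{zone}: {exc}")
    raise AllReplicasFailed("; ".join(errors))


def failed_zone(request):
    """Simulates a zone that is in the middle of an outage."""
    raise ConnectionError("zone unavailable")


def healthy_zone(request):
    """Simulates a zone that is serving traffic normally."""
    return f"served {request}"


# One copy in the troubled zone, one copy somewhere else.
REPLICAS = [("zone-a", failed_zone), ("zone-b", healthy_zone)]

if __name__ == "__main__":
    zone, response = call_with_failover(REPLICAS, "GET /")
    print(zone, response)
```

With both copies listed, the request is quietly served from the second zone. Take the second entry out of `REPLICAS` and the same outage becomes an `AllReplicasFailed` error, which is roughly the position the single-zone companies found themselves in.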
The simple lesson here is that the cloud does not change the requirements for good system design. At the end of the day, as Larry Ellison has been so fond of pointing out, the cloud is made up of computers and processors and memory and hard drives. And these components inevitably fail from time to time, so you need to make sure that they are redundant. If you think that the laws of engineering have somehow been suspended in the cloud, you need to put down the Kool-Aid and slowly back up (literally, if you’ll excuse the pun). We’ve also published guidelines on this sort of thing before, including a list of 7 Requirements for SaaS Providers that includes “multiple, secure, disaster-tolerant data centers” right near the top.
So, to all the critics, like IDC’s Matthew Eastwood, who are claiming that “this is a wake-up call for cloud computing,” I say: it’s a wake-up call that most of us received about fifty years ago at the dawn of the computer era.
If you’re just getting that wake-up call now, you’ve really overslept.
Posted by Steve Van Till on May 03, 2011