
Cloud Outages, and Why *You* Have To Design For Failure

I haven’t posted for a while because I have been thinking…a lot…about cloud computing, inevitable data center outages, and what they mean for application architectures. Try as I might to put the problem on the cloud providers, I keep coming back to one bare fact: the cloud is going to expose a lot of the shortcomings of today’s distributed architectures, and this time it’s up to us to make things right.

It all started with some highly informative posts from the Data Center Knowledge blog chronicling outages at major hosting companies, and failures that helped online companies learn important lessons about scaling, etc. As I read these posts, the thought that struck me was, “Well, of course. These types of things are inevitable. Who could possibly predict every possible negative influence on an application, much less a data center?” I’ve been in enough enterprise IT shops to know that even the very best are prepared for something unexpected to happen. In fact, what defines the best shops is that they assume failure and prepare for it.

Then came the stories of disgruntled employees locking down critical information systems or punching the emergency power kill switch on their way out the door. Whether or not you are using the cloud, human psychology being what it is, we have to live every day with immaturity or even just plain insanity.

Yet, each time one of the big name cloud vendors has an outage–Google had one, as did Amazon a few times, including this weekend–there are a bunch of IT guys crying out, “Well, there you go. The cloud is not ready for production.”

Baloney, I say. (Well, I actually use different vocabulary, but you get the drift.) Truth is, the cloud is just exposing people’s unreasonable expectations for what a distributed, disparate computing environment provides. The idea that some capacity vendor is going to give you 100% uptime for years on end–whether they promised it or not–is just delusional. Getting angry at your vendor for an isolated incident or pooh-poohing the market in general just demonstrates a lack of understanding of the reality of networked applications and infrastructure.

If you are building an application for the Internet–much less the cloud–you are building a distributed software system. A distributed system, by definition, relies on a network for communication. Some years ago, Peter Deutsch and others at Sun postulated a series of fallacies that describe the pitfalls all distributed systems developers run into at one time or another in their careers. Hell, I still have to check my work against these each and every time I design a distributed system.

Key among these is the delusion that the network is reliable. It isn’t, it never has been, and it never will be. For network applications, great design is defined by the application or application system’s ability to weather undesirable states. There are a variety of techniques for achieving this, such as redundancy and caching, but I will dive into those in more depth in a later post. (A great source for these concepts is

Some of the true pioneers in the cloud realized this early. Phil Wainwright notes that Alan Williamson of Mediafed made what appears to be a prescient decision to split their processing load between two cloud providers, Amazon EC2/S3 and FlexiScale. Even Amazon themselves use caching to mitigate S3 outages on their retail sites (see bottom of linked post for their statement).
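To make the two ideas above concrete (spreading load across more than one provider, and caching to ride out an outage), here is a minimal sketch in Python. It is not how Mediafed or Amazon actually built their systems; the function names, the stub providers and the plain dictionary cache are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Dict, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider fails and no cached copy exists."""


def fetch_with_fallback(
    key: str,
    providers: Sequence[Callable[[str], bytes]],
    cache: Dict[str, bytes],
) -> bytes:
    """Try each provider in order; if all fail, serve the last cached value."""
    errors = []
    for fetch in providers:
        try:
            value = fetch(key)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
            continue
        cache[key] = value  # refresh the cache on every successful fetch
        return value
    if key in cache:
        return cache[key]  # possibly stale, but better than an outage
    raise AllProvidersFailed(f"nothing could serve {key!r}: {errors}")


# Hypothetical stand-ins for two independent storage providers.
def flaky_primary(key: str) -> bytes:
    raise ConnectionError("primary provider is down")


def healthy_secondary(key: str) -> bytes:
    return b"fresh:" + key.encode()


if __name__ == "__main__":
    cache: Dict[str, bytes] = {}
    # Primary is down, so the secondary answers and the cache is filled.
    print(fetch_with_fallback("report", [flaky_primary, healthy_secondary], cache))
    # Both providers down: the cached copy is served instead of an error.
    print(fetch_with_fallback("report", [flaky_primary, flaky_primary], cache))
```

The point of the sketch is the ordering of assumptions: every remote call is expected to fail sometimes, and the application decides in advance what a degraded but acceptable answer looks like.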

Michael Hickins notes in his E-Piphanies blog that this may be an amazing opportunity for some skilled entrepreneurs to broker failure resistance in the cloud. I agree, but I think good distributed system hygiene begins at home. I think the best statement is a comment I saw on ReadWriteWeb:

“People rankled about 5 hours of downtime should try providing the same level of service. In my experience, it’s much easier to write-off your own mistakes (and most organizations do), than it is to understand someone else’s — even when they’re doing a better job than you would.”

Amen, brother.

So, in a near future post I’ll go into some depth about what you can do to utilize a “cloud oriented architecture”. Until then, remember: Only you can prevent distributed application failures.

Greasing the skids…Simplifying Datacenter Migration

January 25, 2007

Here are a couple of fun buzzwords that have created all kinds of interesting headaches in IT of late: “rationalization” and “consolidation”. I’m not talking about servers here…I’ve covered that somewhat earlier. Instead, I’m talking about datacenter rationalization and consolidation.

This is a huge trend amongst Fortune 500 companies. In my work, I keep hearing VPs of Operations/Infrastructure and the like saying things like “we are consolidating from [some large number of] datacenters to [some small number, usually 2 or 3] datacenters.” In the course of these migrations, they are rationalizing the need for each application that they must migrate from one datacenter to another.

The cost of these migrations can be staggering. “Fork-lifting” servers from one site to another incurs costs in packaging, shipping and replacing damaged goods (hardware in this case). Copying an installation from one datacenter to another involves the same issues: packaging (how to capture the application at the source site and unpack it at the destination site), shipping (costs around bandwidth use or physical shipping to move the application package between sites) and repair of damaged goods (fixing apps that “break” in the new infrastructure).

What if something could “grease the skids” of these moves–reduce the cost and pain of migrating code from one datacenter to another?

One approach is to package your software payloads as images that are portable between hardware, network and storage implementations. Now the cost of packaging the application is taken care of, the cost of shipping the package remains the same or gets cheaper, and the odds of the software failing to run are greatly reduced because it is already prepared for the changing conditions of a new set of infrastructure.
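As a rough illustration of what “portable” means here, the sketch below separates a site-neutral image description from the per-datacenter details it is bound to at deployment time. Everything in it is an assumption made for the example: the class names, the fields and the resource identifiers do not come from any particular product.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class GoldenImage:
    """Site-neutral payload: it only names the resources it needs logically."""
    name: str
    required_volumes: Tuple[str, ...]
    required_networks: Tuple[str, ...]


@dataclass
class SiteProfile:
    """Per-datacenter mapping from logical names to concrete resources."""
    site: str
    volumes: Dict[str, str]
    networks: Dict[str, str]


def bind(image: GoldenImage, site: SiteProfile) -> Dict[str, Dict[str, str]]:
    """Resolve an image's logical needs against one site, or fail loudly."""
    missing = [v for v in image.required_volumes if v not in site.volumes]
    missing += [n for n in image.required_networks if n not in site.networks]
    if missing:
        raise ValueError(f"{site.site} cannot host {image.name}: missing {missing}")
    return {
        "volumes": {v: site.volumes[v] for v in image.required_volumes},
        "networks": {n: site.networks[n] for n in image.required_networks},
    }


# Hypothetical image and two hypothetical datacenters.
billing = GoldenImage("billing-app", ("data",), ("frontend",))
old_dc = SiteProfile("old-dc", {"data": "san-lun-17"}, {"frontend": "vlan-210"})
new_dc = SiteProfile("new-dc", {"data": "iscsi-vol-3"}, {"frontend": "vlan-42"})

if __name__ == "__main__":
    # The same image binds to either site; only the site profile changes.
    print(bind(billing, old_dc))
    print(bind(billing, new_dc))
```

The image itself never changes during the move. Migrating it is a matter of writing a profile for the destination site, and a missing resource shows up as an explicit error before anything is shipped.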

Admittedly, the solution here is more related to decoupling software from hardware than Service Level Automation, per se. But a good Service Level Automation environment will act as an enabler for this kind of imaging, as it too has to solve the problem of creating generic “golden” images that can boot on a variety of hardware using a variety of network and storage configurations. In fact, I have run into several customers in the last couple of months that have a) recognized this advantage and b) rushed to get a POC going to prove it out.

Of course, if you can easily move software images between datacenters, simpler disaster recovery can’t be far behind…