Archive for March, 2012

The art of failure in cloud’s complex system

After reading Sidney Dekker’s insightful book, Drift into Failure, about evaluating failure in a complex systems environment, I’m currently fascinated by all of the ways cloud computing will challenge the “best practices” of the client-server era. Dekker explores disaster avoidance and evaluation in complex adaptive systems environments, where small changes in initial conditions lead to large variances in outcome, repeating individual actions rarely results in the same outcome system-wide, and pinpointing “causes” of failure is nearly impossible.

Put that into the context of IT. Think about all of the ways we have tried to lock down the environment in which an application runs, so that there are no variances in initial conditions (in theory—more on that later). For decades now, we’ve been trying to guarantee that each layer of the computing stack is an environment that is predictable, reliable, and stable.

Is cloud a poor foundation?

Unfortunately, cloud computing screws that up, big time. For clouds, resource pools create a combinatoric problem where even the most reliable individual components combine to give you a largely unstable system-wide infrastructure. If you have 1000 servers that are each 99.9% reliable on any given day (about 1 failure every 3 years per server), the chance of at least one server failing somewhere in the system on that day is 1-(.999^1000), or about 63.2%. That works out to a failure on roughly three days out of every five, and about one failure per day on average (assuming independent failures, and if my rusty math is right).

Even 99.98% reliability (about 1 failure every 10 years or so for each device) at the same scale still leaves you with about 1 failure every four or five days.
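For anyone who wants to check this kind of arithmetic, here is a minimal sketch in Python. It assumes that "reliability" means the probability that a single server gets through one day without failing, and that servers fail independently of one another; both are simplifying assumptions, and the function names are purely illustrative.

```python
def chance_of_any_failure(reliability: float, units: int) -> float:
    """Probability that at least one of `units` independent components
    fails during one period (assumed here to be one day)."""
    return 1.0 - reliability ** units


def expected_failures(reliability: float, units: int) -> float:
    """Average number of component failures per period."""
    return (1.0 - reliability) * units


for reliability in (0.999, 0.9998):
    p_any = chance_of_any_failure(reliability, 1000)
    per_day = expected_failures(reliability, 1000)
    print(
        f"{reliability:.2%} reliable x 1000 servers: "
        f"{p_any:.1%} chance of a failure today, "
        f"about one failure every {1 / per_day:.0f} day(s)"
    )
```

Under those assumptions, the 99.9% case gives roughly a 63% chance of at least one failure on any given day, and the 99.98% case roughly 18%.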

Now apply that math to the entire cloud infrastructure, services, applications and data environment. Your rate of something in the system failing will astound you. A rock solid foundation, rigid and unchanging, is impossible.
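To get a feel for how quickly this compounds, the same calculation can be stretched across several resource pools at once. The sketch below is only an illustration: the inventory and every reliability figure in it are hypothetical numbers made up for the example, and it again assumes independent failures measured per day.

```python
import math
from typing import Dict, Tuple


def chance_something_fails(pools: Dict[str, Tuple[int, float]]) -> float:
    """pools maps a pool name to (unit count, per-unit daily reliability).
    Returns the probability that at least one unit anywhere fails today."""
    chance_everything_survives = math.prod(
        reliability ** count for count, reliability in pools.values()
    )
    return 1.0 - chance_everything_survives


# Hypothetical inventory; none of these figures come from a real system.
stack = {
    "servers": (1000, 0.999),
    "disks": (4000, 0.9995),
    "switches": (50, 0.9998),
    "service instances": (300, 0.998),
}

print(f"Chance that something fails today: {chance_something_fails(stack):.1%}")
```

With made-up numbers like these, the chance that something in the system fails on a given day comes out well over 90%, which is the point: at this scale, failure is the normal operating condition.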

In fact, trying to build a rigid architecture for any software system in the cloud is asking for instability. As a recent post about the stability vs reliability tradeoff (in the context of economics) makes quite clear, stability doesn’t work out quite the way you’d expect in a complex system. So, software in the cloud has to be flexible, rather than rigid.

Cloud’s answer: resiliency

A resilient foundation, on the other hand, is quite achievable, if you treat the cloud as the complex system that it is. This is the key thing that Dekker has taught me so far. When designing applications, don’t concentrate on drilling down farther and farther in the design specs trying to make sure each function is perfectly designed and optimized for a static set of conditions.

Instead, think “up and out”. For example:

  • Design the application from components that are themselves designed to survive in an unstable systems environment.
  • Make sure every call out to an API assumes that API is untrustworthy (in the performance and availability sense—if not the security and functionality sense).
  • Make sure all service functions are written to handle even the most ridiculous inputs from future clients.
  • Put limits on the amount of time you’ll wait for a remote service to respond.
  • Have backup contingencies for data if the primary source of that data is unavailable—such as falling back to a cached value, if applicable.
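Two of the items above, limiting how long you wait and falling back to a cached value, can be sketched in a few lines of Python. This is one possible shape, not a recommended implementation: the `ResilientClient` class, its parameters, and the `flaky_fetch` function are all hypothetical names invented for the example, and the remote call is stood in for by a plain function.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict, Tuple


class ResilientClient:
    """Wraps a remote call with a deadline and a cached fallback."""

    def __init__(self, fetch: Callable[[str], str],
                 timeout_s: float = 2.0, max_stale_s: float = 300.0) -> None:
        self._fetch = fetch              # hypothetical remote call
        self._timeout_s = timeout_s      # longest we are willing to wait
        self._max_stale_s = max_stale_s  # oldest cached value we accept
        self._cache: Dict[str, Tuple[str, float]] = {}
        self._pool = ThreadPoolExecutor(max_workers=4)

    def get(self, key: str) -> str:
        future = self._pool.submit(self._fetch, key)
        try:
            # Stop waiting after timeout_s. The worker thread itself is
            # not interrupted; we simply give up on its answer.
            value = future.result(timeout=self._timeout_s)
        except (FutureTimeout, OSError) as exc:
            # Slow or unavailable: use the contingency instead.
            return self._from_cache(key, exc)
        self._cache[key] = (value, time.monotonic())
        return value

    def _from_cache(self, key: str, cause: Exception) -> str:
        entry = self._cache.get(key)
        if entry is not None and time.monotonic() - entry[1] <= self._max_stale_s:
            return entry[0]
        raise RuntimeError(f"no live or cached value for {key!r}") from cause


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_fetch(key: str) -> str:
        """Stand-in for a remote service that dies after one good call."""
        calls["count"] += 1
        if calls["count"] > 1:
            raise ConnectionError("service unavailable")
        return f"value-for-{key}"

    client = ResilientClient(flaky_fetch, timeout_s=0.5)
    print(client.get("user:42"))  # answered by the live service
    print(client.get("user:42"))  # service is down, answered from the cache
```

The design choice worth noticing is that the caller decides how long is too long and what counts as an acceptable stale answer, rather than trusting the remote service to behave. Whether a stale value is acceptable at all depends on the data, hence the "if applicable" above.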

This is the art of building components for a complex adaptive computing environment. Think about what can go wrong with the relationships between things with the same (or greater) fervor as what can go wrong with each individual thing.

That’s the essence of how complex adaptive systems will most radically change the “best practices” of IT: the center of availability design will have to shift from the vertical stack of a single unit of deployment (an application or a service) to the relationships between all elements of the total system being operated, as well as that system’s relationships with external factors such as other systems.

In future posts, I want to explore this topic in much more depth. However, to get you started, I highly recommend catching up on what Netflix is doing in this regard. Two posts in particular, “Making the Netflix API More Resilient” and “Fault Tolerance in a High Volume, Distributed System”, are especially practical if you are building a complex cloud application at scale.

Oh, and one other thing.  How fun is this going to be for enterprise architects…? 😉
