The art of failure in cloud’s complex system

After reading Sidney Dekker’s insightful book, Drift into Failure, about evaluating failure in complex systems environments, I’m currently fascinated by all of the ways cloud computing will challenge the “best practices” of the client-server era. Dekker explores disaster avoidance and evaluation in complex adaptive systems environments, where small changes in initial conditions lead to large variances in outcome, where repeating individual actions rarely produces the same outcome system-wide, and where pinpointing the “causes” of failure is nearly impossible.

Put that into the context of IT. Think about all of the ways we have tried to lock down the environment in which an application runs, so that there are no variances in initial conditions (in theory—more on that later). For decades now, we’ve been trying to guarantee that each layer of the computing stack is an environment that is predictable, reliable, and stable.

Is cloud a poor foundation?

Unfortunately, cloud computing screws that up—big time. For clouds, resource pools create a combinatoric problem in which even the most reliable individual components combine to give you a largely unstable system-wide infrastructure. If you have 1,000 servers that are each 99.9% reliable (about 1 failure every 3 years), the resulting chance of a server failure somewhere in the system as a whole is 1-(.999^1000), or about 63.2%, or roughly 3 failures every five days (if my rusty math is right).

99.98% reliability (about 1 failure every 10 years for each device) at the same scale still leaves you with about 1 failure every four days.
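
If you want to check that arithmetic yourself, here is a minimal Python sketch of the combinatorial math above (the fleet size and reliability figures are just the illustrative numbers from this post):

```python
# Probability that at least one of n independent components fails,
# given each component's reliability over some period.
def chance_of_any_failure(reliability, n):
    return 1 - reliability ** n

# The illustrative numbers from this post: a pool of 1,000 servers.
for reliability in (0.999, 0.9998):
    p = chance_of_any_failure(reliability, 1000)
    print(f"{reliability:.2%} reliable x 1,000 servers -> "
          f"{p:.1%} chance that something, somewhere has failed")
```

Run it and the 99.9% case comes out to about 63%; the 99.98% case still leaves you with roughly an 18% chance of a failure somewhere in the pool over the same period.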

Now apply that math to the entire cloud infrastructure, services, applications and data environment. The rate at which something, somewhere in the system fails will astound you. A rock-solid foundation, rigid and unchanging, is impossible.

In fact, trying to build a rigid architecture for any software system in the cloud is asking for trouble. As a recent post about the stability vs reliability tradeoff (in the context of economics) makes quite clear, stability doesn’t work out quite the way you’d expect in a complex system. So, software in the cloud has to be flexible, rather than rigid.

Cloud’s answer: resiliency

A resilient foundation, on the other hand, is quite achievable, if you treat the cloud as the complex system that it is. This is the key thing that Dekker has taught me so far. When designing applications, don’t concentrate on drilling down farther and farther in the design specs trying to make sure each function is perfectly designed and optimized for a static set of conditions.

Instead, think “up and out”. For example:

  • Design the application from components that are themselves designed to survive in an unstable systems environment.
  • Make sure every call out to an API assumes that API is untrustworthy (in the performance and availability sense—if not the security and functionality sense).
  • Make sure all service functions are written to handle even the most ridiculous inputs from future clients.
  • Put limits on the amount of time you’ll wait for a remote service to respond.
  • Have backup contingencies for data if the primary source of that data is unavailable—such as falling back to a cached value, if applicable (see the sketch just after this list).
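
To make the last two bullets concrete, here is a rough Python sketch. Everything in it (the service URL, the fetch_profile function, the cache) is invented for illustration; the point is the shape of the call: a bounded wait, a validated response, and a cached fallback when the dependency misbehaves.

```python
import json
import urllib.request

_cache = {}  # last known-good responses, keyed by user id

def fetch_profile(user_id, timeout_seconds=2.0):
    """Call a hypothetical remote profile service defensively."""
    url = f"https://profile.example.com/users/{user_id}"  # made-up endpoint
    try:
        # Never wait forever for a remote service to respond.
        with urllib.request.urlopen(url, timeout=timeout_seconds) as resp:
            data = json.load(resp)
    except (OSError, ValueError):
        # URLError, socket timeouts and malformed JSON all land here:
        # treat the API as untrustworthy and fall back to the cache.
        return _cache.get(user_id)  # may be None; callers must handle that too
    # Don't trust the shape of the response, either.
    if not isinstance(data, dict) or "name" not in data:
        return _cache.get(user_id)
    _cache[user_id] = data
    return data
```

The Netflix posts referenced at the end of this article describe a much more complete version of the same ideas (timeouts, fallbacks, and isolation) at production scale.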

This is the art of building components for a complex adaptive computing environment. Think about what can go wrong in the relationships between things with the same (or greater) fervor as you think about what can go wrong with each individual thing.

That’s the essence of how complex adaptive systems will most radically change the “best practices” of IT: the center of availability design will have to shift from the vertical stack of a single unit of deployment (an application or a service) to the relationships between all elements of the total system being operated, including that system’s relationships with external factors such as other systems.

In future posts, I want to explore this topic in much more depth. However, to get you started, I highly recommend catching up on what Netflix is doing in this regard. Two posts in particular, “Making the Netflix API More Resilient” and “Fault Tolerance in a High Volume, Distributed System”, are especially practical if you are building a complex cloud application at scale.

Oh, and one other thing.  How fun is this going to be for enterprise architects…? 😉


Five resources for learning complex adaptive systems

February 21, 2012

As I get started contributing more regularly to this blog, I wanted to give everyone a baseline for understanding what I mean when I talk about complex adaptive systems (or CAS). It’s not an easy subject to master—not because the core concepts are difficult, but because the effects of those concepts on the world around us are so rich, varied and profound.

So I wanted to quickly make everyone aware of a few resources that I consider great ways to get in touch with the core of CAS. Some of them are relatively passive—you can just read or watch them at your leisure. Others are quite interactive, and you can use them to get “hands-on” experience. They are the most engaging resources I’ve found to date.

  1. Complexity: The Emerging Science At The Edge Of Order And Chaos, by M. Mitchell Waldrop – At this point, this book might be called a classic, given that it was originally published in 1992, but it is still the most engaging introduction to the multi-disciplinary beginnings of complexity theory that I am aware of. By telling the story of the famed Santa Fe Institute, and how key members got interested in both the beginnings of complexity theory and the institute itself, Waldrop manages to introduce newcomers to the subject in a fashion that is both intelligent and entertaining. A must-read if you don’t know much of anything about complexity theory and CAS.
  2. Complex Adaptive Systems: An Introduction to Computational Models of Social Life, by John H Miller and Scott E Page – This is an excellent guide to the computational models that scientists are using to understand and experiment with complex adaptive systems. Chapter after chapter introduces you to models both simple and complex, with enough information about how the models were formed that you can create them yourself in the programming environment of your choice. However, even if you don’t choose to recreate the models, it is an eye-opening read.
  3. TUDelft SPM 4530 and SPM 9555 – “Agent Based Modeling of complex energy and industrial networks” and “Agent Based Modeling of Complex Adaptive Systems – Advanced” – I originally found this course material via iTunes, but this Wiki page is the official home of a freely available course delivered by Igor Nikolic, Assistant Professor in the Energy and Industry group, Faculty of Technology, Policy and Management, Delft University of Technology. You can step through each of his lectures, as well as see the practical assignments. The only thing you can’t get access to is the actual output of the students taking the class, which makes a ton of sense to me. The cool thing is, the practicals use the next tool extensively. There are hours of material here, so be prepared to spend a few nights and weekends on it.
  4. NetLogo – This is just about the coolest tool I’ve found to date for modeling complex adaptive systems. Described as a “multi-agent programmable modeling environment”, NetLogo provides a basic engine and a simple modeling language that allows you to create all kinds of amazing CAS models. It also has a big list of sample models, including classics such as flocking birds, forest fires and even a model of PageRank from Larry Page and Sergey Brin’s Stanford days. This one is really fun for understanding a large cross section of CAS modeling. One of these days, I’m hoping to create an original “from scratch” model of some aspect of cloud computing in this tool.
  5. Drift into Failure, by Sidney Dekker – I am currently reading this one, so it may be a bit premature to recommend it as a starting place. It certainly is a dense work, but what makes it important to me is its clear statement of situations where complexity takes many apparently good, honest decisions and aggregates them into less than desirable outcomes. This is a must-read for those who get CAS to a certain extent, but need to better understand why CAS means we have to think differently about software and systems design and operations in the cloud. This is the very meat of what I think most programmers and operators don’t understand about cloud today.

There will be other resources that I will share with you as time goes by. (For example, Adrian Cockcroft of Netflix pointed me to this post, which I haven’t digested yet, but whose title intrigues me to no end.) In the meantime, please enjoy these, and please post comments with your own recommendations for key sources of complex adaptive systems knowledge. Or cloud knowledge, for that matter. I look forward to learning from you as much as sharing my own learning with you.


Exploring cloud and complex systems

February 13, 2012

This post—this blog—has been a long time coming.

While many may know me from my long journey exploring cloud computing (the first few years of which are archived to this blog, followed by three years at CNET writing The Wisdom of Clouds, and now my continuing work on GigaOm/cloud), for some time now I’ve been keenly interested in a more specific topic under the cloud computing umbrella, namely how cloud computing is driving application architectures to adopt the traits of complex adaptive systems.

Complex adaptive systems (CAS) are fascinating beasts. Described by complex systems pioneer John Holland as “systems that have a large number of components, often called agents, that interact and adapt or learn”, CAS are the reason major systems in nature work—from biology to ecology to economics and society. CAS allow for constant change, with an emphasis on changes that make the system stronger, although there is always a risk that the system will see negative events as well.

Why this is fascinating to me is that CAS resist being broken down into component parts with clear cause-and-effect relationships between the agents involved and the emergent behavior of the system as a whole. In fact, the sheer complexity of these systems means that predicting the outcome of any given action within the system is at best difficult, and quite likely impossible.

An excellent example of this, which I’ve used before, is a pile of sand on a table. Imagine dropping more sand, one grain at a time, onto that pile. Quick: how many grains of sand will fall off the table with each grain added to the “system”? It is impossible to predict.

Now, granted, a table of sand isn’t exactly “adaptive”, but it is indicative of what complexity does to predictability.
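
If you want to see that unpredictability for yourself, a toy sandpile takes only a few lines of Python. This is a simplified, hypothetical take on the classic Bak-Tang-Wiesenfeld sandpile model, not a faithful physics simulation: the same action, dropping one grain, produces avalanche sizes you cannot guess without running the whole system forward.

```python
import random

SIZE = 11  # a small square "table"
grid = [[0] * SIZE for _ in range(SIZE)]  # grains stacked on each cell

def drop_grain():
    """Drop one grain at a random spot; return how many grains fall off the table."""
    grid[random.randrange(SIZE)][random.randrange(SIZE)] += 1
    fell_off = 0
    unstable = True
    while unstable:
        unstable = False
        for r in range(SIZE):
            for c in range(SIZE):
                if grid[r][c] >= 4:  # too tall: topple one grain to each neighbor
                    unstable = True
                    grid[r][c] -= 4
                    for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                        if 0 <= nr < SIZE and 0 <= nc < SIZE:
                            grid[nr][nc] += 1
                        else:
                            fell_off += 1  # pushed off the edge of the table
    return fell_off

for grain in range(3000):
    lost = drop_grain()
    if grain >= 2980:  # once the pile has built up, watch the variance
        print(f"grain {grain}: {lost} grain(s) fell off the table")
```

Most drops dislodge nothing; every so often a single grain triggers an avalanche that sheds many.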

CAS and IT

When applied to IT—everything from architectures to markets to organizations—the science of complex adaptive systems has deep ramifications for the ways we plan, design, build, troubleshoot and adapt our most important applications of technology to business. We can’t do everything top-down anymore. Much, much more of our work has to be built and maintained from the bottom up.

Now, I am a novice at all this, so much of what I believe I know today will likely turn out to be wrong. As a quick example, the “Cloud as Complex Systems Architecture” presentation I am giving at Cloud Connect in Santa Clara, CA on Tuesday was supposed to center on how you automate the operations of individual software components to survive in the cloud. “Focus on tweaking agent automation”, my message was going to be.

However, that is counter to the reality of what has to happen; it is just as important to evaluate the system as a whole, and adjust *whatever* needs to be adjusted when issues are identified system-wide. In other words, an agent-level focus is exactly the kind of thing that gets you in trouble in complex systems. You see? I am still learning.

I hope you will join me as I take this journey. Follow me on Twitter at @jamesurquhart. Subscribe to the RSS feed for this blog. Leave comments. Challenge me. Point me to new sources of information. Tell me I’m full of it and should start over. In return, I promise to listen, and to share my own journey, including insights I gain from books, online courses and the other very smart people I am very lucky to interact with.

This is going to be fun. I’m pumped to get started.

The Wisdom of Clouds is moving!!!

December 8, 2008

Finally! It’s been a long time coming, but the “morphing” of The Wisdom of Clouds I’ve hinted at a couple of times in the last month is finally here. Dan Farber and Margaret Kane, the good editors at CNET, have agreed to publish this blog (with a slight name change) on the CNET Blog Network. Henceforth the blog will be titled “The Wisdom of the Clouds”, and located at http://news.cnet.com/the-wisdom-of-clouds. Please go there and subscribe today.

CNET is one of the most respected IT news sources, and with about 15 million unique visitors a month, this is a huge opportunity to broaden the cloud computing discussion to the mainstream IT community. The other members of the CNET Blog Network include such thought leaders as Matt Asay, Gordon Haff and Peter N. Glaskowsky, and I am humbled to be listed among them.

However, I also want to recognize and thank each of you for helping to make The Wisdom of Clouds what it is today. At the beginning of 2008, I had a little over 120 subscribers. This last week saw a record 948 subscribers, with over 200 of you reading each new post within 24 hours of it hitting the feeds, and about 50 more reading the same on the blog pages themselves. It has been tremendously enriching to see the uptick in interest, and I am grateful to each of you for your interest, attention and feedback. Thank you.

Unfortunately, this transition will not be without its inconveniences. As you may have guessed, I will no longer be publishing to this site; for now http://blog.jamesurquhart.com will become an archive site for the two years or so of posts that I’ve written since early in my Cassatt days. I will frequently refer back to those posts initially, but all new material will appear at CNET. If you want to follow where the conversation goes from here, it is important that you go to the CNET URL and subscribe.

I will probably continue to publish my del.icio.us bookmarks to the existing feed for a while, but I want to consolidate that traffic with the article publications over time. Stay tuned for how that will work out. I won’t be bookmarking my own posts as a rule, so be sure to subscribe to the new feed.

Please let me know if you have any problems or concerns with the transition, and I hope that each and every one of you will continue to be a part of my own education about the cloud and its consequences. As always, I can be reached at jurquhart at (ignore this) yahoo dot com.

Again, thank you all, and I’ll see you on The Wisdom of the Clouds.


The Two Faces of Cloud Computing

December 6, 2008

One of the fun aspects of a nascent cloud computing market is that there are “veins” of innovative thinking to be mined from all of the hype. Each of us discovers these independently, though the velocity of recognition increases greatly as “asymmetrical follow” patterns take effect. Those “really big ideas” of cloud computing usually start as a great observation by one or a few independent bloggers. If you are observant, and pay attention to patterns in terminology and concepts, you can get a jump on the opportunities and intellectual advances triggered by a new “really big idea”.

One of the memes I have been noticing more and more in the last week is that of the two-faceted cloud: the concept that cloud computing is beginning to address two different market needs, that of large-scale web applications (the so-called “Web 2.0” market) and that of traditional data center computing (the so-called “Enterprise” market). As I’ll try to explain, this is a “reasonably big idea” (or perhaps “reasonably big observation” is a more accurate portrayal).

I first noticed the meme when I was made aware of a Forrester report titled “There Are Two Types Of Compute Clouds: Server Clouds And Scale-Out Clouds Serve Very Different Customer Needs”, written by analyst Frank E. Gillett. The abstract gives the best summary of the concept that I’ve found to date:

“Cloud computing is a confusing topic for vendor strategists. One reason? Most of us confuse two fundamentally different types of compute clouds as one. Server clouds support the needs of traditional business apps while scale-out clouds are designed for massive, many-machine workloads such as Web sites or grid compute applications. Scale-out clouds differ from server clouds in five key ways: 1) much larger workloads; 2) loosely coupled software architecture; 3) fault tolerance in software, not hardware; 4) simple state management; and 5) server virtualization is for provisioning flexibility — not machine sharing. Strategists must update their server virtualization plans to embrace the evolution to server cloud, while developing a separate strategy to compete in the arena for scale-out clouds.”

Get it? There are two plans of attack for an enterprise looking to leverage the cloud:

  • How do you move existing load to the IaaS, PaaS, and SaaS providers?
  • How do you leverage the new extremely large scale infrastructures used by the Googles and Amazons of the world to create new competitive advantage?

Around then I started seeing references to other posts suggesting the same thing: that there are two customers for the cloud, those that need to achieve higher scale at lower cost than was possible before, and those that want to eliminate data center capital in favor of a “pay-as-you-go” model.

I’m not sure how revolutionary this observation is (obviously many people noticed it before it clicked with me), but it is important. Where is it most obvious? In my opinion, the three PaaS members of the “big four” are good examples:

  • Google is the sole Scale-out vendor on the list…for now. I hear rumors that Microsoft may explore this as well, but for now it is not Mr. Softy’s focus.
  • Microsoft’s focus is, on the other hand, the enterprise. By choosing a .NET centric platform, Azure, complete with Enterprise Service Bus and other integration-centric technologies, they have firmly targeted the corporate database applications that run so much of our economy today.
  • Salesforce.com is perhaps the most interesting in that they chose to compete for enterprises with force.com and Sites, but through a “move all your stuff here” approach. Great for the Salesforce.com users, but perhaps a disadvantage to those wishing to build stand-alone systems, much less those wishing to integrate with their on-premises SAP instances.

The point here, I guess, is that comparisons between Scale-out and Enterprise clouds, while sometimes tempting (especially in the Google vs. Microsoft case), are rather useless. They serve different purposes, often for completely different audiences, and enterprise IT organizations would do better to focus their efforts on the specific facet of cloud computing that applies to a given project. If you are a budding PaaS vendor, understand the distinction, and focus on the technologies required to meet your market’s demand. Don’t try to be “all cloud to all people”.

Except, possibly, if you are Microsoft…


The enterprise "barrier-to-exit" to cloud computing

December 2, 2008

An interesting discussion ensued on Twitter this weekend between myself and George Reese of Valtira. George–who recently published some thought-provoking posts on O’Reilly Broadcast about cloud security, and is writing a book on cloud computing–argued strongly that the benefits gained from moving to the cloud outweighed any additional costs that might ensue. In fact, in one tweet he noted:

IT is a barrier to getting things done for most businesses; the Cloud reduces or eliminates that barrier.

I reacted strongly to that statement; I don’t buy that IT is that bad in all cases (though some certainly is), nor do I buy that simply eliminating a barrier to getting something done makes it worthwhile. Besides, the barrier being removed isn’t strictly financial, it is corporate IT policy. I can build a kick-butt home entertainment system for my house for $50,000; that doesn’t mean it’s the right thing to do.

However, as the conversation unfolded, it became clear that George and I were coming at the problem from two different angles. George was talking about many SMB organizations, which really can’t justify the cost of building their own IT infrastructure, but have been faced with a choice of doing just that, turning to (expensive and often rigid) managed hosting, or putting a server in a colo space somewhere (and maintaining that server). Not very happy choices.

Enter the cloud. Now these same businesses can simply grab capacity on demand, start and stop billing at their leisure, and get truly world-class power, virtualization and networking infrastructure without having to put an ounce of thought into it. Yeah, it costs more than simply running a server would cost, but when you add up the infrastructure/managed hosting fees/colo leases, cloud almost always looks like the better deal. At least that’s what George claims his numbers show, and I’m willing to accept that. It makes sense to me.

I, on the other hand, was thinking of medium to large enterprises that already own significant data center infrastructure, and already have sunk costs in power, cooling and assorted facilities. For this class of business, those sunk costs must be added to server acquisition and operation costs when weighing them against the cost of getting the same services from the cloud. In this case, these investments often tip the balance, and it becomes much cheaper to use existing infrastructure (though with some automation) to deliver fixed-capacity loads. As I discussed recently, the cloud generally only gets interesting for loads that are not running 24x7.

(George actually notes a class of applications that sadly are also good candidates, though they shouldn’t necessarily be: applications that IT just can’t or won’t get to on behalf of a business unit. George claims his business makes good money meeting the needs of marketing organizations that have this problem. Just make sure the ROI is really worth it before taking this option, however.)
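
To put rough numbers on that utilization point, here is a hedged back-of-the-envelope sketch. Every figure in it is a placeholder assumption, not a quote from any provider; the takeaway is that utilization, more than technology, decides which side of the ledger wins.

```python
# Back-of-the-envelope: owned capacity vs. on-demand cloud capacity.
# All numbers below are placeholder assumptions for illustration only.
OWNED_MONTHLY_COST = 150.0   # amortized server + power/cooling/space, per month
CLOUD_HOURLY_RATE = 0.40     # per instance-hour, on demand
HOURS_PER_MONTH = 730

def monthly_cloud_cost(utilization):
    """Cost of renting capacity only for the fraction of hours you actually use."""
    return CLOUD_HOURLY_RATE * HOURS_PER_MONTH * utilization

for pct in (5, 25, 50, 75, 100):
    cloud = monthly_cloud_cost(pct / 100)
    winner = "cloud" if cloud < OWNED_MONTHLY_COST else "owned"
    print(f"{pct:3d}% utilized: cloud ≈ ${cloud:6.2f}/mo vs owned "
          f"${OWNED_MONTHLY_COST:.2f}/mo -> {winner} wins")
```

With these made-up numbers the break-even lands at roughly 50% utilization; with your own numbers it will land somewhere else, but the shape of the argument is the same.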

This existing investment in infrastructure therefore acts almost as a “barrier-to-exit” for these enterprises when considering moving to the cloud. It seems to me highly ironic, and perhaps somewhat unique, that certain trails in the cloud computing market will be blazed not by organizations with multiple data centers and thousands upon thousands of servers, but by the little mom-and-pop shop that used to own a couple of servers in a colo somewhere, finally shut them down, and turned to Amazon. How cool is that?

The good news, as I hinted at earlier, is that there is technology that can be rationalized financially–through capital equipment and energy savings–which in turn can “grease the skids” for cloud adoption in the future. Ask the guys at 3tera. They’ll tell you that their cloud infrastructure allows an enterprise to optimize infrastructure usage while enabling workload portability (though not portability of running workloads) between cloud providers running their stuff. VMware introduced their vCloud initiative specifically to make enterprises aware of the work they are doing to allow workload portability across data centers running their stuff. Cisco (my employer) is addressing the problem. In fact, there are several great products out there that can give you cloud technology in your enterprise data center that will open the door to cloud adoption now (with things like cloudbursting) and in the future.

If you aren’t considering how to “cloud enable” your entire infrastructure today, you ought to be getting nervous. Your competitors probably are looking closely at these technologies, and when the time is right, their barrier-to-exit will be lower than yours. Then, the true costs of moving an existing data center infrastructure to the cloud will become painfully obvious.

Many thanks to George for the excellent discussion. Twitter is becoming a great venue for cloud discussions.

What is the value of IT convenience?

November 29, 2008

rPath’s Billy Marshall wrote a post that is closely related to a topic I have been thinking about a lot lately. Namely, Billy points out that the effect of server virtualization hasn’t been to satisfy the demand for IT resources, but simply to accelerate that demand by simplifying resource allocation. Billy gives a very clear example of what he means:

“Over the past 2 weeks, I have had a number of very interesting conversations with partners, prospects, customers, and analysts that lead me to believe that a virtual machine tsunami is building which might soon swamp the legacy, horizontal system management approaches. Here is what I have heard:

Two separate prospects told me that they have quickly consumed every available bit of capacity on their VMware server farms. As soon as they add more capacity, it disappears under the weight of an ever pressing demand of new VMs. They are scrambling to figure out how they manage the pending VM sprawl. They are also scrambling to understand how they are going to lower their VMware bill via an Amazon EC2 capability for some portion of the runtime instances.

Two prominent analysts proclaimed to me that the percentage of new servers running a hypervisor as the primary boot option will quickly approach 90% by 2012. With all of these systems sporting a hypervisor as the on-ramp for applications built as virtual machines, the number of virtual machines is going to explode. The hypervisor takes the friction out of the deployment process, which in turn escalates the number of VMs to be managed.”

The world of Infrastructure as a Service isn’t really any different:

“Amazon EC2 demand continues to skyrocket. It seems that business units are quickly sidestepping those IT departments that have not yet found a way to say “yes” to requests for new capacity due to capital spending constraints and high friction processes for getting applications into production (i.e. the legacy approach of provisioning servers with a general purpose OS and then attempting to install/configure the app to work on the production implementation which is no doubt different than the development environment). I heard a rumor that a new datacenter in Oregon was underway to support this burgeoning EC2 demand. I also saw our most recent EC2 bill, and I nearly hit the roof. Turns out when you provide frictionless capacity via the hypervisor, virtual machine deployment, and variable cost payment, demand explodes. Trust me.”

Billy isn’t the only person I’ve heard comment about their EC2 bill lately. Justin Mason commented on my post, “Do Your Cloud Applications Need to be Elastic?”:

“[W]e also have inelastic parts of the infrastructure that could be hosted elsewhere at a colo for less cost, and personally, I would probably have done this given the choice; but mgmt were happier just to use EC2 as widely as possible, despite the additional costs, since it keeps things simpler.”

In each case, management chooses to pay more for convenience.

I think these examples demonstrate an important decision point for IT organizations, especially during these times of financial strife. What is the value of IT convenience? When is it wise to choose to pay more dollars (or euros, or yen, or whatever) to gain some level of simplicity or focus or comfort? In the case of virtualization, is it always wise to leverage positive economic changes to expand service coverage? In the case of cloud computing, is it always wise to accept relatively high price points per CPU hour over managing your own cheaper compute loads?

I think there are no simple answers, but there are some elements that I would consider if the choice was mine:

  • Do I already have the infrastructure and labor skills I need to do it as well as or better than the cloud? If I were to simply apply some automation to what I already have, would it deliver the elasticity/reliability/agility I want without committing a monthly portion of my corporate revenues to an outside entity?

  • Is virtualization and/or the cloud the only way to get the agility I need to meet my objectives? The answer here is often “yes” for virtualization, but is it as frequently for cloud computing?

  • Do I have the luxury of cash flow that allows for me to spend up a little for someone else to worry about problems that I would have to handle otherwise? Of course, this is the same question that applies to outsourcing, managed hosting, etc.

One of the reasons you’ve seen a backlash against some aspects of cloud computing, or even a rising voice to the “it’s the same thing we tried before” argument, is that much of the marketing hype out there is starting to ignore the fact that cloud computing costs money; enough money, in fact, to provide a profit to the vendor. Yes, it is true that many (most?) IT organizations have lacked the ability to deliver the same efficiencies as the best cloud players, but that can change, and change quickly, if those same organizations were to look to automation software and infrastructure to provide that efficiency.

My advice to you: if you already own data centers, and if you want convenience on a budget, balance the cost of Amazon/GoGrid/Mosso/whoever with the value delivered by Arjuna/3TERA/Cassatt/Enomaly/etc./etc./etc., including controlling your virtualization sprawl and preparing you to use the cloud in innovative ways. Consider making your storage and networking virtualization-friendly.

Sometimes convenience starts at home.
