
Archive for the ‘architecture’ Category

Do Your Cloud Applications Need To Be Elastic?

November 22, 2008

I got to spend a few hours at Sys-Con’s Cloud Computing Expo yesterday, and I have to say it was most certainly an intellectually stimulating day. Not only was just about every US cloud startup represented in one way or another, but the day also included an unusual conference session and a meetup of CloudCamp fans.

While listening in on a session, I overheard one participant ask how the cloud would scale their application if they couldn’t replicate it. This triggered a strong response in me, as I really feel for those who confuse autonomic infrastructures with magic applied to scaling unscalable applications. Let me be clear: the cloud can’t scale your application (much, at least) if you didn’t design it to be scaled. Period.

However, that caused me to ask myself whether or not an application had to be horizontally scalable in order to gain economically while running in an Infrastructure as a Service (IaaS) cloud. The answer, I think, is that it depends.

Chris FlexFleck of Citrix wrote up a pretty decent two-part explanation of this on his blog a few weeks ago. He starts out with some basic costs of acquiring and running 5 Quad-core servers–either on-premises (amortized over 3 years at 5%) or in a colocation data center–against the cost of running equivalent “high CPU” servers 24X7 on Amazon’s EC2. The short-short of his initial post is that it is much more expensive to run full time on EC2 than it is to run on premises or in the colo facility.

How much more expensive?

  • On-premises: $7800/year
  • Colocation: $13,800/year
  • Amazon EC2: $35,040/year

I tend to believe this reflects the truth, even if it’s not 100% accurate. First, while you may think “ah, Amazon…that’s 10¢ a CPU hour”, in point of fact most production applications that you read about in the cloud-o-sphere are using the larger instances. Chris is right to use high CPU instances in his comparison at 80¢/CPU hour; at that rate a single instance running around the clock costs 0.80 × 24 × 365, or about $7,000 a year, and five of them land right at the $35,040 figure above. Second, while it’s tempting to think in terms of upfront costs, your accounting department will in fact spread the capital costs out over several years, usually 3 years for a server.

In the second part of his analysis, however, Chris notes that the cost of the same Amazon instances varies based on the amount of time they are actually used, as opposed to the physical infrastructure that must be paid for whether it is used or not (with the possible exception of power and AC costs). This comes into play in a big way if the same instances are used judiciously for varying workloads, such as the hybrid fixed/cloud approach he uses as an example.

In other words, if you have an elastic load and you plan for “standard” variances on-premises while letting “excessive” spikes in load trigger instances on EC2, you suddenly have a very compelling case relative to buying enough physical infrastructure to handle excessive peaks yourself. As Chris notes:

“To put some simple numbers to it based on the original example, let’s assume that the constant workload is roughly equal to 5 Quadcore server capacity. The variable workload on the other hand peaks at 160% of the base requirement, however it is required only about 400 hours per year, which could translate to 12 hours a day for the month of December or 33 hours per month for peak loads such as test or batch loads. The cost for a premise only solution for this situation comes to roughly 2X or $ 15,600 per year assuming existing space and a 20% factor of safety above peak load. If on the other hand you were able to utilize a Cloud for only the peak loads the incremental cost would be only $1,000. ( Based on Amazon EC2 )

Premise Only
$ 15,600 Annual cost ( 2 x 7,800 from Part 1 )
Premise Plus Cloud
$ 7,800 Annual cost from Part 1
$ 1,000 Cloud EC2 – ( 400 x .8 x 3 )
$ 8,800 Annual Cost Premise Plus Cloud “
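If you want to check the arithmetic yourself, here is a minimal sketch (Python, purely for illustration) that reproduces the numbers above. The base cost, hourly rate, peak hours and instance counts all come straight from Chris’s example; the rest is simple multiplication.

```python
# Reproducing the figures from Chris's example. All inputs below come from
# his posts; only the arithmetic is mine.

ONPREM_BASE = 7800            # 5 on-premises servers, $/year (from Part 1)
EC2_RATE = 0.80               # $/hour for a high-CPU EC2 instance
HOURS_PER_YEAR = 24 * 365

# Running the equivalent of 5 servers full time on EC2:
ec2_full_time = 5 * EC2_RATE * HOURS_PER_YEAR
print(f"EC2 full time:       ${ec2_full_time:,.0f}/year")   # $35,040

# Premises sized for the 160% peak: roughly 2x the base footprint.
premise_only = 2 * ONPREM_BASE
print(f"Premise only (peak): ${premise_only:,.0f}/year")    # $15,600

# Hybrid: base load on premises, ~400 peak hours/year on ~3 EC2 instances.
peak_hours, peak_instances = 400, 3
cloud_burst = peak_hours * EC2_RATE * peak_instances        # ~$960; Chris rounds to $1,000
hybrid = ONPREM_BASE + cloud_burst
print(f"Premise plus cloud:  ${hybrid:,.0f}/year")          # ~$8,760 (vs. his $8,800)
```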

The lesson of our story? Using the cloud makes the most sense when you have an elastic load. I would postulate that another option would be a load that is not powered on at full strength 100% of the time. Some examples might include:

  • Dev/test lab server instances
  • Scale-out applications, especially web application architectures
  • Seasonal load applications, such as personal income tax processing systems or retail accounting systems

On the other hand, you probably would not use Infrastructure as a Service today for:

  • That little accounting application that has to run at all times, but has at most 20 concurrent users
  • The MS Exchange server for your 10 person company. (Microsoft’s multi-tenant Exchange online offering is different–I’m talking hosting your own instance in EC2)
  • Your network monitoring infrastructure

Now, the managed hosting guys are probably going to jump down my throat with counter-arguments about the level of service provided by (at least their) hosting clouds, but my experience is that all of these clouds actually treat self-service as self-service, and that there really is very little difference between do-it-yourself on-premises and do-it-yourself in the cloud.

What would change these economics to the point that it would make sense to run any or all of your applications in an IaaS cloud? Well, I personally think you need to see a real commodity market for compute and storage capacity before you see pricing that tilts the economics in favor of running fixed loads in the cloud. There has been a wide variety of posts in the past about what it would take [pdf] to establish a cloud market, so I won’t go back over that subject here. However, if you are considering “moving my data center to the cloud”, please keep these simple economics in mind.

Why the Choice of Cloud Computing Type May Depend On Who’s Buying

November 15, 2008

Thanks to Ron K. Jeffries’ Cloudy Thinking blog, I was directed to Redmonk’s Stephen O’Grady (who I now subscribe to directly) and his excellent post titled Cloud Types: Fabric vs Instance. Stephen makes an excellent observation about the nature of Infrastructure as a Service (increasingly called “Utility Computing” by Tim O’Reilly’s followers) and Platform as a Service (that name remains consistent). His observation is this:

“…Tim seems to feel that they are aspects of the types, while I’m of the opinion that they instead define the type. For example, by Tim’s definition, one characteristic of Utility Computing style clouds is virtual machine instances, where my definitions rather centers on that.

Here’s how I typically break down cloud computing styles:

Fabric

Description: A problematic term, perhaps, because a few of the vendors employ it towards different ends, but I use it because it’s descriptive. Rather than deploy to virtualized instances, developers building on this style cloud platform write instead to a fabric. The fabric’s role is to abstract the underlying physical and logical architecture from the developer, and – typically – to assume the burden of scaling.
Example: Google App Engine

Instance

Description: Instance style clouds are characterized by, well, instances. Unlike the fabric cloud, there is little to no abstraction present within instance based clouds: they generally recreate – virtually – a typical physical infrastructure composed of instances that include memory, processing cycles, and so on. The lack of abstraction can offer developers more control, but this control is typically offered at the cost of transparent scaling.
Example: Amazon EC2″

I love that distinction. First, for those struggling to see how Amazon/GoGrid/Flexiscale/etc. relates to Google/Microsoft/Salesforce.com/Intuit/etc., it delineates a very clear difference. If you are reserving servers on which to run applications, it is IaaS. If you are running your application free of care about which and how many resources are consumed, then it is PaaS. Easy.

However, I am even more excited by a thought that occurred to me as I read the post. One of the things that this particular distinction points out is the likelihood that the buyers of each type would be different classes of enterprise IT professionals.

It’s not black and white, but I would be willing to bet heavily that:

  • The preponderance of interest in IaaS is from those whose primary concern is system administration; those with complex application profiles, who want to tweak scalability themselves, and who want the freedom to determine how data and code get stored, accessed and acted upon.

  • The preponderance of interest in PaaS is from those whose primary concern is application development; those with a functional orientation, who want to be more concerned with creating application experiences than with worrying about how to architect for deployment in a web environment (or whatever the framework provides).

In other words, server jockeys choose instances, while code jockeys choose fabric.

Now, the question quickly becomes, if developers can get the functionality and scalability/reliability/availability required from PaaS, without hiring the system administrators, why would any enterprise choose IaaS unless they were innovating at the architecture level? On the other hand, if all you want to do is add capacity to existing functionality, or you require an unusual or even innovative architecture, or you need to guarantee that certain security and continuity precautions are in place, why would you ever choose PaaS?

This, in turn, boils right back down to the PaaS spectrum I spoke of recently. Choose your cloud type based on your true need, but also take into account the skill set you will require. Don’t focus on a single brand just because it’s cool to your peers. Pick IaaS if you want to tweak infrastructure, otherwise by all means find the PaaS platform that best suits you. You’ll probably save in the long run.

Now, I’ve clearly suppressed the fact that developers probably still want some portability…though I must note that choosing a programming language alone limits function portability. (Perhaps that’s OK if the productivity gains outweigh the likelihood of having to port.) Also, the things that system administrators are doing in the enterprise are extremely important, like managing security, data integrity and continuity. There are no guarantees that any of the existing PaaS platforms can help you with any of that.

Something to think about, anyway. What do you think? Will developers lean towards PaaS, while system administrators lean towards IaaS? Who will win the right to choose within the enterprise?

Why I Think CohesiveFT’s VPN-Cubed Matters

October 28, 2008

You may have seen some news about CohesiveFT’s new product today–in large part thanks to the excellent online marketing push they made in the days preceding the announcement. (I had a great conversation with Patrick Kerpan, their CTO.) Normally, I would get a little suspicious about how big a deal such an announcement really is, but I have to say this one may be for real. And so do others, like Krishnan Subramanian of CloudAve.

CohesiveFT’s VPN-Cubed is targeting what I call “the last great frontier of the cloud”: networking. Specifically, it is focusing on a key problem–data security and control–in a unique way. The idea is that VPN-Cubed gives you software that allows you to create a VPN of sorts that is under your personal control, regardless of where the endpoints reside, on or off the cloud. Think of it as creating a private cloud network, capable of tying systems together across a plethora of cloud providers, as well as your own network.

The use case architecture is really very simple.


Diagram courtesy of CohesiveFT

VPN-Cubed Manager VMs are run in the network infrastructure that you wish to add to your cloud VPN. The manager then acts as a VPN gateway for the other VMs in that network, which can then communicate with other systems on the VPN via virtual NICs assigned to the VPN. I’ll stop there, because networking is not my thing, but I will say it is important to note that this is a portable VPN infrastructure, which you can run on any compatible cloud, and CohesiveFT’s business is to create images that will run on as many clouds as possible.

Patrick made a point of using the word “control” a lot in our conversation. I think this is where VPN-Cubed is a game changer. It is one of the first products I’ve seen target isolating your stuff in someone else’s cloud, protecting access and encryption in a way that leaves you in command–assuming it works as advertised…and I have no reason to suspect otherwise.

Now, will this work with PaaS? No. SaaS? No. But if you are managing your applications in the cloud, even a hybrid cloud, and are concerned about network security, VPN-Cubed is worth a look.

What are the negatives here? Well, first I think VPN is a feature of a larger cloud network story. This is the first and only of its kind in the market, but I have a feeling other network vendors looking at this problem will address it in a more comprehensive solution.

Still, CohesiveFT has something here: it’s simple, it is entirely under your control, and it serves a big immediate need. I think we’ll see a lot more about this product as word gets out.

Cisco’s Nexus 1000v and the Cloud: Is it really a big deal?

September 17, 2008

Yesterday, the big announcements at VMWorld 2008 were about Cloud OSes. Today, the big news seemed to be Maritz’s keynote (where he apparently laid out an amazing vision of what VMWare thinks they can achieve in the coming year), and the long rumored Cisco virtual switch.

The latter looks to be better than I had hoped for functionally, though perhaps a little more locked in to VMWare than I’d like. There is an explanation for that lock-in, however, so it may not be so bad…see below.

I’ve already explained why I love the Nexus concept so much. Today, Cisco and VMWare jointly announced the Nexus 1000v virtual machine access switch, a fully VI compatible software switch that…well, I’ll let Cisco’s data sheet explain it:

“The Cisco Nexus™ 1000V virtual machine access switch is an intelligent software switch implementation for VMware ESX environments. Running inside of the VMware ESX hypervisor, the Cisco Nexus 1000V supports Cisco® VN-Link server virtualization technology, providing

  • Policy-based virtual machine (VM) connectivity
  • Mobile VM security and network policy, and
  • Non-disruptive operational model for your server virtualization, and networking teams.

When server virtualization is deployed in the data center, virtual servers typically are not managed the same way as physical servers. Server virtualization is treated as a special deployment, leading to longer deployment time with a greater degree of coordination among server, network, storage, and security administrators. But with the Cisco Nexus 1000V you can have a consistent networking feature set and provisioning process all the way from the VM to the access, aggregation, and core switches. Your virtual servers can use the same network configuration, security policy, tools, and operational models as physical servers. Virtualization administrators can leverage predefined network policy that follows the nomadic VM and focus on virtual machine administration. This comprehensive set of capabilities helps you to deploy server virtualization faster and realize its benefits sooner.”

In other words, the 1000v is a completely equal player in a Cisco fabric, and can completely leverage all of the skill sets and policy management available in its other switches. Think “my sys admins can do what they do best, and my network admins can do what they do best”. Furthermore, it supports VN-Link, which allows VMWare systems running on Cisco fabric to VMotion without losing any network or security configuration. Read that last sentence again.

(I wrote some time ago about network administrators facing the most change from this whole pooled-resource thing–this feature seals the deal. Those static network maps they used to hang on the wall, showing exactly what system was connected to what switch port with what IP address, are now almost entirely obsolete.)

I love that feature. I will love it even more if it functions in its entirety in the vCloud concept that VMWare is pitching, and all indications are that it will. So, to tell the story here as simply as possible:

  • You create a group of VMs for a distributed application in VConsole
  • You assign network security and policy via Cisco tools, using the same interface as on the physical switches
  • You configure VMWare to allow VMs for the application to get capacity from an external vendor–one of dozens supporting vCloud
  • When an unexpected peak hits, your VM cluster grabs additional capacity as required in the external cloud, without losing network policy and security configurations.

Cloud computing nirvana.
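To make the “policy follows the VM” flow a bit more concrete, here is a toy sketch. To be clear, this is not the real Cisco or VMware API–just a couple of hypothetical Python classes illustrating the core idea: the port profile (the security and network policy) is bound to the VM’s virtual port rather than to a physical switch port, so a migration, even to an external vCloud provider, carries the policy along.

```python
# Toy model only -- not the real Cisco or VMware APIs. It illustrates the idea
# that the port profile (the network/security policy) is bound to the VM's
# virtual port, so it travels with the VM when the VM moves.

from dataclasses import dataclass

@dataclass
class PortProfile:
    name: str
    vlan: int
    allowed_ports: tuple          # e.g. (80, 443)

@dataclass
class VirtualMachine:
    name: str
    profile: PortProfile          # policy attached to the VM, not the host
    host: str = "on-prem-esx-01"

    def migrate(self, new_host: str) -> None:
        # The profile moves with the VM; nothing to reconfigure on arrival.
        self.host = new_host

web_policy = PortProfile("web-dmz", vlan=120, allowed_ports=(80, 443))
vm = VirtualMachine("app-web-07", profile=web_policy)

vm.migrate("external-vcloud-host-42")   # burst out to an external provider
assert vm.profile is web_policy         # same policy object, no manual rework
print(vm.host, vm.profile.name, vm.profile.vlan)
```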

Now, there are some disappointments, as I hinted above. First, the switch is not stackable, as originally hoped, though the interconnectivity of VN-Link probably overrides that. (Is VN-Link just another way to “stack” switches? Networking is not my strong point.)

Update: In the comments below, Omar Sultan of Cisco notes that the switches are, in fact, “virtually stackable”, meaning they can be distributed across multiple physical systems, creating a single network domain for a cluster of machines. I understand that just enough to be dangerous, so I’ll stop there.

More importantly, I was initially kind of ticked off that Cisco partnered so closely with VMWare without being careful to note that they would be releasing similar technologies with Citrix and Red Hat at a minimum. But, as I thought about it, Citrix hitched its wagon to 3TERA, and 3TERA owns every aspect of the logical infrastructure an application runs on. In AppLogic, you have to use their network representation, load balancers, and so on as a part of your application infrastructure definition, and 3TERA maps those to real resources as it sees fit. For network connections, it relies on a “Logical Connection Manager (LCM)”:

“The logical connection manager implements a key service that abstracts intercomponent communications. It enables AppLogic to define all interactions between components of an application in terms of point-to-point logical connections between virtual appliances. The interactions are controlled and tunneled across physical networks, allowing AppLogic to enforce interaction protocols, detect security breaches and migrate live TCP connections from one IP network to another transparently.”

(from the AppLogic Grid Operating System Technical Overview: System Services)

Thus, there is no concept of a virtual switch, per se, in AppLogic. A quick look at their site shows no other partners in the virtual networking or load balancing space (though Nirvanix is a virtual storage partner), so perhaps Cisco simply hasn’t been given the opportunity or the hooks to participate in the Xen/3TERA Cloud OS.

(If anyone at 3TERA would like to clarify, I would be extremely grateful. If Cisco should be partnering here, I would be happy to add some pressure to them to do so.)

As for Red Hat, I honestly don’t know anything about their VMM, so I can’t guess at why Cisco didn’t do anything there…although my gut tells me that I won’t be waiting long to hear about a partnership between those two.

This switch makes VMWare VMs equal players in the data center network, and that alone is going to disrupt a lot of traditional IT practices. While I was at Cassatt, I remember a colleague predicting that absolutely everything would run in a VM by the end of this decade. That still seems a little aggressive to me, but a lot less so than it did yesterday.

Elements of a Cloud Oriented Architecture

In my post, The Principles of Cloud Oriented Architectures, I introduced you to the concept of a software system architecture designed with “the cloud” in mind:

“…I offer you a series of posts…describing in depth my research into what it takes to deliver a systems architecture with the following traits:

  1. It partially or entirely incorporates the clouds for at least one layer of the Infrastructure/Platform/Application stack.
  2. Is focused on consumers of cloud technologies, not the requirements of those delivering cloud infrastructures, either public or private (or even dark).
  3. Takes into account a variety of technical, economic and even political factors that systems running in the “cloud” must take into account.
  4. Is focused at least as much on the operational aspects of the system as the design and development aspects

The idea here is not to introduce an entirely new paradigm–that’s the last thing we need given the complexity of the task ahead of us. Nor is it to replace the basic principles of SOA or any other software architecture. Rather, the focus of this series is on how to best prepare for the new set of requirements before us.”

I followed that up with a post (well, two really) that set out to define what our expectations of “the cloud” ought to be. The idea behind the Cloud Computing Bill of Rights was not to lay out a policy platform–though I am flattered that some would like to use it as the basis of one–but rather to set out some guidelines about what cloud computing customers should anticipate in their architectures. In this continuing “COA principles” series, I intend to lay out what can be done to leverage what vendors deliver, and design around what they fail to deliver.

With that basic framework laid out, the next step is to break down what technology elements need to be considered when engineering for the cloud. This post covers only a list of those elements as I understand them today (feel free to use the comments below to add your own insights); future posts will provide a more thorough analysis of individual elements and/or related groups of elements. The series is really very “stream of consciousness”, so don’t expect too much structure or continuity.

When considering what elements matter in a Cloud Oriented Architecture, we consider first that we are talking about distributed systems. Simply utilizing Salesforce.com to do your Customer Relationship Management doesn’t require an architecture; integrating it with your SAP billing systems does. As your SAP systems most likely don’t run in Salesforce.com data centers, the latter is a distributed systems problem.

Most distributed systems problems have just a few basic elements. For example:

  • Distribution of responsibilities among component parts

  • Dependency management between those component parts

  • Scalability and reliability

    • Of the system as a whole
    • Of each component
  • Data Access and Management

  • Communication and Networking

  • Monitoring and Systems Management

However, because cloud computing involves leveraging services and systems entirely outside of the architect’s control, several additional issues must be considered. Again, for example:

  • How are the responsibilities of a complex distributed system best managed when the services being consumed are relatively fixed in the tasks they can perform?

  • How are the cloud customer’s own SLA commitments best addressed when the ability to monitor and manage components of the system may be below the standards required for the task?

  • How are the economics of the cloud best leveraged?

    • How can a company gain the most work for the least amount of money?
    • How can a company leverage the cloud marketplace for not just cost savings, but also increased availability and system performance?

In an attempt to address the more cloud-specific distributed systems architecture issues, I’ve come up with the following list of elements to be addressed in a typical Cloud Oriented Architecture:

  • Service Fluidity – How does the system best allow for static redeployment and/or “live motion” of component pieces within and across hardware, facility and network boundaries? Specific issues to consider here include:

    • Distributed application architecture, or how is the system designed to manage component dependencies while allowing the system to dynamically find each component as required? (Hint: this problem has been studied thoroughly by such practices as SOA, EDA, etc.)
    • Network resiliency, or how does the system respond to changes in network location, including changes in IP addressing, routing and security?
  • Monitoring – How is the behavior and effectiveness of the system measured and tracked both to meet existing SLAs, as well as to allow developers to improve the overall system in the future? Issues to be considered here include:

    • Load monitoring, or how do you measure system load when system components are managed by multiple vendors with little or no formal agreement on how to share such data with the customer or each other?
    • Cost monitoring, or how does the customer get an accurate accounting of the costs associated with running the system from their point of view? (A minimal sketch of this idea appears just after this list.)
  • Management – How does the customer configure and maintain the overall system based on current and ongoing technical and business requirements? Examples of what needs to be considered here includes:

    • Cost, or what adjustments can be made to the system capacity or deployment to provide the required amount of service capacity at the lowest cost possible? This includes ways to manage the efficiency of computation, networking and storage.
    • Scalability, or how does the system itself allow changes to capacity to meet required workloads? These changes can happen:
      • vertically (e.g. get a bigger box for existing components–physically or virtually)
      • horizontally (e.g. add or remove additional instances of one or more components as required)
      • from a network latency perspective (adjust the ways in which the system accesses the network in order to increase overall system performance)
    • Availability, or how does the system react to the failure of any one component, or of any group of components (e.g. when an entire vendor cloud goes offline)?
  • Compliance – How does the overall system meet organizational, industry and legislative regulatory requirements–again, despite being made up of components from a variety of vendors who may themselves provide computing in a variety of legal jurisdictions?
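As a small, concrete example of the cost monitoring element above (the sketch promised in that item), here is one way a customer-side view of cross-provider spend might look. The record layout, providers and rates are all invented for illustration; the point is simply that you normalize whatever usage data each vendor exposes and price it yourself.

```python
# Invented record layout and rates, purely for illustration: normalize each
# vendor's usage data into (provider, resource, quantity) tuples and price
# them yourself to get a single, customer-side view of spend.

from collections import defaultdict

RATES = {                              # $ per unit, by provider and resource
    ("aws", "instance_hours"): 0.80,
    ("aws", "gb_stored"): 0.15,
    ("flexiscale", "instance_hours"): 0.70,
}

usage = [                              # this month's normalized usage records
    ("aws", "instance_hours", 2200),
    ("aws", "gb_stored", 500),
    ("flexiscale", "instance_hours", 400),
]

spend = defaultdict(float)
for provider, resource, quantity in usage:
    spend[provider] += quantity * RATES[(provider, resource)]

for provider, cost in sorted(spend.items()):
    print(f"{provider:12s} ${cost:,.2f}")
print(f"{'total':12s} ${sum(spend.values()):,.2f}")
```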

Now comes the fun of breaking these down a bit, and talking about specific technologies and practices that can address them. Please, give me your feedback (or write up your criticism on your own blog, but link here so I can find you). Point me towards references to other ways to think about the problem. I look forward to the conversation.

The Principles of a Cloud Oriented Architecture

The market is hot. The technologies are appearing fast and furious. The tools you need are out there, but they are young, often untested, and unpredictable in their reliability. You’ve researched the economics, and you know now that cloud computing is a) here to stay, and b) offers economic advantages that–if realized–could stretch your IT budget and quite possibly catapult your career.

Now what?

What is often overlooked in the gleeful rush to cloud computing is the difficulty of molding the early technologies in the space into a truly bulletproof (or even bullet-resistant) business infrastructure. You see it all over the Internet: the push and pull between innovation and reliability, the concerns about security, monitoring and control, even the constant confusion over what constitutes cloud computing, what technologies to select for a given problem, and how to create an enterprise-class business system out of those technologies.

The truth is, cloud computing doesn’t launch our technical architectures into the future. It is, at its heart, an economic model that drives the parameters around how you acquire, pay for and scale the infrastructure architectures you already know. It’s not a question of changing the problems that need solving when utilizing data centers, just a change to the division of responsibilities amongst yourself, your organization, your cloud providers and the Internet itself.

To this end, I offer you a series of posts (perhaps moving to a WIKI in the near future) describing in depth my research into what it takes to deliver a systems architecture with the following traits:

  1. It partially or entirely incorporates the clouds for at least one layer of the Infrastructure/Platform/Application stack.
  2. Is focused on consumers of cloud technologies, not the requirements of those delivering cloud infrastructures, either public or private (or even dark).
  3. Takes into account a variety of technical, economic and even political factors that systems running in the “cloud” must take into account.
  4. Is focused at least as much on the operational aspects of the system as the design and development aspects

The idea here is not to introduce an entirely new paradigm–that’s the last thing we need given the complexity of the task ahead of us. Nor is it to replace the basic principles of SOA or any other software architecture. Rather, the focus of this series is on how to best prepare for the new set of requirements before us.

Think about it. We already deal (or try to deal) with a world in which we don’t entirely have control over every aspect of the world our applications live in. If we are software developers, we rely on others to build our servers, configure our networks, provide us storage and weld them all together into a cohesive unit. System administrators are, in large enterprises anyway, specializing in OS/application stacks, networking, storage or system management. (Increasingly you can add facilities and traditional utilities to this list.)

Even when we outsource to others–shifting responsibility for management of parts or all of our IT infrastructure to a vendor–the vendor doesn’t have control over significant elements of the end-to-end operations of our applications; namely, the Internet itself. But with outsourcing, we typically turn over entire, intact architecture stacks, with a few, very well bounded integration points to manage (if any) between outsourced systems and locally maintained systems.

The cloud is going to mess this up. I say this not just because the business relationship is different from outsourcing, but also because what you are “turning over” can be a *part* of a system stack. Smugmug outsources storage and job processing, but not the web experience that relies on both. Applications that run entirely on EC2/S3 outsource the entire infrastructure, but not the application development, or even the application system management. (This is why RightScale, Hyperic and others are finding some traction with AWS customers.)

To prepare for a cloud oriented architecture, one must understand what responsibilities lie where. So, I’ll give you a teaser of what is to come with the short-short version of where I see these responsibilities lying (subject to change as I talk to others, including yourselves if you choose to comment on this post):

  • The enterprise has responsibility for the following:
    • Defining the business solution to be solved, the use cases that define that solution, and the functional requirements to deliver those use cases
    • Evaluating the selection of technical and economic approaches for delivering those functional requirements, and selecting the best combination of the two. (In other words, the best combination may not contain either the best technical or best economic selection, but will outweigh any other combination of the two.)
    • Owning the service level agreements with the business for the delivery of those use cases. This is critically important. More on this below.
  • The cloud provider has responsibility for the following:
    • Delivering what they promised you (or the market) that they would deliver. No more, no less.
    • Providing you with transparent and honest billing and support services.
  • The Internet itself is only responsible for providing you with an open, survivable infrastructure for interconnecting the networks you need to run your applications and/or services. There are no promises here about reliability or scalability or even availability. It should be considered a technical wilderness, and treated accordingly.

Now, about SLAs. Your cloud provider does not own your SLAs, you do. They may provide some SLAs that support your own, but they are not to be blamed if you fail to achieve the SLAs demanded of you. If your applications or services fail because the cloud failed, you failed. Given that, don’t “outsource” your SLAs, at least not logically. Own them.

In fact, I would argue that the single most important function of a cloud-centric IT shop, after getting required business functionality up and running in the first place, is monitoring and actively managing that functionality; switching vendors, if necessary, to continue service at required levels. The one big piece of IT-specific software that should always run in IT data centers, in my opinion, is the NOC infrastructure. (Although perhaps in this context it’s more of a Cloud Operations Center, but I hate the resulting acronym for obvious reasons.)
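To illustrate what “owning your SLAs” might look like in practice, here is a hedged sketch of a customer-side availability check. The endpoints are placeholders, and a real NOC (or Cloud Operations Center) would do far more, but the point stands: the measurement, and the decision to shift load, live with you rather than with the vendor.

```python
# Placeholder endpoints and thresholds; the point is that availability is
# measured (and acted on) from the customer's side, against the SLA *you* own.

import urllib.request

ENDPOINTS = {
    "primary-cloud":  "https://app.example.com/health",     # hypothetical
    "failover-cloud": "https://app-dr.example.net/health",  # hypothetical
}
SLA_TARGET = 0.999

def is_up(url: str, timeout: float = 5.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

history: dict[str, list[bool]] = {name: [] for name in ENDPOINTS}

def run_check() -> None:
    for name, url in ENDPOINTS.items():
        history[name].append(is_up(url))
        window = history[name][-1000:]             # rolling availability window
        availability = sum(window) / len(window)
        if availability < SLA_TARGET:
            print(f"ALERT: {name} at {availability:.3%}; "
                  f"time to shift load to another provider")

run_check()   # in a real NOC this would run on a schedule, not once
```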

I’ll focus more on these responsibilities in future posts. All posts in this series will be tagged “coa principles”. Please feel free to provide me feedback in the comments, contact me to review your thoughts on this topic, or simply send me links that you think I should be aware of. I am also working to find other bloggers who wish to take ownership of parts of this primer (cloud security, for example), so let me know if you are interested there as well.

I am excited about this. This body of knowledge (or at least the faint traces of it) has been rattling around inside my head for some time, and it feels good to finally be sharing it with you.

Cloud Outages, and Why *You* Have To Design For Failure

I haven’t posted for a while because I have been thinking…a lot…about cloud computing, inevitable data center outages, and what they mean for application architectures. Try as I might to put the problem on the cloud providers, I keep coming back to one bare fact: the cloud is going to expose a lot of the shortcomings of today’s distributed architectures, and this time it’s up to us to make things right.

It all started with some highly informative posts from the Data Center Knowledge blog chronicling outages at major hosting companies, and failures that helped online companies learn important lessons about scaling, etc. As I read these posts, the thought that struck me was, “Well, of course. These types of things are inevitable. Who could possibly predict every possible negative influence on an application, much less a data center?” I’ve been in enough enterprise IT shops to know that even the very best are prepared for something unexpected to happen. In fact, what defines the best shops is that they assume failure and prepare for it.

Then came the stories of disgruntled employees locking down critical information systems or punching the emergency power kill switch on their way out the door. Whether or not you are using the cloud, human psychology being what it is, we have to live every day with immaturity or even just plain insanity.

Yet, each time one of the big name cloud vendors has an outage–Google had one, as did Amazon a few times, including this weekend–there are a bunch of IT guys crying out, “Well, there you go. The cloud is not ready for production.”

Baloney, I say. (Well, I actually use different vocabulary, but you get the drift.) Truth is, the cloud is just exposing people’s unreasonable expectations for what a distributed, disparate computing environment provides. The idea that some capacity vendor is going to give you 100% uptime for years on end–whether they promised it or not–is just delusional. Getting angry at your vendor for an isolated incident or pooh-poohing the market in general just demonstrates a lack of understanding of the reality of networked applications and infrastructure.

If you are building an application for the Internet–much less the cloud–you are building a distributed software system. A distributed system, by definition, relies on a network for communication. Some years ago, Peter Deutsch and others at Sun postulated a series of fallacies that tend to be the pitfalls all distributed systems developers run into at one time or another in their careers. Hell, I still have to check my work against these each and every time I design a distributed system.

Key among these is the delusion that the network is reliable. It isn’t, it never has been, and it never will be. For network applications, great design is defined by the application or application system’s ability to weather undesirable states. There are a variety of techniques for achieving this, such as redundancy and caching, but I will dive into those in more depth in a later post. (A great source for these concepts is http://highscalability.com.)
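As a small taste of the caching side before that later post, here is a minimal sketch of a read-through cache that degrades to stale data when the remote store is unreachable. The fetch_from_cloud function is a stand-in for whatever storage API you actually use, not a real S3 call.

```python
# fetch_from_cloud() is a stand-in for a real storage call (S3, etc.) that
# can and will fail; read() degrades to stale cached data rather than erroring.

import time

TTL = 300                                        # serve cached data for 5 minutes
_cache: dict[str, tuple[float, bytes]] = {}      # key -> (fetched_at, value)

class CloudUnavailable(Exception):
    pass

def fetch_from_cloud(key: str) -> bytes:
    """Placeholder for the real remote read; assume it sometimes fails."""
    raise CloudUnavailable(key)

def read(key: str) -> bytes:
    now = time.time()
    if key in _cache and now - _cache[key][0] < TTL:
        return _cache[key][1]                    # fresh enough, skip the network
    try:
        value = fetch_from_cloud(key)
        _cache[key] = (now, value)
        return value
    except CloudUnavailable:
        if key in _cache:                        # degrade gracefully: serve stale data
            return _cache[key][1]
        raise                                    # nothing cached; surface the failure
```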

Some of the true pioneers in the cloud realized this early. Phil Wainwright notes that Alan Williamson of Mediafed made what appears to be a prescient decision to split their processing load between two cloud providers, Amazon EC2/S3 and FlexiScale. Even Amazon themselves use caching to mitigate S3 outages on their retail sites (see bottom of linked post for their statement).

Michael Hickins notes in his E-Piphanies blog that this may be an amazing opportunity for some skilled entrepreneurs to broker failure resistance in the cloud. I agree, but good distributed system hygiene begins at home. The best statement I’ve seen is a comment on ReadWriteWeb:

“People rankled about 5 hours of downtime should try providing the same level of service. In my experience, it’s much easier to write-off your own mistakes (and most organizations do), than it is to understand someone else’s — even when they’re doing a better job than you would.”

Amen, brother.

So, in a near future post I’ll go into some depth about what you can do to utilize a “cloud oriented architecture”. Until then, remember: Only you can prevent distributed application failures.