
Archive for the ‘service levels’ Category

Cloud Summit Executive: Making "Pay-As-You-Go" Customers Happy

October 15, 2008

I got to spend a few hours yesterday at the Cloud Summit Executive conference at the Computer History Museum in Mountain View, California. The Summit was unique, at least in the Bay Area, for its focus on the business side of cloud computing. The networking was quite good, with a variety of CxOs digging into what the opportunities and challenges are for both customers and providers. The feedback that I got from those who attended the entire day was that the summit opened eyes and expanded minds.

Ken Oestreich has great coverage of the full summit.

I attended the afternoon keynotes, which started with a presentation from SAP CTO Vishal Sikka that seemed at first like the usual fluff about how global enterprise markets will be addressed by SaaS (read “SAP is the answer”). However, Vishal managed to weave in an important message about how the cloud will initially make some things worse, namely integration and data integrity. Elasticity, he noted, is the key selling point of the cloud right now, and it will be required to meet the needs of a heterogeneous world. Software vendors, pay attention to that.

For me, the highlight of the conference was a panel led by David Berlind of TechWeb, consisting of Art Wittmann of Information Week, Carolyn Lawson, CIO of the California Public Utilities Commission, and Anthony Hill, CIO of Golden Gate University. Ken covers the basic discussion quite well, but I took something away from the comments that he missed: both CIOs seemed to agree that the contractual agreement between cloud customer and cloud provider should be different from those normally seen in the managed hosting world. Instead of being service level agreement (SLA) driven, cloud agreements should base termination rights strictly on customer satisfaction.

This was eye-opening for me, as I had always focused on service level automation as the way to manage uptime and performance in the cloud. I just assumed that the business relationship between customer and provider would include distinct service level agreements. However, Hill was adamant that his best SaaS relationships were those that gave him both subscription (or pay-as-you-use) pricing and the right to terminate the agreement for any reason with 30 days’ notice.

Why would any vendor agree to this? As Hill points out, it gives the customer a feeling of control without removing any of the real barriers to termination that the customer faces today, the most important of which is the cost of migrating off of the provider’s service. Carolyn generalized the benefits of concepts like this beautifully when she said something to the effect of

“The cloud vendor community has to understand what [using an off premises cloud service] looks like to me — it feels like a free fall. I can’t touch things like I can in my own data center (e.g. AC, racks, power monitors, etc.), I can’t yell at people who report to me. Give me a sense of control; give me options if something goes wrong.”

In other words, base service success on customer satisfaction, and provide options if something goes wrong.

What a beautifully concise way to put a very heartfelt need of most cloud consumers–control over their company’s IT destiny. By providing different ways that a customer can handle unexpected situations, a cloud provider is signaling that it honors who the ultimate boss is in a cloud computing transaction–the customer.

Hill loved his 30-day termination-notice clause with one of his vendors, and I can see why. Not because he expects to use it, but because it lets both him and the vendor know that Golden Gate University has the decision-making power, and the cloud vendor serves at the pleasure of Golden Gate University, not the other way around.

Cisco’s Nexus 1000v and the Cloud: Is it really a big deal?

September 17, 2008

Yesterday, the big announcements at VMWorld 2008 were about Cloud OSes. Today, the big news seemed to be Maritz’s keynote (where he apparently laid out an amazing vision of what VMWare thinks they can achieve in the coming year), and the long-rumored Cisco virtual switch.

The latter looks to be better than I had hoped for functionally, though perhaps a little more locked into VMWare than I’d like. There is an explanation for that lock-in, however, so it may not be so bad…see below.

I’ve already explained why I love the Nexus concept so much. Today, Cisco and VMWare jointly announced the Nexus 1000v virtual machine access switch, a fully VI-compatible software switch that…well, I’ll let Cisco’s data sheet explain it:

“The Cisco Nexus™ 1000V virtual machine access switch is an intelligent software switch implementation for VMware ESX environments. Running inside of the VMware ESX hypervisor, the Cisco Nexus 1000V supports Cisco® VN-Link server virtualization technology, providing

  • Policy-based virtual machine (VM) connectivity
  • Mobile VM security and network policy, and
  • Non-disruptive operational model for your server virtualization and networking teams.

When server virtualization is deployed in the data center, virtual servers typically are not managed the same way as physical servers. Server virtualization is treated as a special deployment, leading to longer deployment time with a greater degree of coordination among server, network, storage, and security administrators. But with the Cisco Nexus 1000V you can have a consistent networking feature set and provisioning process all the way from the VM to the access, aggregation, and core switches. Your virtual servers can use the same network configuration, security policy, tools, and operational models as physical servers. Virtualization administrators can leverage predefined network policy that follows the nomadic VM and focus on virtual machine administration. This comprehensive set of capabilities helps you to deploy server virtualization faster and realize its benefits sooner.”

In other words, the 1000v is a completely equal player in a Cisco fabric, and can fully leverage all of the skill sets and policy management available in its other switches. Think “my sys admins can do what they do best, and my network admins can do what they do best”. Furthermore, it supports VN-Link, which allows VMWare systems running on Cisco fabric to VMotion without losing any network or security configuration. Read that last sentence again.

(I wrote some time ago about network administrators facing the most change from this whole pooled-resource thing–this feature seals the deal. Those static network maps they used to hang on the wall, showing exactly which system was connected to which switch port with which IP address, are now almost entirely obsolete.)

I love that feature. I will love it even more if it functions in its entirety in the vCloud concept that VMWare is pitching, and all indications are that it will. So, to tell the story here as simply as possible:

  • You create a group of VMs for a distributed application in VConsole
  • You assign network security and policy via Cisco tools, using the same interface as on the physical switches
  • You configure VMWare to allow VMs for the application to get capacity from an external vendor–one of dozens supporting vCloud
  • When an unexpected peak hits, your VM cluster grabs additional capacity as required in the external cloud, without losing network policy and security configurations.

Cloud computing nirvana.
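To make that story concrete, here is a rough sketch of the same flow in plain Python. Every class and function name in it is hypothetical–none of this corresponds to an actual VMware or Cisco API–it simply mirrors the four steps above, with the network policy defined once and carried along when burst capacity is requested.

```python
# Hypothetical sketch of the "elastic burst with policy intact" workflow above.
# None of these classes or functions correspond to real VMware or Cisco APIs;
# they only mirror the four steps described in the list.

from dataclasses import dataclass, field
from typing import List


@dataclass
class NetworkPolicy:
    """Security and connectivity policy, defined once in the network tools (step 2)."""
    name: str
    allowed_ports: List[int] = field(default_factory=list)


@dataclass
class VMGroup:
    """A group of VMs backing one distributed application (step 1)."""
    app_name: str
    policy: NetworkPolicy
    vm_count: int
    burst_enabled: bool = True  # step 3: external capacity is permitted


def provision_burst_capacity(group: VMGroup, extra_vms: int) -> None:
    """Step 4: grab external capacity while the same policy travels with the VMs."""
    if not group.burst_enabled:
        raise RuntimeError(f"External burst capacity not permitted for {group.app_name}")
    # The point of the VN-Link story above: nothing is reconfigured by hand on
    # the external side, because the policy object rides along with the request.
    print(f"Requesting {extra_vms} VMs for {group.app_name} "
          f"with policy '{group.policy.name}' attached")
    group.vm_count += extra_vms


if __name__ == "__main__":
    web_policy = NetworkPolicy(name="web-tier", allowed_ports=[80, 443])
    storefront = VMGroup(app_name="storefront", policy=web_policy, vm_count=4)
    provision_burst_capacity(storefront, extra_vms=6)  # the unexpected peak hits
```

The one design point worth noticing is that the policy is attached to the VM group rather than to any particular data center, which is exactly what makes the burst non-disruptive.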

Now, there are some disappointments, as I hinted above. First, the switch is not stackable, as originally hoped, though the interconnectivity provided by VN-Link probably makes up for that. (Is VN-Link just another way to “stack” switches? Networking is not my strong point.)

Update: In the comments below, Omar Sultan of Cisco notes that the switches are, in fact, “virtually stackable”, meaning they can be distributed across multiple physical systems, creating a single network domain for a cluster of machines. I understand that just enough to be dangerous, so I’ll stop there.

More importantly, I was initially kind of ticked off that Cisco partnered so closely with VMWare without being careful to note that they would be releasing similar technologies with Citrix and Red Hat at a minimum. But, as I thought about it, Citrix hitched its wagon to 3TERA, and 3TERA owns every aspect of the logical infrastructure an application runs on. In AppLogic, you have to use their network representation, load balancers, and so on as a part of your application infrastructure definition, and 3TERA maps those to real resources as it sees fit. For network connections, it relies on a “Logical Connection Manager (LCM)”:

“The logical connection manager implements a key service that abstracts intercomponent communications. It enables AppLogic to define all interactions between components of an application in terms of point-to-point logical connections between virtual appliances. The interactions are controlled and tunneled across physical networks, allowing AppLogic to enforce interaction protocols, detect security breaches and migrate live TCP connections from one IP network to another transparently.”

(from the AppLogic Grid Operating System Technical Overview: System Services)

Thus, there is no concept of a virtual switch, per se, in AppLogic. A quick look at their site shows no other partners in the virtual networking or load balancing space (though Nirvanix is a virtual storage partner), so perhaps Cisco simply hasn’t been given the opportunity or the hooks to participate in the Xen/3TERA Cloud OS.

(If anyone at 3TERA would like to clarify, I would be extremely grateful. If Cisco should be partnering here, I would be happy to add some pressure to them to do so.)

As for Red Hat, I honestly don’t know anything about their VMM, so I can’t guess at why Cisco didn’t do anything there…although my gut tells me that I won’t be waiting long to hear about a partnership between those two.

This switch makes VMWare VMs equal players in the data center network, and that alone is going to disrupt a lot of traditional IT practices. While I was at Cassatt, I remember a colleague predicting that absolutely everything would run in a VM by the end of this decade. That still seems a little aggressive to me, but a lot less so than it did yesterday.

The Cloud Computing Bill of Rights

Update: Title and version number added before Cloud Computing Bill of Rights text below.

Before you architect your application systems for the cloud, you have to set some ground rules on what to expect from the cloud vendors you either directly or indirectly leverage. It is important that you walk into these relationships with certain expectations for both the short and long term–both those that protect you and those that protect the vendor.

This post is an attempt to capture many of the core rights that both customers and vendors of the cloud should come to expect, with the goal of setting that baseline for future Cloud Oriented Architecture discussions.

This is but a first pass, presented to the community for feedback, discussion, argument and–if deserved–derision. Your comments below will be greatly appreciated in any case.

The Cloud Computing Bill of Rights (0.1)

In the course of technical history, there exist few critical innovations that forever change the way technical economies operate, forever changing the expectations that customers and vendors have of each other, and the architectures on which both rely for commerce. We, the parties entering into a new era driven by one such innovation–that of network based services, platforms and applications, known at the time of this writing as “cloud computing”–do hereby avow the following (mostly) inalienable rights:

  • Article I: Customers Own Their Data

    1. No vendor shall, in the course of its relationship with any customer, claim ownership of any data uploaded, created, generated, modified, hosted or in any other way associated with the customer’s intellectual property, engineering effort or media creativity. This also includes account configuration data, customer-generated tags and categories, usage and traffic metrics, and any other form of analytics or metadata collection.

      Customer data is understood to include all data directly maintained by the customer, as well as that of the customer’s own customers. It is also understood to include all source code and data related to configuring and operating software directly developed by the customer, except for data expressly owned by the underlying infrastructure or platform provided by the vendor.

    2. Vendors shall always provide, at a minimum, API-level access to all customer data as described above. This API-level access will allow the customer to write software which, when executed against the API, allows access to any customer-maintained data, either in bulk or record-by-record as needed. As standards and protocols are defined that allow for bulk or real-time movement of data between cloud vendors, each vendor will endeavor to implement such technologies, and will not attempt to stall such implementation in an attempt to lock in its customers. (A brief sketch of what this access looks like from the customer’s side follows the articles below.)

  • Article II: Vendors and Customers Jointly Own System Service Levels

    1. Vendors own, and shall do everything in their power to meet, service level targets committed to with any given customer. All required effort and expense necessary to meet those explicit service levels will be spent freely and without additional expense to the customer. While the specific legally binding contracts or business agreements will spell out these requirements, it is noted here that these service level agreements are entered into expressly to protect the customer’s business interests, and all decisions by the vendor will take this into account.

      Where no explicit service level agreement exists with a customer, the vendor will endeavor to meet any expressed service level targets provided in marketing literature or the like. At no time will it be acceptable for a vendor to declare a high level of service at a base price, only to later indicate that that level of service is only available at a higher premium price.

      It is perfectly acceptable, however, for a vendor to expressly sell a higher level of service at a higher price, as long as they make that clear at all points where a customer may evaluate or purchase the service.

    2. Ultimately, though, customers own their service level commitments to their own internal or external customers, and they understand that it is their responsibility to take into account possible failures by each vendor that they do business with.

      Customers relying on a single vendor to meet their own service level commitments enter into an implicit agreement to tie their own service level commitments to the vendor’s, and to live and die by the vendor’s own infrastructure reliability. Those customers who take their own commitments seriously will seek to build or obtain independent monitoring, failure recovery and disaster recovery systems.

    3. Where customer/vendor system integration is necessary, the vendor must offer options for monitoring the viability of that integration at as many architectural levels as required to allow the customer to meet their own service level commitments. Where standards exist for such monitoring, the vendor will implement those standards in a timely and complete fashion. The vendor should not underestimate the importance of this monitoring to the customer’s own business commitments.

  • Article III: Vendors Own Their Interfaces

    1. Vendors are under no obligation to provide “open” or “standard” interfaces, other than as described above for data access and monitoring. APIs for modifying the user experience, frameworks for building extensions or even complete applications for the vendor platform, and other such technologies can be developed however the vendor sees fit. If a vendor chooses to require developers to write applications in a custom programming language with esoteric data storage algorithms and heavily piracy-protected execution systems, so be it.

      If it seems that this cedes all of the customer’s power in the business relationship to the vendor, it does not. As the “cloud” is a marketplace of technology infrastructures, platforms and applications, the customer exercises their power by choosing where to spend their hard-earned money. A decision to select a platform vendor that locks you into proprietary Python libraries, for instance, is a choice to support such programming lock-in. On the other hand, insistence on portable virtual machine formats will drive the market towards a true commodity compute capacity model.

      The key reason for giving vendors such power is to maximize innovation. By restricting how technology gets developed or released, the market risks restricting the ways in which technologists can innovate. History shows that eventually the “open” market catches up to most innovations (or bypasses them altogether), and the pace at which this happens is greatly accelerated by open source. Nonetheless, forcing innovation through open source or any other single method runs the risk of weakening capitalist entrepreneurial risk taking.

    2. The customer, however, has the right to use any method legally possible to extend, replicate, leverage or better any given vendor technology. If a vendor provides a proprietary API for virtual machine management in their cloud, customers (aka “the community”, in this case) have every right to experiment with “home grown” implementations of alternative technologies using that same API. This is also true for replicating cloud platform functionality, or even complete applications–though, again, the right only extends to legal means.

      Possibly the best thing a cloud vendor can do to extend their community and encourage innovation on their platform from community members is to open their platform as much as possible. By making themselves the “reference platform” for their respective market space, an open vendor creates a petri dish of sorts for cultivating differentiating features and successes on their platform. Protective proprietary vendors are on their own.

These three articles serve as the baseline for customer/vendor relationships in the new network-based computing marketplace. No claim is made that this document is complete, or final. These articles may be changed or extended at any time, and additional articles can be declared, whether in response to new technologies or business models, or simply to reflect the business reality of the marketplace. It is also a community document, and others are encouraged to bend and shape it in their own venues.
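Article I.2, for example, implies something very concrete from the customer’s seat: you should be able to write a small program against the vendor’s API and walk away with every record you own, in bulk or record-by-record. The sketch below shows roughly what that looks like. The endpoint, query parameters and auth header are assumptions made purely for illustration; substitute whatever your vendor actually documents.

```python
# Minimal sketch of the customer-side export that Article I assumes is possible.
# The endpoint, query parameters, and auth header below are hypothetical; the
# only point is that a simple paging loop can pull every record the customer owns.

import json
import urllib.parse
import urllib.request

API_BASE = "https://api.example-cloud-vendor.com/v1/records"   # hypothetical endpoint
API_TOKEN = "replace-with-your-token"


def export_all_records(page_size=500):
    """Yield every customer-owned record, page by page."""
    cursor = None
    while True:
        params = {"limit": page_size}
        if cursor:
            params["cursor"] = cursor
        url = f"{API_BASE}?{urllib.parse.urlencode(params)}"
        request = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {API_TOKEN}"})
        with urllib.request.urlopen(request) as response:
            payload = json.load(response)
        for record in payload.get("records", []):
            yield record
        cursor = payload.get("next_cursor")
        if not cursor:
            break   # no more pages; the export is complete


if __name__ == "__main__":
    # Keep an independent copy of everything the vendor holds on your behalf.
    with open("export.jsonl", "w") as backup:
        for record in export_all_records():
            backup.write(json.dumps(record) + "\n")
```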

Comments, complaints or questions can be directed to the author through the comments section below.

The Principles of a Cloud Oriented Architecture

The market is hot. The technologies are appearing fast and furious. The tools you need are out there, but they are young, often untested, and of unpredictable reliability. You’ve researched the economics, and you know now that cloud computing is a) here to stay, and b) offers economic advantages that–if realized–could stretch your IT budget and quite possibly catapult your career.

Now what?

What is often overlooked in the gleeful rush to cloud computing is the difficulty of molding the early technologies in the space into a truly bulletproof (or even bullet-resistant) business infrastructure. You see it all over the Internet: the push and pull between innovation and reliability, the concerns about security, monitoring and control, even the constant confusion over what constitutes cloud computing, what technologies to select for a given problem, and how to create an enterprise-class business system out of those technologies.

The truth is, cloud computing doesn’t launch our technical architectures into the future. It is, at its heart, an economic model that drives the parameters around how you acquire, pay for and scale the infrastructure architectures you already know. It’s not a question of changing the problems you must solve when utilizing data centers, just a change to the division of responsibilities amongst yourself, your organization, your cloud providers and the Internet itself.

To this end, I offer you a series of posts (perhaps moving to a wiki in the near future) describing in depth my research into what it takes to deliver a systems architecture with the following traits:

  1. Partially or entirely incorporates the cloud for at least one layer of the Infrastructure/Platform/Application stack.
  2. Is focused on consumers of cloud technologies, not the requirements of those delivering cloud infrastructures, whether public or private (or even dark).
  3. Takes into account the variety of technical, economic and even political factors that affect systems running in the “cloud”.
  4. Is focused at least as much on the operational aspects of the system as on the design and development aspects.

The idea here is not to introduce an entirely new paradigm–that’s the last thing we need given the complexity of the task ahead of us. Nor is it to replace the basic principles of SOA or any other software architecture. Rather, the focus of this series is on how to best prepare for the new set of requirements before us.

Think about it. We already deal (or try to deal) with a world in which we don’t entirely have control over every aspect of the world our applications live in. If we are software developers, we rely on others to build our servers, configure our networks, provide us storage and weld them all together into a cohesive unit. System administrators are, in large enterprises anyway, specializing in OS/application stacks, networking, storage or system management. (Increasingly you can add facilities and traditional utilities to this list.)

Even when we outsource to others–shifting responsibility for management of parts or all of our IT infrastructure to a vendor–the vendor doesn’t have control over significant elements of the end-to-end operations of our applications; namely, the Internet itself. But with outsourcing, we typically turn over entire, intact architecture stacks, with a few, very well bounded integration points to manage (if any) between outsourced systems and locally maintained systems.

The cloud is going to mess this up. I say this not just because the business relationship is different from outsourcing, but also because what you are “turning over” can be a *part* of a system stack. SmugMug outsources storage and job processing, but not the web experience that relies on both. Applications that run entirely on EC2/S3 outsource the entire infrastructure, but not the application development, or even the application system management. (This is why RightScale, Hyperic and others are finding some traction with AWS customers.)

To prepare for a cloud oriented architecture, one must understand what responsibilities lie where. So, I’ll give you a teaser of what is to come with the short-short version of where I see these responsibilities lying (subject to change as I talk to others, including yourselves if you choose to comment on this post):

  • The enterprise has responsibility for the following:
    • Defining the business solution to be solved, the use cases that define that solution, and the functional requirements to deliver those use cases
    • Evaluating the selection of technical and economic approaches for delivering those functional requirements, and selecting the best combination of the two. (In other words, the best combination may not contain either the best technical or the best economic selection, but it will beat any other combination of the two.)
    • Owning the service level agreements with the business for the delivery of those use cases. This is critically important. More on this below.
  • The cloud provider has responsibility for the following:
    • Delivering what they promised you (or the market) that they would deliver. No more, no less.
    • Providing you with transparent and honest billing and support services.
  • The Internet itself is only responsible for providing you with an open, survivable infrastructure for interconnecting the networks you need to run your applications and/or services. There are no promises here about reliability or scalability or even availability. It should be considered a technical wilderness, and treated accordingly.

Now, about SLAs. Your cloud provider does not own your SLAs; you do. They may provide some SLAs that support your own, but they are not to be blamed if you fail to achieve the SLAs demanded of you. If your applications or services fail because the cloud failed, you failed. Given that, don’t “outsource” your SLAs, at least not logically. Own them.

In fact, I would argue that the single most important function of a cloud-centric IT shop, after getting the required business functionality up and running in the first place, is monitoring and actively managing that functionality–switching vendors, if necessary, to continue service at required levels. The one big piece of IT-specific software that should always run in IT data centers, in my opinion, is the NOC infrastructure. (Although perhaps in this context it’s more of a Cloud Operations Center, but I hate the resulting acronym for obvious reasons.)
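To put a little flesh on that Cloud Operations Center idea, here is a minimal sketch of the kind of loop such a shop might run in-house. The provider endpoints, thresholds and escalation step are placeholders; the point is only that the measurement, and the decision to escalate or switch vendors, stay on the customer’s side of the fence.

```python
# Minimal sketch of a customer-side availability check against hypothetical
# provider endpoints. Real monitoring tools do far more; the point is only that
# the measurement and the escalation decision stay in-house.

import time
import urllib.request

PROVIDERS = {
    "primary-cloud": "https://app.primary-cloud.example.com/health",      # hypothetical
    "secondary-cloud": "https://app.secondary-cloud.example.com/health",  # hypothetical
}
MAX_CONSECUTIVE_FAILURES = 3


def is_up(url, timeout=5.0):
    """Treat anything other than a clean HTTP 200 within the timeout as a failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        return False


def monitor(poll_interval=60.0):
    failures = {name: 0 for name in PROVIDERS}
    while True:
        for name, url in PROVIDERS.items():
            if is_up(url):
                failures[name] = 0
            else:
                failures[name] += 1
                if failures[name] >= MAX_CONSECUTIVE_FAILURES:
                    # This is where the customer's own SLA policy kicks in:
                    # page someone, fail traffic over, or start a migration.
                    print(f"{name} has missed {failures[name]} checks; escalate")
        time.sleep(poll_interval)


if __name__ == "__main__":
    monitor()
```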

I’ll focus more on these responsibilities in future posts. All posts in this series will be tagged “coa principles”. Please feel free to provide me feedback in the comments, contact me to review your thoughts on this topic, or simply send me links that you think I should be aware of. I am also working to find other bloggers who wish to take ownership of parts of this primer (cloud security, for example), so let me know if you are interested there as well.

I am excited about this. This body of knowledge (or at least the faint traces of knowledge) has been rattling around inside my head for some time, and it feels good to finally be sharing it with you.

Cloud Outages, and Why *You* Have To Design For Failure

I haven’t posted for a while because I have been thinking…a lot…about cloud computing, inevitable data center outages, and what they mean for application architectures. Try as I might to put the problem on the cloud providers, I keep coming back to one bare fact: the cloud is going to expose a lot of the shortcomings of today’s distributed architectures, and this time it’s up to us to make things right.

It all started with some highly informative posts from the Data Center Knowledge blog chronicling outages at major hosting companies, and failures that helped online companies learn important lessons about scaling, etc. As I read these posts, the thought that struck me was, “Well, of course. These types of things are inevitable. Who could possibly predict every possible negative influence on an application, much less a data center?” I’ve been in enough enterprise IT shops to know that even the very best are prepared for something unexpected to happen. In fact, what defines the best shops is that they assume failure and prepare for it.

Then came the stories of disgruntled employees locking down critical information systems or punching the emergency power kill switch on their way out the door. Whether or not you are using the cloud, human psychology being what it is, we have to live every day with immaturity or even just plain insanity.

Yet, each time one of the big-name cloud vendors has an outage–Google had one, as did Amazon a few times, including this weekend–a bunch of IT guys cry out, “Well, there you go. The cloud is not ready for production.”

Baloney, I say. (Well, I actually use different vocabulary, but you get the drift.) The truth is, the cloud is just exposing people’s unreasonable expectations of what a distributed, disparate computing environment provides. The idea that some capacity vendor is going to give you 100% uptime for years on end–whether they promised it or not–is just delusional. Getting angry at your vendor for an isolated incident or pooh-poohing the market in general just demonstrates a lack of understanding of the reality of networked applications and infrastructure.

If you are building an application for the Internet–much less the cloud–you are building a distributed software system. A distributed system, by definition, relies on a network for communication. Some years ago, Peter Deutsch and others at Sun postulated a series of fallacies that tend to be the pitfalls all distributed systems developers run into at one time or another in their careers. Hell, I still have to check my work against these each and every time I design a distributed system.

Key among these is the delusion that the network is reliable. It isn’t, it never has been, and it never will be. For network applications, great design is defined by the application or application system’s ability to weather undesirable states. There are a variety of techniques for achieving this, such as redundancy and caching, but I will dive into those in more depth in a later post. (A great source for these concepts is http://highscalability.com.)
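As one small illustration of weathering undesirable states, here is a generic sketch combining two of those techniques: bounded retries, and falling back to the last cached answer when the network still refuses to cooperate. It is not tied to any particular provider or library.

```python
# Generic sketch of two of the techniques mentioned above: bounded retries
# and falling back to a local cache when the network misbehaves.

import time

_cache = {}  # last known-good results, keyed by request identity


def call_with_fallback(key, remote_call, retries=3, backoff_seconds=1.0):
    """Try the remote call a few times; serve the cached value if it keeps failing."""
    for attempt in range(retries):
        try:
            result = remote_call()
            _cache[key] = result          # refresh the cache on success
            return result
        except (OSError, TimeoutError):
            time.sleep(backoff_seconds * (attempt + 1))   # simple linear backoff
    if key in _cache:
        return _cache[key]                # degrade gracefully with stale data
    raise RuntimeError(f"Remote call '{key}' failed and no cached value exists")


if __name__ == "__main__":
    # Hypothetical usage: fetch a product catalog from a cloud storage service.
    def fetch_catalog():
        raise TimeoutError("simulated network failure")

    _cache["catalog"] = ["widget", "gadget"]              # seeded earlier by a good call
    print(call_with_fallback("catalog", fetch_catalog))   # prints the cached copy
```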

Some of the true pioneers in the cloud realized this early. Phil Wainewright notes that Alan Williamson of Mediafed made what appears to be a prescient decision to split their processing load between two cloud providers, Amazon EC2/S3 and FlexiScale. Even Amazon themselves use caching to mitigate S3 outages on their retail sites (see the bottom of the linked post for their statement).

Michael Hickins notes in his E-Piphanies blog that this may be an amazing opportunity for some skilled entrepreneurs to broker failure resistance in the cloud. I agree, but I think good distributed system hygiene begins at home. I think the best statement is a comment I saw on ReadWriteWeb:

“People rankled about 5 hours of downtime should try providing the same level of service. In my experience, it’s much easier to write-off your own mistakes (and most organizations do), than it is to understand someone else’s — even when they’re doing a better job than you would.”

Amen, brother.

So, in a near future post I’ll go into some depth about what you can do to utilize a “cloud oriented architecture”. Until then, remember: Only you can prevent distributed application failures.

Enterprise Architecture, Business Continuity and Integrating the Cloud

February 28, 2008

(Update: Throughout the original version of this post, I had misspelled Mr. Vambenepe’s name. This is now corrected.)

William Vambenepe, a product architect at Oracle focusing on enterprise management of applications and middleware, pointed me to a post by David Linthicum on IntelligentEnterprise that makes the case for why enterprise architects must plan for SaaS. In a very high-level but well-reasoned post, Linthicum highlights why SaaS systems should be considered a part of enterprise architectures, not tangential to them.

As Vambenepe points out, perhaps the most interesting observation from Linthicum is the following:

Third, get in the mindset of SaaS-delivered systems being enterprise applications, knowing they have to be managed as such. In many instances, enterprise architects are in a state of denial when it comes to SaaS, despite the fact that these SaaS-delivered systems are becoming mission-critical. If you don’t believe that, just see what happens if Salesforce.com has an outage.

I don’t want to simply repeat Vambenepe’s excellent analysis, and I absolutely agree with him. So let me just add something about SLAuto.

Take a look at Vambenepe’s immediate response:

I very much agree with this view and the resulting requirements for us vendors of IT management tools.

Now add the comments from Microsoft’s Gabriel Morgan that I discussed a couple of weeks ago.

Take for example Microsoft Word. Product Features such as Import/Export, Mail Merge, Rich Editing, HTML support, Charts and Graphs and Templates are the types of features that Customer 1.0 values most in a product. SaaS Products are much different because Customer 2.0 demands it. Not only must a product include traditional product features, it must also include operational features such as Configure Service, Manage Service SLA, Manage Add-On Features, Monitor Service Usage Statistics, Self-Service Incident Resolution as well.

Gabriel’s point boiled down to the following equation:

Service Offering = (Product Features) + (Operational Features)

which I find to be entirely in agreement with Linthicum and Vambenepe.

As I am wont to do, let me push “Operational Features” as far as I think they can go.

In the end, what customers want from any service–software, infrastructure or otherwise–is control over the balance of quality, cost and time-to-market. Quality is measured through specific metrics, typically called service level metrics. Service level agreements (SLAs) are commitments to maintain service level metrics within commonly agreed boundaries and rules. Ultimately, all of these “operational features” are about allowing the end user to either

  1. define the service level metrics and/or their boundaries (e.g. define the SLA), or
  2. define how the system should respond if a metric fails to meet the SLA.

Item “2” is SLAuto.

I would argue that what you don’t want is a closed-loop SLAuto offering from any of your vendors. In fact, I propose right here, right now, that a standard (and, I am sure Simon Wardley would argue, open source) protocol or set of protocols be defined for the following:

  1. Defining service level metrics (probably already exists?)
  2. Defining SLA bounds and rules (may also exist?)
  3. Defining alerts or complex events that indicate that an SLA was violated

Vendors could then use these protocols to build Operational Features that support a distributed SLAuto fabric, in which the decision about what to do in severe SLA violations is controlled and managed outside of any individual provider’s infrastructure, preferably at a site of the customer’s choosing. This “customer advocate” SLAuto system would then coordinate with the individual SLAuto of all of the customer’s other business systems to become the automated enforcer of business continuity. In the end, that is the most fundamental role of IT, whether it is distributed or centralized, in any modern, information-driven business.
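To make the proposal slightly less abstract, here is a rough sketch of the three kinds of data such protocols might carry: a metric definition, an SLA bound over it, and the violation event that a customer-side SLAuto coordinator would consume. Every field name here is invented for illustration; a real standard would have to be hashed out in the open.

```python
# Rough sketch of the three pieces of data the proposed protocols would carry.
# All field names are invented for illustration; a real standard would define these.

from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json


@dataclass
class MetricDefinition:            # item 1: what is being measured
    name: str
    unit: str
    description: str


@dataclass
class SLABound:                    # item 2: the agreed boundary and rule
    metric: str
    operator: str                  # e.g. "<=" or ">="
    threshold: float
    window_seconds: int


@dataclass
class ViolationEvent:              # item 3: the alert a customer-side SLAuto engine consumes
    metric: str
    observed_value: float
    bound: SLABound
    vendor: str
    occurred_at: str


if __name__ == "__main__":
    latency = MetricDefinition("response_time_ms", "milliseconds",
                               "95th percentile response time of the checkout service")
    bound = SLABound(metric=latency.name, operator="<=", threshold=500.0, window_seconds=300)
    event = ViolationEvent(metric=latency.name, observed_value=820.0, bound=bound,
                           vendor="example-saas-vendor",
                           occurred_at=datetime.now(timezone.utc).isoformat())
    # The customer-side coordinator would receive something like this over the wire:
    print(json.dumps(asdict(event), indent=2))
```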

“Nice, James,” you say. “Very pretty ‘pie-in-the-sky’ stuff, but none of it exists today. So what are we supposed to do now?”

Implement SLAuto internally in your own data centers with your existing systems, that’s what. Integrate SLAuto for SaaS as you come to understand the Operational Feature APIs from your vendors; those vendors, your SLAuto vendor and/or your own systems talent can then develop interfaces into your SLAuto infrastructure.

Evolve towards nirvana; don’t try to reach it by taking tabs of vendor acid.

If you want more advice on how to do all of this, drop me a line (james dot urquhart at cassatt dot com) or comment below.

One Step To Prepare For Cloud Computing

January 29, 2008

Some of you may be wondering why I am making such a big stink about software architecture on a blog about service level automation (SLAuto). Well, as Todd Biske points out, “the relationships (and potentially collisions) between the worlds of enterprise system management, business process management, web service management, business activity monitoring, and business intelligence” are easier to resolve if the appropriate access to metrics is provided for a software service. For SLAuto, this means the more feedback you can provide from the service, process, data and infrastructure levels of your software architecture, the easier it is to automate service level compliance.

Let’s look at a few examples for each level:

  • Service/Application: From the end user’s perspective, this is what service levels are all about. Key metrics such as transaction rates (how many orders/hour, etc.), response times, error rates, and availability are what the end users of a service (e.g. consumers, business stakeholders, etc.) really care about.
  • Business Process: Business process metrics can warn the SLAuto environment about cross-service issues, business rule violations or other extraordinary conditions in the process cycle that would warrant capacity changes at the BPM or service levels.
  • Data Storage/Management: Primarily, this layer can inform the SLAuto system about storage needs and storage provisioning, which in turn is critical to automated deployment of applications into a dynamic environment.
  • Infrastructure: This is the most common form of metric used to make SLAuto decisions today. Such metrics as CPU utilization, memory utilization and I/O rates are commonly used in both virtualized and non-virtualized automated environments.

As noted, digital measurement of these data points can feed an SLAuto policy engine to trigger capacity adjustment, failure recovery or other applicable actions as necessary to remain within defined service thresholds. While most of the technology required to support SLAuto is available, the truth is that the monitoring/metrics side of things is the most uncharted territory. As an action item, I ask all of you to take Todd’s words of wisdom into account, and design not only for functionality, but also for manageability. This will aid you greatly in the quest to build fluid systems that can best take advantage of utility infrastructure today.
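As a closing illustration of designing for manageability, here is a minimal sketch of metrics from the four levels above feeding a simple policy check–roughly the shape of decision an SLAuto policy engine makes. The metric names, values and thresholds are placeholders, not a real monitoring schema.

```python
# Minimal sketch of metrics from the four levels above feeding a policy check.
# Metric names and thresholds are placeholders, not a real monitoring schema.

from typing import Dict

# Level -> metric name -> current value, as the application would expose them.
metrics: Dict[str, Dict[str, float]] = {
    "service":          {"orders_per_hour": 1400, "p95_response_ms": 950, "error_rate": 0.02},
    "business_process": {"stalled_processes": 3},
    "data":             {"storage_used_pct": 82.0},
    "infrastructure":   {"cpu_pct": 88.0, "memory_pct": 71.0},
}

# Policy: (level, metric, breach test, action the SLAuto engine should trigger).
POLICIES = [
    ("service", "p95_response_ms", lambda v: v > 800, "add application capacity"),
    ("infrastructure", "cpu_pct",  lambda v: v > 85,  "provision another server"),
    ("data", "storage_used_pct",   lambda v: v > 90,  "expand storage allocation"),
]


def evaluate(snapshot: Dict[str, Dict[str, float]]) -> None:
    for level, metric, breached, action in POLICIES:
        value = snapshot.get(level, {}).get(metric)
        if value is not None and breached(value):
            # In a real SLAuto environment this would call the provisioning system.
            print(f"[{level}] {metric}={value} out of bounds -> {action}")


if __name__ == "__main__":
    evaluate(metrics)
```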