Archive for the ‘analyze’ Category

Two Announcements to Pay Attention To This Week

November 12, 2008

I know I promised a post on how the network fits into cloud computing, but after a wonderful first part of my week spent first catching up on reading, then one-on-one with my 4-year-old son, I’m finally digging into what’s happened in the last two days in the cloud-o-sphere. While the network post remains important to me, there were several announcements that caught my eye, and I thought I’d run through two of them quickly and give you a sense of why they matter.

The first announcement came from Replicate Technologies, Rich Miller’s young company, which is focusing initially on virtualization configuration analysis. The Replicate Datacenter Analyzer (RDA) is a powerful analysis and management tool for evaluating the configuration and deployment of virtual machines in an enterprise data center environment. But it goes beyond just evaluating the VMs themselves, to evaluating the server, network and storage configuration required to support things like vMotion.

Sounds boring, and perhaps not cloud related? Well, if you read Rich’s blog in depth, you may find that he has a very interesting longer-term objective. Building on the success of RDA, Replicate aims to become a core element of a virtualized data center operations platform, eventually including hybrid cloud configurations, etc. While initially focused on individual servers, one excellent vision that Rich has is to manage the relationships between VMs in such a tool, so that operations taken on one server will take into account the dependencies on other servers. Very cool.

Watch the introductory video for the fastest impression of what Replicate has here. If you manage virtual machines, sit up and take notice.

The other announcement that caught my eye was the new positioning and features introduced this week by my alma mater, Cassatt Corporation. I’ve often argued that Cassatt is an excellent example of a private cloud infrastructure, and now they are actively promoting themselves as such (although they use the term “internal cloud”).

It’s about freaking time. With a mature, “burned in”, relatively technology agnostic platform that has perhaps the easiest policy management user experience ever (though not necessarily the prettiest), Cassatt has always been one of my favorite infrastructure plays (though I admit some bias). They support an incredible array of hardware, virtualization and OS platforms, and provide the rare ability to manage not only virtual machines, but also bare metal systems. You get automated power management, resource optimization, image management, and dynamic provisioning. For the latter, not only is server provisioning automated, but also network provisioning–such that deploying an image on a server triggers Cassatt to reprogram the ports that the target server is connected to so that they sit on the correct VLAN for the software about to be booted.

The announcement talks a lot about Cassatt’s monitoring capabilities, and a service they provide around application profiling. I haven’t been briefed about these, but given their experience with server power management (a very “profiling focused” activity) I believe they could probably have some unique value-add there. What I remember from six months ago was that they introduced improved dynamic load allocation capabilities that could use just about any digital metric (technical or business oriented) to set upper and lower performance thresholds for scaling. Thus, you could use CPU utilization, transaction rates, user sessions or even market activity to determine the need for more or fewer servers for an application. Not too many others break away from the easy CPU/memory utilization stats to drive scale.
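To make the idea of threshold-driven scaling on an arbitrary metric concrete, here is a minimal sketch. It is not Cassatt’s interface; the `DemandPolicy` class, the metric names and the one-server-at-a-time step are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class DemandPolicy:
    """Hypothetical policy: scale on any numeric metric, not just CPU."""
    metric: str           # e.g. "cpu_pct", "transactions_per_sec", "user_sessions"
    lower: float          # below this value, capacity can be released
    upper: float          # above this value, more capacity is needed
    min_servers: int = 1
    max_servers: int = 10


def desired_servers(policy: DemandPolicy, current: int, sample: Dict[str, float]) -> int:
    """Return the server count the policy asks for, given one metric sample."""
    value = sample[policy.metric]
    if value > policy.upper:
        return min(current + 1, policy.max_servers)
    if value < policy.lower:
        return max(current - 1, policy.min_servers)
    return current


# A business-oriented metric drives the decision; CPU is present but ignored.
policy = DemandPolicy(metric="transactions_per_sec", lower=50, upper=400)
busy = {"cpu_pct": 12.0, "transactions_per_sec": 520.0}
quiet = {"cpu_pct": 95.0, "transactions_per_sec": 30.0}
print(desired_servers(policy, 3, busy))
print(desired_servers(policy, 3, quiet))
```

Note that the second sample has high CPU but a low transaction rate, and the policy still releases a server. That is the point of letting the user pick the metric that actually reflects demand.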

Now, having said all those nice things, I have to take Cassatt to task for one thing. Throughout the press release, Cassatt talks about Amazon- and Google-like infrastructure. However, Cassatt is doing nothing to replicate the APIs of either Amazon (which would be logical) or Google (which would make no sense at all). In other words, as announced, Cassatt is building on their own proprietary protocols and interfaces, with no ties to any external clouds or alternative cloud platforms. This is not a very “commodity cloud computing” friendly approach, and obviously I would like to see that changed. And, the truth is, none of their direct competitors are doing so, either (with the possible exception of the academic research project, EUCALYPTUS).

The short short of it is if you are looking at building a private cloud, don’t overlook Cassatt.

There was another announcement from Hyperic that I want to comment on, but I’m due to chat with a Hyperic executive soon, so I’ll reserve that post for later. The fall of 2008 remains a heady time for cloud computing, so expect many more of these types of posts in the coming weeks.


Elements of a Cloud Oriented Architecture

In my post, The Principles of Cloud Oriented Architectures, I introduced you to the concept of a software system architecture designed with “the cloud” in mind:

“…I offer you a series of posts…describing in depth my research into what it takes to deliver a systems architecture with the following traits:

  1. It partially or entirely incorporates the clouds for at least one layer of the Infrastructure/Platform/Application stack.
  2. Is focused on consumers of cloud technologies, not the requirements of those delivering cloud infrastructures, either public or private (or even dark).
  3. Takes into account a variety of technical, economic and even political factors that systems running in the “cloud” must take into account.
  4. Is focused at least as much on the operational aspects of the system as the design and development aspects

The idea here is not to introduce an entirely new paradigm–that’s the last thing we need given the complexity of the task ahead of us. Nor is it to replace the basic principles of SOA or any other software architecture. Rather, the focus of this series is on how to best prepare for the new set of requirements before us.”

I followed that up with a post (well, two really) that set out to define what our expectations of “the cloud” ought to be. The idea behind the Cloud Computing Bill of Rights was not to lay out a policy platform–though I am flattered that some would like to use it as the basis of one— but rather to set out some guidelines about what cloud computing customers should anticipate in their architectures. In this continuing “COA principles” series, I intend to lay out what can be done to leverage what vendors deliver, and design around what they fail to deliver.

With that basic framework laid out, the next step is to break down what technology elements need to be considered when engineering for the cloud. This post will cover only the list of some such elements as I understand them today (and feel free to use the comments below to add your own insights), and future posts will provide a more thorough analysis of individual elements and/or related groups of elements. The series is really very “stream of consciousness”, so don’t expect too much structure or continuity.

When considering what elements matter in a Cloud Oriented Architecture, we consider first that we are talking about distributed systems. Simply utilizing a hosted SaaS application to do your Customer Relationship Management doesn’t require an architecture; integrating it with your SAP billing systems does. As your SAP systems most likely don’t run in that vendor’s data centers, the latter is a distributed systems problem.

Most distributed systems problems have just a few basic elements. For example:

  • Distribution of responsibilities among component parts

  • Dependency management between those component parts

  • Scalability and reliability

    • Of the system as a whole
    • Of each component
  • Data Access and Management

  • Communication and Networking

  • Monitoring and Systems Management

However, because cloud computing involves leveraging services and systems entirely outside of the architect’s control, several additional issues must be considered. Again, for example:

  • How are the responsibilities of a complex distributed system best managed when the services being consumed are relatively fixed in the tasks they can perform?

  • How are the cloud customer’s own SLA commitments best addressed when the ability to monitor and manage components of the system may be below the standards required for the task?

  • How are the economics of the cloud best leveraged?

    • How can a company gain the most work for the least amount of money?
    • How can a company leverage the cloud marketplace for not just cost savings, but also increased availability and system performance?

In an attempt to address the more cloud-specific distributed systems architecture issues, I’ve come up with the following list of elements to be addressed in a typical Cloud Oriented Architecture:

  • Service Fluidity – How does the system best allow for static redeployment and/or “live motion” of component pieces within and across hardware, facility and network boundaries? Specific issues to consider here include:

    • Distributed application architecture, or how is the system designed to manage component dependencies while allowing the system to dynamically find each component as required? (Hint: this problem has been studied thoroughly by such practices as SOA, EDA, etc.)
    • Network resiliency, or how does the system respond to changes in network location, including changes in IP addressing, routing and security?
  • Monitoring – How is the behavior and effectiveness of the system measured and tracked both to meet existing SLAs, as well as to allow developers to improve the overall system in the future? Issues to be considered here include:

    • Load monitoring, or how do you measure system load when system components are managed by multiple vendors with little or no formal agreement on how to share such data with the customer or each other?
    • Cost monitoring, or how does the customer get an accurate accounting of the costs associated with running the system from their point of view?
  • Management – How does the customer configure and maintain the overall system based on current and ongoing technical and business requirements? Examples of what needs to be considered here include:

    • Cost, or what adjustments can be made to the system capacity or deployment to provide the required amount of service capacity at the lowest cost possible? This includes ways to manage the efficiency of computation, networking and storage.
    • Scalability, or how does the system itself allow changes to capacity to meet required workloads? These changes can happen:
      • vertically (e.g. get a bigger box for existing components–physically or virtually)
      • horizontally (e.g. add or remove additional instances of one or more components as required)
      • from a network latency perspective (adjust the ways in which the system accesses the network in order to increase overall system performance)
    • Availability, or how does the system react to failure of any one component, or any group of components (e.g. when an entire vendor cloud goes offline)?
  • Compliance – How does the overall system meet organizational, industry and legislative regulatory requirements–again, despite being made up of components from a variety of vendors who may themselves provide computing in a variety of legal jurisdictions?

Now comes the fun of breaking these down a bit, and talking about specific technologies and practices that can address them. Please, give me your feedback (or write up your criticism on your own blog, but link here so I can find you). Point me towards references to other ways to think about the problem. I look forward to the conversation.

Cassatt Announces Active Response 5.1 with Demand Based Policies

Ken Oestreich blogged recently about the very cool, probably landmark release of Cassatt that just became available, Cassatt Active Response 5.1. He very eloquently runs down the biggest feature–demand based policies–so I won’t repeat all of that here. What I thought I would do instead is relate my personal thoughts on monitoring based policies and how they are the key disruptive technology for data centers today.

To be sure, everyone is talking about server virtualization in the data center market today, and that’s fine. Its core short-term benefit, physical system consolidation and increased utilization, is key for cost-constrained IT departments, and features such as live motion and automatic backup are creating new opportunities that should be carefully considered. However, virtualization alone is limited in its applications, and does little to actually optimize a data center over time. (This is why VMware is emphasizing management over just virtualizing servers these days.)

The technology that will make the long term difference is resource optimization: applying automation technologies to tuning how and when physical and virtual infrastructure is used to solve specific business needs. It is the automation software that will really change the “deploy and babysit” culture of most data centers and labs today. The new description will be more like “deploy and ignore”.

To really optimize resource usage in real time, the automation software must use a combination of monitoring (aka “measure“), a policy engine or other logic system (aka “analyze“) and interfaces to the control systems of the equipment and software it is managing (aka “respond“). It turns out that the “respond” part of the equation is actually pretty straightforward–lots of work, but straightforward. Just write “driver” like components that know how to talk to various data center equipment (e.g. Windows, DRAC, Cisco NX-OS, NetApp Data ONTAP, etc.), as well as handle error conditions by directly responding or forwarding the information to the policy engine.

The other two, however, require more immediate configuration by the end user. Measure and analyze, in fact, are where the entire set of Service Level Automation (SLAuto) parameters are defined and executed on. So, this is where the key user interface between the SLAuto system and end user has to happen.

What Cassatt has announced is a new user interface to define demand based policies as the end user sees fit. For example, what defines an idle server? Some systems use very little CPU while they wait for something to happen (at which point they get much busier), so simply measuring CPU isn’t good enough in those cases. Ditto for memory in systems that are compute intensive but handle very little state.

What Cassatt did that is so brilliant (and so unique) is to allow the end user to leverage the full range of SNMP attributes for their OS, as well as JMX and even scripts running on the monitored system to create expressions that define an idle metric that is right for that system. For example, on a test system you may in fact say that a system is idle when the master test controller software indicates that no test is being run on that box. On another system, you may say it’s idle when no user accounts are currently active. It’s up to you to define when to attempt to shut down a box, or reduce capacity for a scale-out application.

Even when such an “idle” system is identified, Cassatt gives you the ability to go further and write some “spot checks” to make sure the system is actually OK to shut down. For example, in the aforementioned test system, Cassatt may determine that it’s worth trying to power down a system, but a spot check could be run to determine whether a given process is still running, or whether an administrator account is actively logged in to the box, either of which would indicate to Cassatt that it should ignore that system for now.

I know of no one else that has this level of GUI configurable monitor/analyze/respond sophistication today. If anyone wants to challenge that, feel free. Now that I no longer work at Cassatt, I’d be happy to learn about (and write about) alternatives in the marketplace. Just remember that it has to be easy to configure and execute these policies, and scripting the policies themselves is not good enough.

It is clear from the rush to release resource optimization products for the cloud, such as RightScale, Scalr, and others, that this will be a key feature for distributed systems moving forward. In my opinion, Cassatt has launched itself into the lead spot for on premises enterprise utility computing. I can’t wait to see who responds with the next great advancement.

Disclaimer: I am a Cassatt shareholder (or soon will be).

An easy way to get started with SLAuto

September 4, 2007

It’s been an interesting week, leading up to the Labor Day weekend, but as of this morning I get to talk more openly about one project that has been taking a great deal of my time. As I have blogged about Service Level Automation (“SLAuto”), it may have dawned on some of you that achieving nirvana here means changing a lot about your current architecture and practices.

For example, decoupling software from hardware is easy to say, but requires significant planning and execution to implement (though this can be simplified somewhat with the right platform). Building the correct monitors, policies and interfaces is also time intensive work that requires the correct platform for success. However, as noted before, the biggest barriers to implementing SLAuto and utility computing are cultural.

There is an opportunity out there right now to introduce SLAuto without all of the trappings of utility computing, especially the difficult decoupling of software from hardware. It is an opportunity that Silicon Valley is going ga-ga over, and it is a real problem with real dollar costs for every data center on the planet.

The opportunity is energy consumption management, aka the “green data center”.

Rather than pitch Cassatt’s solution directly, I prefer to talk about the technical opportunity as a whole. So let’s evaluate what is going on in the “GDC” space these days. As I see it, there are three basic technical approaches to “green” right now:

  1. More efficient equipment, e.g. more power efficient chips, server architectures, power distribution systems, etc.
  2. More efficient cooling, e.g. hot/cold aisles, liquid cooling, outside air systems, etc.
  3. Consolidation, e.g. virtualization, mainframes, etc.

Still, there is something obvious missing here: no matter which of these technologies you consider, not one of them is actually going to turn off unused capacity. In other words, while everyone is working to build a better light bulb or to design your lighting so you need fewer bulbs, no one is turning off the lights when no one is in the room.

That’s where SLAuto comes in. I contend that there are huge tracts of computing in any large enterprise where compute capacity runs idle for extended periods. Desktop systems are certainly one of the biggest offenders, as are grid computing environments that are not pushed to maximum capacity at all times. However, possibly the biggest offender in any organization that does in-house development, extensive packaged system customization or business system integration is the dev/test environment.

Imagine such a lab where capacity that will be unused each evening/weekend, or for all but two weeks of a typical development cycle, or at all times except when testing a patch to a three year old rev of product, was shut down until needed. Turned off. Non-operational. Idle, but not idling.

Of course, most lab administrators probably feel extremely uncomfortable with this proposition. How are you going to do this without affecting developer/QA productivity? How do you know it’s OK to turn off a system? Why would my engineers even consider allowing their systems to be managed this way?

SLAuto addresses these concerns by simply applying intelligence to power management. A policy-based approach means a server can be scheduled for shutdown each evening (say, at 7PM), but be evaluated before shutdown against a set of policies that determine whether it is actually OK to complete the shut down.

Some example policies might be:

  • Are certain processes running that indicate a development/build/test task is still underway?
  • Is a specific user account logged in to the system right now?
  • Has disk activity been extremely low for the last four hours?
  • Did the owner of the server or one of his/her designated colleagues “opt-out” of the scheduled shutdown for that evening?

Once these policies are evaluated, we can see if the server meets the criteria to be shut down as requested. If not, keep it running. Such a system needs to also provide interfaces for both the data center administrators and the individual server owners/users to control the power state of their systems at all times, set policies and monitor power activities for managed servers.
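As a rough illustration of a scheduled shutdown that is checked against policies first, consider the sketch below. The 7PM schedule and the four example policies come from the discussion above; everything else (the `ServerState` fields, the process and user names, the disk activity threshold) is an assumption invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import List, Set

SHUTDOWN_TIME = time(19, 0)                      # the 7PM schedule from the text
BUSY_PROCESSES = {"make", "ant", "test_runner"}  # hypothetical build/test process names
PROTECTED_USERS = {"buildmaster"}                # hypothetical account that blocks shutdown
DISK_IDLE_THRESHOLD = 100                        # hypothetical I/O count over four hours


@dataclass
class ServerState:
    """Snapshot of one server, as a monitoring layer might report it."""
    name: str
    running_processes: Set[str]
    logged_in_users: Set[str]
    disk_ops_last_4h: int
    opted_out: bool = False


def reasons_to_stay_up(server: ServerState) -> List[str]:
    """Evaluate each policy and collect every reason the server should keep running."""
    reasons = []
    if server.running_processes & BUSY_PROCESSES:
        reasons.append("build or test process running")
    if server.logged_in_users & PROTECTED_USERS:
        reasons.append("protected user logged in")
    if server.disk_ops_last_4h >= DISK_IDLE_THRESHOLD:
        reasons.append("recent disk activity")
    if server.opted_out:
        reasons.append("owner opted out")
    return reasons


def should_shut_down(server: ServerState, now: datetime) -> bool:
    """Shut down only after the scheduled time and only if no policy objects."""
    return now.time() >= SHUTDOWN_TIME and not reasons_to_stay_up(server)


evening = datetime(2007, 9, 4, 19, 30)
lab1 = ServerState("lab1", {"sshd"}, set(), disk_ops_last_4h=3)
lab2 = ServerState("lab2", {"sshd", "make"}, {"alice"}, disk_ops_last_4h=5000)
for server in (lab1, lab2):
    print(server.name, should_shut_down(server, evening), reasons_to_stay_up(server))
```

Returning the list of reasons, rather than a bare yes or no, is a deliberate choice: it gives server owners something to look at when they ask why their box was (or wasn’t) powered off.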

I’ll talk more about this in the coming week, but I welcome your input. Would you shut down servers in your lab? Your grid environment? Your production environment? What are your concerns with this approach? What policies come to mind that would be simple and/or difficult to implement?

SLAuto vs. SLA

A while back, Eric Westerkamp over at eCloudM asked a simple question that got me thinking. Eric wondered aloud (in print?) whether “using the Term Service Level Automation (SLA) causes confusion when presenting the ideas and topics into the business community. I have most often seen SLA refer to Service Level Agreements. While similar in concept, they are very different in implementation.”

So, to keep things clear in my blog, I will now use the acronym SLAuto for Service Level Automation, and retain the SLA moniker for Service Level Agreements. I hope this eliminates confusion and allows the market to talk more freely about Service Level Automation.

Speaking of SLA, though, Steve Jones posted a great example of the symbiotic relationship between the customer and the service provider in the SLA equation. To put it in the context of an enterprise data center, you could offer 100% up time, .1 sec response time and a 5 minute turnaround time, but it wouldn’t be of any value if the customer’s application was buggy, they were on a dial-up network and it took them six weeks to get the requirements right for a build-out.

Now, let’s look at that in the context of SLAuto. To my eye, the service provider in an SLAuto environment is the infrastructure. The customer is any component or person that accesses or depends on any piece of that infrastructure. Thus, any SOA service can be a service provider in one context, and a customer in another. Even the policy engine(s) that automate the infrastructure can be thought of as a customer in the context of monitoring and management, and a service provider in terms of an interface for other customers to define service level parameters.

Steve’s example hints that I could buy the “bestest”, fastest, coolest high tech servers, switches and storage on the planet and it wouldn’t increase my service levels if I couldn’t deliver that infrastructure to the applications that required it quickly, efficiently and (most importantly) reliably. Or, for that matter, if those applications couldn’t take advantage of it. So, if you’re going to automate, your policy engine should be (you guessed it) quick, efficient and reliable. If it isn’t, then your SLAs are limited by your SLAuto capabilities.

Not what you intended, I would think…

Service Level Automation Deconstructed: Analyzing Service Levels

This is the second post in my series providing a brief overview of the three critical assumptions of a Service Level Automation environment. Today I want to focus on the ways in which the metrics gathered from the “measure” capabilities of an SLA environment are evaluated to determine whether action should be taken by the “response” capabilities, and if so, which action.

Let me first acknowledge that my discussion of the measure capabilities included some analysis of simple metrics to create complex metrics. This is one piece of the analysis puzzle, and is a critical one to acknowledge. Ideally, all software and hardware systems would be designed to intelligently communicate the metrics that matter most to determine service levels. Where this consolidation occurs depends on the requirements of the environment:

  • Centralized approach: Gather fundamental data from target systems to central metrics processor and consolidate metrics there. The advantage here is having one place to maintain consolidation rules. The disadvantage is increased network traffic.
  • Decentralized approach: Gather fundamental data and do any analysis necessary to consolidate it into a simplified composite metric on the target system itself. Send only the composite metric to the core rules engine (which I will discuss next).

Metrics consolidation is not really the core analytics function of a Service Level Automation architecture, however. The key functions are actually the following:

  • Are metrics being received as expected? (A negative response would likely indicate a failure in the target component or the communication chain with that component)
  • Are the metrics within the business and IT service level goals set for that metric?
  • If metrics are outside of established service level goals, what response should be taken by the SLA environment?
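The three questions above map fairly directly onto a small decision function. The sketch below is only an illustration under my own assumptions: the `Goal` fields, the staleness limit and the response strings are hypothetical, and a real rules engine would look at patterns over time rather than a single reading.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Goal:
    """Hypothetical service level goal for one metric."""
    low: float        # below this, the goal is violated on the low side
    high: float       # above this, the goal is violated on the high side
    max_age_s: float  # how old a reading may be before it counts as missing


def analyze(goal: Goal, value: Optional[float], age_s: float) -> str:
    """Answer the three questions in order and return a response to trigger."""
    # 1. Are metrics being received as expected?
    if value is None or age_s > goal.max_age_s:
        return "investigate: metric missing or stale"
    # 2. Is the metric within its service level goals?
    if goal.low <= value <= goal.high:
        return "no action"
    # 3. If not, which response should be taken?
    return "add capacity" if value > goal.high else "release capacity"


response_time_goal = Goal(low=0.05, high=0.5, max_age_s=60)
print(analyze(response_time_goal, 0.2, age_s=10))
print(analyze(response_time_goal, 0.9, age_s=10))
print(analyze(response_time_goal, None, age_s=10))
```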

Given my recent reading into complex event processing (CEP), this seems like at least a specialized form of event processing to me. The analysis capabilities of an SLA environment must constantly monitor the incoming metrics data stream, look for patterns of interest (namely goal violations, but who knows…) and trigger a response when conditions dictate.

The great thing about this particular EP problem is that well designed solutions can be replicated to all data centers using similar metrics and response mechanisms (e.g. power controllers, OSes, switch interfaces, etc.). Since there are actually relatively few components in the data center stack to be managed (servers [physical, virtual, application, etc.], network and storage), the rule set required to provide basic SLA capabilities is replicable across a wide variety of customer environments.

(That’s not to say the rule set is simple…it’s actually quite complex, and can be affected by new types of measurement and new technologies to be managed. Buy is definitely preferred over build in this space, but some customizability is always necessary.)

Finally, I’d also like to point out that there is a similar analysis function at the response end as at the measure end. Namely, it is often desirable for the response mechanism to take a composite action request and break it into discrete execution steps. The best example I can think of for this is a “power down” action sent from the SLA analysis environment to a server. Typical power controllers will take such a request, signal to the OS that a shutdown is imminent, whereupon the OS will execute any scripts and actions required before signalling that OS shutdown is complete. At that time, the power controller turns off power to the server.
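The “power down” example can be sketched as a composite action broken into discrete steps. This is a simplified model, not how any particular power controller works: the three callables and the `FakeServer` class are hypothetical, and a real implementation would wait between polls rather than loop immediately.

```python
from typing import Callable, List


def power_down(
    signal_os_shutdown: Callable[[], None],
    os_has_halted: Callable[[], bool],
    cut_power: Callable[[], None],
    max_polls: int = 5,
) -> List[str]:
    """Break one 'power down' request into ordered steps and log each one."""
    log = ["os shutdown signalled"]
    signal_os_shutdown()
    for _ in range(max_polls):
        if os_has_halted():
            cut_power()
            log.append("power off")
            return log
    # The OS never confirmed shutdown, so leave power on and report it.
    log.append("timed out; power left on")
    return log


class FakeServer:
    """Stand-in for a real server so the sketch can run anywhere."""

    def __init__(self, polls_until_halted: int) -> None:
        self.remaining = polls_until_halted
        self.powered = True

    def signal(self) -> None:
        pass  # a real driver would ask the OS to run its shutdown scripts

    def halted(self) -> bool:
        self.remaining -= 1
        return self.remaining <= 0

    def cut(self) -> None:
        self.powered = False


server = FakeServer(polls_until_halted=2)
print(power_down(server.signal, server.halted, server.cut))
```

Power is only cut after the OS reports that shutdown is complete, which is the ordering described above; the timeout branch is my own addition to keep the sketch from looping forever.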

As with measure, I will use the label “analyze” to reflect future posts expanding on the analysis concept. As always, I welcome your feedback and hope you will join the SLA conversation.

Two articles mentioning Service Level Automation

I recently set up Google Alerts to inform me about references to Service Level Automation on the web. There were many articles returned this week (many of which involved Cassatt), but I found two additional articles of note. Each makes mention of Service Level Automation, and represents the growing interest in this approach.

The first is from the March 2006 issue of ACM Queue, entitled “Under New Management“. The article was written by Duncan Johnston-Watt, the founder of Enigmatec. Johnston-Watt does an excellent job of outlining basic issues around one possible architecture for an autonomic data center. As expected for Enigmatec, it’s a policy automation focused approach, and is, in fact, one of the few articles from a policy engine vendor that I have seen where the term Service Level Automation is used correctly.

Unfortunately, I don’t necessarily agree that Johnston-Watt’s architecture is optimal for enterprise data centers. (It requires development of process automation flows to “optimize operational processes”–a significant amount of work that is prone to introducing new inefficiencies. It is also agent based, which alters the footprint of the software stacks being run in the data center, and can negatively affect the execution and architecture of the applications being managed.) All in all, though, there is some excellent information here for those thinking about Service Level Automation holistically, across the entire data center.

The other article, entitled “Virtually Speaking: Xen Achieves Higher Enterprise Consciousness”, was published April 6, 2007 on ServerWatch. The last few paragraphs of the article cover uXcomm’s acquisition of Virtugo. In it, uXcomm claims the combination of their Xen management tools and Virtugo’s VMware tools “fills a gap not just in uXcomm’s portfolio but in the virtual landscape as well. Until now […] there was a gap between service-level automation offerings and performance management products.”

Hmmm. Not sure how providing SLA for only virtual servers counts as filling gaps…but, even so, I hope uXcomm is aware that everyone in this space realizes the need for resource optimization includes VM performance management. I guess my question would be, what is uXcomm doing about marrying Service Level Automation to the rest of the data center?

As a side note, I know that I owe two more articles on my Service Level Automation Deconstructed series. I am working on the “Analyze” overview now, but have discovered some interesting technology to discuss here that I am reading up on now.