Archive for the ‘respond’ Category

Elements of a Cloud Oriented Architecture

In my post, The Principles of Cloud Oriented Architectures, I introduced you to the concept of a software system architecture designed with “the cloud” in mind:

“…I offer you a series of posts…describing in depth my research into what it takes to deliver a systems architecture with the following traits:

  1. It partially or entirely incorporates the clouds for at least one layer of the Infrastructure/Platform/Application stack.
  2. Is focused on consumers of cloud technologies, not the requirements of those delivering cloud infrastructures, either public or private (or even dark).
  3. Takes into account a variety of technical, economic and even political factors that systems running in the “cloud” must take into account.
  4. Is focused at least as much on the operational aspects of the system as the design and development aspects

The idea here is not to introduce an entirely new paradigm–that’s the last thing we need given the complexity of the task ahead of us. Nor is it to replace the basic principles of SOA or any other software architecture. Rather, the focus of this series is on how to best prepare for the new set of requirements before us.”

I followed that up with a post (well, two really) that set out to define what our expectations of “the cloud” ought to be. The idea behind the Cloud Computing Bill of Rights was not to lay out a policy platform–though I am flattered that some would like to use it as the basis of one–but rather to set out some guidelines about what cloud computing customers should anticipate as they plan their architectures. In this continuing “COA principles” series, I intend to lay out what can be done to leverage what vendors deliver, and to design around what they fail to deliver.

With that basic framework laid out, the next step is to break down which technology elements need to be considered when engineering for the cloud. This post will simply list those elements as I understand them today (feel free to use the comments below to add your own insights); future posts will provide a more thorough analysis of individual elements and/or related groups of elements. The series is really very “stream of consciousness”, so don’t expect too much structure or continuity.

When considering what elements matter in a Cloud Oriented Architecture, consider first that we are talking about distributed systems. Simply utilizing a hosted service to do your Customer Relationship Management doesn’t require an architecture; integrating it with your SAP billing systems does. As your SAP systems most likely don’t run in the same data centers as that service, the latter is a distributed systems problem.

Most distributed systems problems have just a few basic elements. For example:

  • Distribution of responsibilities among component parts

  • Dependency management between those component parts

  • Scalability and reliability

    • Of the system as a whole
    • Of each component
  • Data Access and Management

  • Communication and Networking

  • Monitoring and Systems Management

However, because cloud computing involves leveraging services and systems entirely outside of the architect’s control, several additional issues must be considered. Again, for example:

  • How are the responsibilities of a complex distributed system best managed when the services being consumed are relatively fixed in the tasks they can perform?

  • How are the cloud customer’s own SLA commitments best addressed when the ability to monitor and manage components of the system may be below the standards required for the task?

  • How are the economics of the cloud best leveraged?

    • How can a company gain the most work for the least amount of money?
    • How can a company leverage the cloud marketplace for not just cost savings, but also increased availability and system performance?

In an attempt to address the more cloud-specific distributed systems architecture issues, I’ve come up with the following list of elements to be addressed in a typical Cloud Oriented Architecture:

  • Service Fluidity – How does the system best allow for static redeployment and/or “live motion” of component pieces within and across hardware, facility and network boundaries? Specific issues to consider here include:

    • Distributed application architecture, or how is the system designed to manage component dependencies while allowing the system to dynamically find each component as required? (Hint: this problem has been studied thoroughly by such practices as SOA, EDA, etc.)
    • Network resiliency, or how does the system respond to changes in network location, including changes in IP addressing, routing and security?
  • Monitoring – How is the behavior and effectiveness of the system measured and tracked both to meet existing SLAs, as well as to allow developers to improve the overall system in the future? Issues to be considered here include:

    • Load monitoring, or how do you measure system load when system components are managed by multiple vendors with little or no formal agreement on how to share such data with the customer or each other?
    • Cost monitoring, or how does the customer get an accurate accounting of the costs associated with running the system from their point of view?
  • Management – How does the customer configure and maintain the overall system based on current and ongoing technical and business requirements? Examples of what needs to be considered here includes:

    • Cost, or what adjustments can be made to the system capacity or deployment to provide the required amount of service capacity at the lowest cost possible? This includes ways to manage the efficiency of computation, networking and storage.
    • Scalability, or how does the system itself allow changes to capacity to meet required workloads? These changes can happen:
      • vertically (e.g. get a bigger box for existing components–physically or virtually)
      • horizontally (e.g. add or remove additional instances of one or more components as required)
      • from a network latency perspective (adjust the ways in which the system accesses the network in order to increase overall system performance)
    • Availability, or how does the system react to failure of any one component, or of any group of components (e.g. when an entire vendor cloud goes offline)?
  • Compliance – How does the overall system meet organizational, industry and legislative regulatory requirements–again, despite being made up of components from a variety of vendors who may themselves provide computing in a variety of legal jurisdictions?
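As a toy illustration of the cost-monitoring concern above, here is a minimal sketch of aggregating per-vendor cost data into one customer-side view. The vendor names, fields, and report format are all hypothetical; real cloud providers expose billing data in very different ways.

```python
# Hedged sketch: roll up line-item costs from multiple (hypothetical)
# cloud vendors into a single customer-side accounting.
from collections import defaultdict

def aggregate_costs(reports):
    """Sum line-item costs per vendor, and return the overall total too."""
    per_vendor = defaultdict(float)
    for item in reports:
        per_vendor[item["vendor"]] += item["cost"]
    return dict(per_vendor), sum(per_vendor.values())

reports = [
    {"vendor": "compute-cloud-a", "cost": 120.50},
    {"vendor": "storage-cloud-b", "cost": 30.25},
    {"vendor": "compute-cloud-a", "cost": 9.25},
]
by_vendor, total = aggregate_costs(reports)
print(by_vendor, total)
```

The hard part in practice is not the arithmetic but getting the reports at all, given the lack of formal data-sharing agreements between vendors noted above.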

Now comes the fun of breaking these down a bit, and talking about specific technologies and practices that can address them. Please, give me your feedback (or write up your criticism on your own blog, but link here so I can find you). Point me towards references to other ways to think about the problem. I look forward to the conversation.


Cassatt Announces Active Response 5.1 with Demand-Based Policies

Ken Oestreich blogged recently about the very cool, probably landmark release of Cassatt that just became available, Cassatt Active Response 5.1. He very eloquently runs down the biggest feature–demand-based policies–so I won’t repeat all of that here. What I thought I would do instead is relate my personal thoughts on monitoring-based policies and why they are the key disruptive technology for data centers today.

To be sure, everyone is talking about server virtualization in the data center market today, and that’s fine. Its core short-term benefits, physical system consolidation and increased utilization, are key for cost-constrained IT departments, and features such as live motion and automatic backup are creating new opportunities that should be carefully considered. However, virtualization alone is limited in its applications, and does little to actually optimize a data center over time. (This is why VMware is emphasizing management over just virtualizing servers these days.)

The technology that will make the long term difference is resource optimization: applying automation technologies to tuning how and when physical and virtual infrastructure is used to solve specific business needs. It is the automation software that will really change the “deploy and babysit” culture of most data centers and labs today. The new description will be more like “deploy and ignore”.

To really optimize resource usage in real time, the automation software must use a combination of monitoring (aka “measure”), a policy engine or other logic system (aka “analyze”) and interfaces to the control systems of the equipment and software it is managing (aka “respond”). It turns out that the “respond” part of the equation is actually pretty straightforward–lots of work, but straightforward. Just write “driver”-like components that know how to talk to various data center equipment (e.g. Windows, DRAC, Cisco NX-OS, NetApp Data ONTAP, etc.), as well as handle error conditions by directly responding or forwarding the information to the policy engine.
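The measure/analyze/respond combination above can be sketched as a simple control loop. This is only an illustrative skeleton, assuming nothing about Cassatt’s actual internals; the monitor, policy, and responder here are stand-ins for SNMP/JMX monitors, a real policy engine, and equipment-specific “drivers”.

```python
# Hedged sketch of a measure/analyze/respond control step.
def control_step(measure, analyze, respond):
    """One pass of the loop: take a reading, decide, act (or do nothing)."""
    reading = measure()            # "measure": sample the environment
    action = analyze(reading)      # "analyze": policy engine decides
    if action is not None:
        respond(action)            # "respond": a driver carries it out
    return action

# Toy policy: power a server off when measured CPU load drops below 5%.
actions_taken = []
action = control_step(
    measure=lambda: {"cpu_load": 0.02},
    analyze=lambda r: "power_off" if r["cpu_load"] < 0.05 else None,
    respond=actions_taken.append,
)
print(action, actions_taken)
```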

The other two, however, require more immediate configuration by the end user. Measure and analyze, in fact, are where the entire set of Service Level Automation (SLAuto) parameters are defined and executed on. So, this is where the key user interface between the SLAuto system and end user has to happen.

What Cassatt has announced is a new user interface to define demand based policies as the end user sees fit. For example, what defines an idle server? Some systems use very little CPU while they wait for something to happen (at which point they get much busier), so simply measuring CPU isn’t good enough in those cases. Ditto for memory in systems that are compute intensive but handle very little state.

What Cassatt did that is so brilliant (and so unique) is to allow the end user to leverage the full range of SNMP attributes for their OS, as well as JMX and even scripts running on the monitored system, to create expressions that define an idle metric that is right for that system. For example, on a test system you may in fact say that a system is idle when the master test controller software indicates that no test is being run on that box. On another system, you may say it’s idle when no user accounts are currently active. It’s up to you to define when to attempt to shut down a box, or reduce capacity for a scale-out application.

Even when such an “idle” system is identified, Cassatt gives you the ability to go further and write some “spot checks” to make sure the system is actually OK to shut down. For example, in the aforementioned test system, Cassatt may determine that it’s worth trying to power down a box, but a spot check could find that a given process is still running, or that an administrator account is actively logged in, indicating to Cassatt that it should ignore that system for now.
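The two-stage check described above–an idle-metric expression, then spot checks that can veto the shutdown–might be sketched like this. The attribute names and checks are invented for illustration; Cassatt’s actual expression language is GUI-driven and far richer.

```python
# Hedged sketch: idle metric plus spot-check veto, with made-up attributes.
def is_idle(attrs):
    # Idle metric for a test box: no test running AND no active sessions.
    return attrs["tests_running"] == 0 and attrs["active_sessions"] == 0

def spot_checks_pass(attrs):
    # Veto shutdown if a build process survives or an admin is logged in.
    return not attrs["build_process_alive"] and not attrs["admin_logged_in"]

def ok_to_power_down(attrs):
    return is_idle(attrs) and spot_checks_pass(attrs)

box = {"tests_running": 0, "active_sessions": 0,
       "build_process_alive": False, "admin_logged_in": True}
print(ok_to_power_down(box))  # the admin session vetoes the shutdown
```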

I know of no one else that has this level of GUI configurable monitor/analyze/respond sophistication today. If anyone wants to challenge that, feel free. Now that I no longer work at Cassatt, I’d be happy to learn about (and write about) alternatives in the marketplace. Just remember that it has to be easy to configure and execute these policies, and scripting the policies themselves is not good enough.

It is clear from the rush to release resource optimization products for the cloud, such as RightScale, Scalr, and others, that this will be a key feature for distributed systems moving forward. In my opinion, Cassatt has launched itself into the lead spot for on-premises enterprise utility computing. I can’t wait to see who responds with the next great advancement.

Disclaimer: I am a Cassatt shareholder (or soon will be).

An easy way to get started with SLAuto

September 4, 2007

It’s been an interesting week, leading up to the Labor Day weekend, but as of this morning I get to talk more openly about one project that has been taking a great deal of my time. As I have blogged about Service Level Automation (“SLAuto”), it may have dawned on some of you that achieving nirvana here means changing a lot about your current architecture and practices.

For example, decoupling software from hardware is easy to say, but requires significant planning and execution to implement (though this can be simplified somewhat with the right platform). Building the correct monitors, policies and interfaces is also time intensive work that requires the correct platform for success. However, as noted before, the biggest barriers to implementing SLAuto and utility computing are cultural.

There is an opportunity out there right now to introduce SLAuto without all of the trappings of utility computing, especially the difficult decoupling of software from hardware. It is an opportunity that Silicon Valley is going ga-ga over, and it is a real problem with real dollar costs for every data center on the planet.

The opportunity is energy consumption management, aka the “green data center”.

Rather than pitch Cassatt’s solution directly, I prefer to talk about the technical opportunity as a whole. So let’s evaluate what is going on in the “GDC” space these days. As I see it, there are three basic technical approaches to “green” right now:

  1. More efficient equipment, e.g. more power efficient chips, server architectures, power distribution systems, etc.
  2. More efficient cooling, e.g. hot/cold aisles, liquid cooling, outside air systems, etc.
  3. Consolidation, e.g. virtualization, mainframes, etc.

Still, there is something obviously missing here: no matter which of these technologies you consider, not one of them is actually going to turn off unused capacity. In other words, while everyone is working to build a better light bulb or to design your lighting so you need fewer bulbs, no one is turning off the lights when no one is in the room.

That’s where SLAuto comes in. I contend that there are huge tracts of computing in any large enterprise where compute capacity runs idle for extended periods. Desktop systems are certainly one of the biggest offenders, as are grid computing environments that are not pushed to maximum capacity at all times. However, possibly the biggest offender in any organization that does in-house development, extensive packaged system customization or business system integration is the dev/test environment.

Imagine such a lab where capacity that will sit unused each evening/weekend, for all but two weeks of a typical development cycle, or at all times except when testing a patch to a three-year-old rev of a product, was shut down until needed. Turned off. Non-operational. Idle, but not idling.

Of course, most lab administrators probably feel extremely uncomfortable with this proposition. How are you going to do this without affecting developer/QA productivity? How do you know it’s OK to turn off a system? Why would my engineers even consider allowing their systems to be managed this way?

SLAuto addresses these concerns by simply applying intelligence to power management. A policy-based approach means a server can be scheduled for shutdown each evening (say, at 7PM), but be evaluated before shutdown against a set of policies that determine whether it is actually OK to complete the shut down.

Some example policies might be:

  • Are certain processes running that indicate a development/build/test task is still underway?
  • Is a specific user account logged in to the system right now?
  • Has disk activity been extremely low for the last four hours?
  • Did the owner of the server or one of his/her designated colleagues “opt-out” of the scheduled shutdown for that evening?

Once these policies are evaluated, we can see if the server meets the criteria to be shut down as requested. If not, keep it running. Such a system needs to also provide interfaces for both the data center administrators and the individual server owners/users to control the power state of their systems at all times, set policies and monitor power activities for managed servers.
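A policy-gated scheduled shutdown of this kind might look something like the following sketch. The policies mirror the bullet examples above, but the predicate names and server attributes are purely illustrative.

```python
# Hedged sketch: evaluate a set of veto policies before a scheduled shutdown.
def evaluate_shutdown(server, policies):
    """Return (ok, reasons): ok is True only if every policy permits it."""
    reasons = [name for name, permit in policies if not permit(server)]
    return (len(reasons) == 0, reasons)

policies = [
    ("no build/test tasks running", lambda s: not s["build_running"]),
    ("no designated users logged in", lambda s: not s["user_logged_in"]),
    ("disk quiet for four hours", lambda s: s["disk_idle_hours"] >= 4),
    ("owner did not opt out tonight", lambda s: not s["opted_out"]),
]

server = {"build_running": False, "user_logged_in": False,
          "disk_idle_hours": 6, "opted_out": True}
ok, reasons = evaluate_shutdown(server, policies)
print(ok, reasons)  # the opt-out policy keeps the server running
```

Returning the list of failing policies, rather than a bare boolean, gives the administrators and server owners mentioned above something to audit.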

I’ll talk more about this in the coming week, but I welcome your input. Would you shut down servers in your lab? Your grid environment? Your production environment? What are your concerns with this approach? What policies come to mind that would be simple and/or difficult to implement?

A Helping Hand Comes In Handy Sometimes

You may remember my recent post on how data centers resemble complex adaptive systems. This description of a data center has a glaring difference from a true definition of complex adaptive systems, however; data centers require some form of coordinated management beyond what any single entity can provide. In a truly complex adaptive system, there would be no “policy engines” or even Network Operations Centers. Each server, each switch, each disk farm would attempt to adapt to its surroundings, and either survive or die.

Therein lies the problem, however. Unlike a biological system, or the corporate economy, or even a human society, data centers cannot afford to have one of their individual entities (or “agents” in complex systems parlance) arbitrarily disappear from the computing environment. They certainly cannot rely on “trial and error” to determine what survives and what doesn’t. (Of course, in terms of human management of IT, this is often what happens, but never mind…)

Adam Smith called the force that guided selfish individuals to work together for the common benefit of the community the “invisible hand”. The metaphor is good for explaining how decentralized adaptive systems can organize for the greater good without a guiding force, but the invisible hand depends on the failure of those agents who don’t adapt.

Data centers, however, need a “visible hand” to quickly correct some (most?) agent failures. To automate and scale this, certain omnipotent and omnipresent management systems must be mixed into the data center ecology. These systems are responsible for maintaining the “life” of dying agents, particularly if the agents lose the ability to heal themselves.

Now, a topic for another post is the following: can several individual resource pools, each with their own policy engine, be joined together in a completely decentralized model?

Service Level Automation Deconstructed: Respond

For the third and last in my series breaking down the three key assumptions behind Service Level Automation, I would like to focus on how SLA environments can control data center configuration in response to service level goal violations. These goal violations and the high level actions to be taken are determined by the analysis capability of the environment. Details of how to accomplish those high level actions, however, are decided and executed by the response function.

Essentially, the response function of an SLA environment is very much like the driver set that your operating system uses to translate high-level actions (e.g. “store this file”) into device-specific actions (“Move head 32 steps to center, find block 4D5EF, etc.”). The responsibility here is to provide the interface between the SLA analysis engine and specific standard or proprietary interfaces to everything from server hardware to network switches to operating systems and middleware.

I see the following key interface points in today’s environments:

  • Power Controllers/MPDUs: Job 1 of a service level automation environment is providing the resources required to meet the needs of the software environment, and only those resources. Turn those servers on when they are needed, and off when they are not. This includes virtual server hosts. (Examples: DRAC, iLO, RSA II, MPDUs)
  • Operating Systems: Before you shut off that server, make sure you’ve “gently” shut down its software payload. Well-written server payloads for automated environments will both start up, acquiring initial state (if any), and shut down, preserving any necessary state, without human intervention. However, from a communications perspective, each action starts with the OS. (Examples: Red Hat, SuSE, MS Windows, Sun Solaris)
  • Middleware/Virtualization: It is interesting to note that many software payload components (e.g. an application server or a hypervisor) are both software to be managed, and computing capacity themselves. For example, an application server should be managed to specific service levels relating to its relationship with its host server (e.g. CPU utilization, thread counts, etc.), while also treated as a capacity resource for JavaEE applications and services. As such, these software containers should be managed for their own guest payloads much like a physical server would for the overall server payload. (Examples: BEA WebLogic, VMware ESX, XenSource XenEnterprise)
  • Layer 2 Networking: In order to use a server to meet an application’s needs, that server must have access to required networks. True automation requires that switch ports be reconfigured as necessary to ensure access to specifically the VLANs required by the payloads they will represent. (Examples: Cisco 3750, Extreme Summit400)
  • Network Attached Storage (NAS): The beauty of NAS devices is that they can be dynamically attached to a software payload at startup, without requiring any hardware configuration beyond the Layer 2 configuration described above. SAN is also useful (and common), but requires hardware configuration to make work. That complicates the role of automation. Part of the problem is the inconsistent remote configurability of fiber switches, which may be mitigated somewhat with iSCSI. However, NAS is quickly becoming the preferred storage mechanism in large data centers. (Examples: NetApp FAS, Adaptec Snap Server)
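The “driver” analogy above suggests a common response interface with device-specific implementations behind it. Here is a minimal sketch of that shape; the class and method names are invented, and real controllers (iLO, DRAC, switch CLIs) each have very different protocols.

```python
# Hedged sketch: a common response-driver interface with a fake
# device-specific implementation behind it.
from abc import ABC, abstractmethod

class ResponseDriver(ABC):
    """Translates a high-level SLA action into device-specific steps."""
    @abstractmethod
    def power_on(self, target: str) -> str: ...

    @abstractmethod
    def power_off(self, target: str) -> str: ...

class FakePowerController(ResponseDriver):
    """Stand-in for an MPDU/iLO-style controller, for illustration only."""
    def power_on(self, target):
        return f"outlet for {target}: ON"

    def power_off(self, target):
        return f"outlet for {target}: OFF"

driver = FakePowerController()
print(driver.power_off("web-server-03"))
```

The point of the abstraction is that the policy engine only ever sees `ResponseDriver`; adding support for a new switch or power controller means writing one more subclass, not changing the engine.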

Over time, I see the industry adding more and more “drivers” to manage more and more data center (and perhaps desktop) resources. Imagine a world in which each software and/or hardware vendor produced standard SLA drivers for each individual component that makes up your data center environment. Every switch, disk and server; every service, container and OS; even every light bulb and air conditioner is connected to a single service level policy engine in which business policy (including cost of operations) drives automated decisions about their use and disuse.

It’s not here yet, but you won’t have to wait long…

I will use the label “respond” to tag posts related to response interfaces.

Complexity and the Data Center

I just finished rereading a science book that has been tremendously influential on how I now think of software development, data center management and how people interact in general. Complexity: The Emerging Science at the Edge of Order and Chaos, by M. Mitchell Waldrop, was originally published in 1992, but remains today the quintessential popular tome on the science of complex systems. (Hint: READ THIS BOOK!)

John Holland (as told in Waldrop’s history) defined complex systems as having the following traits:

  • Each complex system is a network of many “agents” acting in parallel
  • Each complex system has many levels of organization, with agents at any one level serving as the building blocks for agents at a higher level
  • Complex systems are constantly revising and rearranging their building blocks as they gain experience
  • All complex adaptive systems anticipate the future (though this anticipation is usually mechanical and not conscious)
  • Complex adaptive systems have many niches, each of which can be exploited by an agent adapted to fill that niche

Now, I don’t know about you, but this sounds like enterprise computing to me. It could be servers, network components, software service networks, supply chain systems, the entire data center, the entire IT operations and organization, etc. What we are all building here is self organizing…we may think we have control, but we are all acting as agents in response to the actions and conditions imposed by all those other agents out there.

A good point about viewing IT as a complex system can be found in Johna Till Johnson‘s Network World article, “Complexity, crisis and corporate nets“. Johna’s article articulates a basic concept that I am still struggling to verbalize regarding the current and future evolution of data centers. We are all working hard to adapt to our environments by building architectures, organizations and processes that are resistant to failure. Unfortunately, the entire “ecosystem” is bound to fail from time to time. And there is no way to predict how or when. The best you can do is prepare for the worst.

One of the key reasons that I find Service Level Automation so interesting is that it provides a key “gene” to the increasingly complex IT landscape: the ability to “evolve” and “heal” the physical infrastructure level. Combine this with good, resilient software architectures (e.g. SOA and BPM) and solid feedback loops (e.g. BAM, SNMP, JMX, etc.) and your job as the human “DNA” gets easier. And, as the dynamic and automated nature of these systems gets more sophisticated, our IT environments get more and more self organizing, learning new ways to optimize themselves (often with human help) even as the environment they are adapting to constantly changes.

In the end, I like to think that no matter how many boneheaded decisions corporate IT makes, no matter how many lousy standards or products are introduced to the “ecosystem”, the entire system will adjust and continually attempt to correct for our weaknesses. In the end, despite the rise and fall of individual agents (companies, technologies, people, etc.), the system will continually work to serve us better…at least until that unpredictable catastrophic failure tears it all down and we start fresh.

Service Level Automation Deconstructed: Introduction

Service Level Automation starts with three simple premises:

* The factors contributing to software service quality can be measured electronically.

* Runtime targets indicating high quality of service can be defined for those measurements.

* Systems involved in delivering software functionality can be manipulated to keep those measurements within the runtime targets.

I think the support for each of these premises should be explored more deeply, so I plan to begin a little survey of the relevant technologies and academic work over the next few weeks. The idea is to get a good sense of what standards/technologies/concepts/etc. can be used to meet the requirements of each premise. I also hope to discuss how a system smart enough to take advantage of them can save a large data center money, both in direct costs and in losses due to service level failures.

Why Service Level Automation? I wrote about this earlier. However, as a quick reminder, think of service level automation as meeting this objective:

Delivering the quantity and quality of service flow required by the business using the minimum resources required to do so.

I’ve been quite busy both at work and at home, so I’m hoping to use this exercise as a way to increase my posting frequency. Stay tuned for more.