Archive for the ‘measure’ Category

Two Announcements to Pay Attention To This Week

November 12, 2008 Leave a comment

I know I promised a post on how the network fits into cloud computing, but after a wonderful first part of my week, spent catching up on reading and then one-on-one with my 4-year-old son, I’m finally digging into what’s happened in the last two days in the cloud-o-sphere. While the network post remains important to me, several announcements caught my eye, and I thought I’d run through two of them quickly and give you a sense of why they matter.

The first announcement came from Replicate Technologies, Rich Miller’s young company, which is focusing initially on virtualization configuration analysis. The Replicate Datacenter Analyzer (RDA) is a powerful analysis and management tool for evaluating the configuration and deployment of virtual machines in an enterprise data center environment. But it goes beyond just evaluating the VMs themselves, to evaluating the server, network and storage configuration required to support things like vMotion.

Sound boring, and perhaps not cloud related? Well, if you read Rich’s blog in depth, you may find that he has a very interesting longer-term objective. Building on the success of RDA, Replicate aims to become a core element of a virtualized data center operations platform, eventually including hybrid cloud configurations. And while the tool initially focuses on individual servers, one excellent piece of Rich’s vision is to manage the relationships between VMs, so that operations taken on one server take into account its dependencies on other servers. Very cool.

Watch the introductory video for the fastest impression of what Replicate has here. If you manage virtual machines, sit up and take notice.

The other announcement that caught my eye was the new positioning and features introduced by my alma mater, Cassatt Corporation this week. I’ve often argued that Cassatt is an excellent example of a private cloud infrastructure, and now they are actively promoting themselves as such (although they use the term “internal cloud”).

It’s about freaking time. With a mature, “burned in”, relatively technology agnostic platform that has perhaps the easiest policy management user experience ever (though not necessarily the prettiest), Cassatt has always been one of my favorite infrastructure plays (though I admit some bias). They support an incredible array of hardware, virtualization and OS platforms, and provide the rare ability to manage not only virtual machines, but also bare metal systems. You get automated power management, resource optimization, image management, and dynamic provisioning. For the latter, not only is server provisioning automated, but also network provisioning–such that deploying an image on a server triggers Cassatt to reprogram the ports that the target server is connected to so that they sit on the correct VLAN for the software about to be booted.

The announcement talks a lot about Cassatt’s monitoring capabilities, and a service they provide around application profiling. I haven’t been briefed on these, but given their experience with server power management (a very “profiling focused” activity), I believe they have some unique value-add there. What I remember from six months ago is that they introduced improved dynamic load allocation capabilities that could use just about any digital metric (technical or business oriented) to set upper and lower performance thresholds for scaling. Thus, you could use CPU utilization, transaction rates, user sessions or even market activity to determine the need for more or fewer servers for an application. Not many others break away from the easy CPU/memory utilization stats to drive scale.
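
To make that concrete, here is a minimal sketch (in Python, with entirely hypothetical metric names and thresholds, not Cassatt’s actual implementation) of that kind of threshold policy, where any numeric metric, technical or business oriented, can drive the scale decision:

```python
# Hypothetical illustration: a scaling policy driven by any numeric
# metric, not just CPU -- transaction rates, user sessions, even
# market activity.

def scaling_decision(metric_value, lower, upper):
    """Return +1 to add capacity, -1 to release it, 0 to hold steady."""
    if metric_value > upper:
        return +1   # demand exceeds the upper threshold: add servers
    if metric_value < lower:
        return -1   # demand below the lower threshold: release servers
    return 0

# The same engine works for anything you can read as a number:
policies = {
    "cpu_utilization":  (20.0, 80.0),    # percent
    "transactions_sec": (50.0, 500.0),   # business-oriented metric
    "user_sessions":    (100.0, 2000.0),
}

def evaluate(readings):
    """Map each observed metric to a scale action via its thresholds."""
    return {name: scaling_decision(value, *policies[name])
            for name, value in readings.items() if name in policies}
```

The point is that the policy engine never needs to know what the number means; it only needs thresholds, which is what makes business metrics as usable as CPU stats.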

Now, having said all those nice things, I have to take Cassatt to task for one thing. Throughout the press release, Cassatt talks about Amazon and Google like infrastructure. However, Cassatt is doing nothing to replicate the APIs of either Amazon (which would be logical) or Google (which would make no sense at all). In other words, as announced, Cassatt is building on their own proprietary protocols and interfaces, with no ties to any external clouds or alternative cloud platforms. This is not a very “commodity cloud computing” friendly approach, and obviously I would like to see that changed. And, the truth is, none of their direct competitors are doing so, either (with the possible exception of the academic research project, EUCALYPTUS).

The short of it is: if you are looking at building a private cloud, don’t overlook Cassatt.

There was another announcement from Hyperic that I want to comment on, but I’m due to chat with a Hyperic executive soon, so I’ll reserve that post for later. The fall of 2008 remains a heady time for cloud computing, so expect many more of these types of posts in the coming weeks.


Elements of a Cloud Oriented Architecture

In my post, The Principles of Cloud Oriented Architectures, I introduced you to the concept of a software system architecture designed with “the cloud” in mind:

“…I offer you a series of posts…describing in depth my research into what it takes to deliver a systems architecture with the following traits:

  1. It partially or entirely incorporates the clouds for at least one layer of the Infrastructure/Platform/Application stack.
  2. Is focused on consumers of cloud technologies, not the requirements of those delivering cloud infrastructures, either public or private (or even dark).
  3. Takes into account a variety of technical, economic and even political factors that systems running in the “cloud” must take into account.
  4. Is focused at least as much on the operational aspects of the system as the design and development aspects

The idea here is not to introduce an entirely new paradigm–that’s the last thing we need given the complexity of the task ahead of us. Nor is it to replace the basic principles of SOA or any other software architecture. Rather, the focus of this series is on how to best prepare for the new set of requirements before us.”

I followed that up with a post (well, two really) that set out to define what our expectations of “the cloud” ought to be. The idea behind the Cloud Computing Bill of Rights was not to lay out a policy platform–though I am flattered that some would like to use it as the basis of one— but rather to set out some guidelines about what cloud computing customers should anticipate in their architectures. In this continuing “COA principles” series, I intend to lay out what can be done to leverage what vendors deliver, and design around what they fail to deliver.

With that basic framework laid out, the next step is to break down which technology elements need to be considered when engineering for the cloud. This post covers only a list of such elements as I understand them today (feel free to use the comments below to add your own insights); future posts will provide a more thorough analysis of individual elements and/or related groups of elements. The series is very much “stream of consciousness”, so don’t expect too much structure or continuity.

When considering what elements matter in a Cloud Oriented Architecture, we consider first that we are talking about distributed systems. Simply utilizing a SaaS service to do your Customer Relationship Management doesn’t require an architecture; integrating it with your SAP billing systems does. As your SAP systems most likely don’t run in the SaaS vendor’s data centers, the latter is a distributed systems problem.

Most distributed systems problems have just a few basic elements. For example:

  • Distribution of responsibilities among component parts

  • Dependency management between those component parts

  • Scalability and reliability

    • Of the system as a whole
    • Of each component
  • Data Access and Management

  • Communication and Networking

  • Monitoring and Systems Management

However, because cloud computing involves leveraging services and systems entirely outside of the architect’s control, several additional issues must be considered. Again, for example:

  • How are the responsibilities of a complex distributed system best managed when the services being consumed are relatively fixed in the tasks they can perform?

  • How are the cloud customer’s own SLA commitments best addressed when the ability to monitor and manage components of the system may be below the standards required for the task?

  • How are the economics of the cloud best leveraged?

    • How can a company gain the most work for the least amount of money?
    • How can a company leverage the cloud marketplace for not just cost savings, but also increased availability and system performance?

In an attempt to address the more cloud-specific distributed systems architecture issues, I’ve come up with the following list of elements to be addressed in a typical Cloud Oriented Architecture:

  • Service Fluidity – How does the system best allow for static redeployment and/or “live motion” of component pieces within and across hardware, facility and network boundaries? Specific issues to consider here include:

    • Distributed application architecture, or how is the system designed to manage component dependencies while allowing the system to dynamically find each component as required? (Hint: this problem has been studied thoroughly by such practices as SOA, EDA, etc.)
    • Network resiliency, or how does the system respond to changes in network location, including changes in IP addressing, routing and security?
  • Monitoring – How is the behavior and effectiveness of the system measured and tracked both to meet existing SLAs, as well as to allow developers to improve the overall system in the future? Issues to be considered here include:

    • Load monitoring, or how do you measure system load when system components are managed by multiple vendors, with little or no formal agreement on how to share such data with the customer or with each other?
    • Cost monitoring, or how does the customer get an accurate accounting of the costs associated with running the system from their point of view?
  • Management – How does the customer configure and maintain the overall system based on current and ongoing technical and business requirements? Examples of what needs to be considered here include:

    • Cost, or what adjustments can be made to the system capacity or deployment to provide the required amount of service capacity at the lowest cost possible? This includes ways to manage the efficiency of computation, networking and storage.
    • Scalability, or how does the system itself allow changes to capacity to meet required workloads? These changes can happen:
      • vertically (e.g. get a bigger box for existing components–physically or virtually)
      • horizontally (e.g. add or remove additional instances of one or more components as required)
      • from a network latency perspective (adjust the ways in which the system accesses the network in order to increase overall system performance)
    • Availability, or how does the system react to the failure of any one component, or of any group of components (e.g. when an entire vendor cloud goes offline)?
  • Compliance – How does the overall system meet organizational, industry and legislative regulatory requirements–again, despite being made up of components from a variety of vendors who may themselves provide computing in a variety of legal jurisdictions?
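
As a taste of what addressing one of these elements might look like, here is a minimal sketch (hypothetical names throughout; this is an illustration, not a prescription) of the availability concern above: prefer a primary vendor cloud, but fail over when it goes dark.

```python
# Sketch of cross-vendor failover -- one way to address the
# "entire vendor cloud goes offline" scenario noted above.

class CloudEndpoint:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def is_healthy(self):
        # In practice this would be an independent, out-of-band health
        # probe, since the vendor's own status reporting may itself be
        # unreachable during an outage.
        return self.healthy

def pick_active(endpoints):
    """Return the first healthy vendor, preserving preference order."""
    for ep in endpoints:
        if ep.is_healthy():
            return ep
    raise RuntimeError("all vendor clouds are offline")
```

The interesting architectural work is everything the sketch hides: keeping data and images portable enough that the secondary vendor can actually take the load.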

Now comes the fun of breaking these down a bit, and talking about specific technologies and practices that can address them. Please, give me your feedback (or write up your criticism on your own blog, but link here so I can find you). Point me towards references to other ways to think about the problem. I look forward to the conversation.

Update: The Cloud Computing Bill of Rights

Thanks to all that provided input on the first revision of the Cloud Computing Bill of Rights. The feedback has been incredible, including several eye opening references, and some basic concepts that were missed the first time through. An updated “CCBOR” is below, but I first want to directly outline the changes, and credit those that provided input.

  1. Abhishek Kumar points out that government interference in data privacy and security rights needs to be explicitly acknowledged. I hear him loud and clear, though I think the customer can expect only that laws will remain within the constitutional (or doctrinal) bounds of their particular government, and that government retains the right to create law as it deems necessary within those parameters.

    What must also be acknowledged, however, is that customers have the right to know exactly what laws are in force for the cloud systems they choose to use. Does this mean that vendors should hire civil rights lawyers, or that the customer is on their own to figure that out? I honestly don’t know.

  2. Peter Laird’s “The Good, Bad, and the Ugly of SaaS Terms of Service, Licenses, and Contracts” is a must read when it comes to data rights. It finds for enterprises what NPR observed the other night for individuals: that you have very few data privacy rights right now, that your provider probably has explicit provisions protecting itself while exposing you or your organization, and that the cloud exposes risks that enterprises avoid by owning their own clouds.

    This reinforces the notion that we must understand that privacy is not guaranteed in the cloud, no matter what your provider says. As Laird puts it:

    “…[A] customer should have an explicit and absolute right to data ownership regardless of how a contract is terminated.”

  3. Ian Osbourne asks “should there be a right to know where the data will be stored, and potentially a service level requirement to limit host countries?” I say absolutely! It will be impossible for customers to obey laws globally unless data is maintained in known jurisdictions. This was the catalyst for the “Follow the Law Computing” post. Good catch!

  4. John Marsh of GeekPAC links to his own emerging attempt at a Bill of Rights. In it, he points out a critical concept that I missed:

    “[Vendors] may not terminate [customer] account[s] for political statements, inappropriate language, statements of sexual nature, religious commentary, or statements critical of [the vendor’s] service, with exceptions for specific laws, eg. hate speech, where they apply.”

    Bravo, and noted.

  5. Unfortunately, the federal courts have handed down a series of rulings that challenge the ability of global citizens and businesses to do business securely and privately in the cloud. This Bill of Rights is already under grave attack.

Below is the complete text of the second revision of the Cloud Computing Bill of Rights. Let’s call the first “CCBOR 0.1” and this one “CCBOR 0.2”. I’ll update the original post to reflect the versioning.

One last note. My intention in presenting this post was not to become the authority on cloud computing consumer rights. It is, rather, a cornerstone of my Cloud Oriented Architecture discussion, from which I need to move on to the next point. I’m working on setting up a wiki for this “document”. Is there anyone out there who would particularly like to host it?

The Cloud Computing Bill of Rights (0.2)

In the course of technical history, there exist few critical innovations that forever change the way technical economies operate; forever changing the expectations that customers and vendors have of each other, and the architectures on which both rely for commerce. We, the parties entering into a new era driven by one such innovation–that of network based services, platforms and applications, known at the writing of this document as “cloud computing”–do hereby avow the following (mostly) inalienable rights:

  • Article I: Customers Own Their Data

    1. No vendor shall, in the course of its relationship with any customer, claim ownership of any data uploaded, created, generated, modified, hosted or in any other way associated with the customer’s intellectual property, engineering effort or media creativity. This also includes account configuration data, customer generated tags and categories, usage and traffic metrics, and any other form of analytics or metadata collection.

      Customer data is understood to include all data directly maintained by the customer, as well as that of the customer’s own customers. It is also understood to include all source code and data related to configuring and operating software directly developed by the customer, except for data expressly owned by the underlying infrastructure or platform provided by the vendor.

    2. Vendors shall always provide, at a minimum, API level access to all customer data as described above. This API level access will allow the customer to write software which, when executed against the API, allows access to any customer maintained data, either in bulk or record-by-record as needed. As standards and protocols are defined that allow for bulk or real-time movement of data between cloud vendors, each vendor will endeavor to implement such technologies, and will not attempt to stall such implementation in an attempt to lock in its customers.

    3. Customers own their data, which in turn means they own responsibility for the data’s security and adherence to privacy laws and agreements. As with monitoring and data access APIs, vendors will endeavor to provide customers with the tools and services they need to meet their own customers’ expectations. However, customers are responsible for determining a vendor’s relevancy to specific requirements, and to provide backstops, auditing and even indemnification as required by agreements with their own customers.

      Ultimately, however, governments are responsible for the regulatory environments that define the limits of security and privacy laws. As governments can choose any legal requirement that works within the constraints of their own constitutions or doctrines, customers must be aware of what may or may not happen to their data in the jurisdictions in which data resides, is processed or is referenced. As constitutions vary from country to country, it may not even be required for governments to inform customers what specific actions are taken with or against their data. That laws exist that could put their data in jeopardy, however, is the minimum that governments must convey to the market.

      Customers (and their customers) must leverage the legislative mechanisms of any jurisdiction of concern to change those parameters.

      In order for enough trust to be built into the online cloud economy, however, governments should endeavor to build a legal framework that respects corporate and individual privacy, and overall data security. While national security is important, governments must be careful not to create an atmosphere in which the customers and vendors of the cloud distrust their ability to securely conduct business within the jurisdiction, either directly or indirectly.

    4. Because regulatory effects weigh so heavily on data usage, security and privacy, vendors shall, at a minimum, inform customers specifically where their data is housed. A better option would be to provide mechanisms by which users can choose where their data will be stored. Either way, vendors should also endeavor to work with customers to assure that their systems designs do not conflict with known legal or regulatory obstacles. This is assumed to apply to primary, backup and archived data instances.
  • Article II: Vendors and Customers Jointly Own System Service Levels

    1. Vendors own, and shall do everything in their power to meet, service level targets committed to with any given customer. All required effort and expense necessary to meet those explicit service levels will be spent freely and without additional expense to the customer. While the specific legally binding contracts or business agreements will spell out these requirements, it is noted here that these service level agreements are entered into expressly to protect both the customer’s and vendor’s business interests, and all decisions by the vendor will take both parties equally into account.

      Where no explicit service level agreement exists with a customer, the vendor will endeavor to meet any expressed service level targets provided in marketing literature or the like. At no time will it be acceptable for a vendor to declare a high level of service at a base price, only to later indicate that that level of service is only available at a higher premium price.

      It is perfectly acceptable, however, for a vendor to expressly sell a higher level of service at a higher price, as long as they make that clear at all points where a customer may evaluate or purchase the service.

    2. Ultimately, though, customers own their service level commitments to their own internal or external customers, and the customer understands that it is their responsibility to take into account possible failures by each vendor that they do business with.

      Customers relying on a single vendor to meet their own service level commitments enter into an implicit agreement to tie their own service level commitments to the vendor’s, and to live and die by the vendor’s own infrastructure reliability. Those customers who take their own commitments seriously will seek to build or obtain independent monitoring, failure recovery and disaster recovery systems.

    3. Where customer/vendor system integration is necessary, the vendor must offer options for monitoring the viability of that integration at as many architectural levels as required to allow the customer to meet their own service level commitments. Where standards exist for such monitoring, the vendor will implement those standards in a timely and complete fashion. The vendor should not underestimate the importance of this monitoring to the customer’s own business commitments.

    4. Under no circumstances will vendors terminate customer accounts for political statements, inappropriate language, statements of sexual nature, religious commentary, or statements critical of the vendor’s service, with exceptions for specific laws, e.g. hate speech, where they apply.
  • Article III: Vendors Own Their Interfaces

    1. Vendors are under no obligation to provide “open” or “standard” interfaces, other than as described above for data access and monitoring. APIs for modifying user experience, frameworks for building extensions or even complete applications for the vendor platform, or such technologies can be developed however the vendor sees fit. If a vendor chooses to require developers to write applications in a custom programming language with esoteric data storage algorithms and heavily piracy protected execution systems, so be it.

      If it seems that this completely abdicates the customer’s power in the business relationship, this is not so. As the “cloud” is a marketplace of technology infrastructures, platforms and applications, the customer exercises their power by choosing where to spend their hard earned money. A decision to select a platform vendor that locks you into proprietary Python libraries, for instance, is a choice to support such programming lock-in. On the other hand, insistence on portable virtual machine formats will drive the market towards a true commodity compute capacity model.

      The key reason for giving vendors such power is to maximize innovation. By restricting how technology gets developed or released, the market risks restricting the ways in which technologists can innovate. History shows that eventually the “open” market catches up to most innovations (or bypasses them altogether), and the pace at which this happens is greatly accelerated by open source. Nonetheless, forcing innovation through open source or any other single method runs the risk of weakening capitalist entrepreneurial risk taking.

    2. The customer, however, has the right to use any method legally possible to extend, replicate, leverage or better any given vendor technology. If a vendor provides a proprietary API for virtual machine management in their cloud, customers (aka “the community”, in this case) have every right to experiment with “home grown” implementations of alternative technologies using that same API. This is also true for replicating cloud platform functionality, or even complete applications–though, again, the right only extends to legal means.

      Possibly the best thing a cloud vendor can do to extend their community, and to encourage innovation on their platform from community members, is to open their platform as much as possible. By making themselves the “reference platform” for their respective market space, an open vendor creates a petri dish of sorts for cultivating differentiating features and successes on their platform. Protective, proprietary vendors are on their own.

These three articles serve as the baseline for customer, vendor and, as necessary, government relationships in the new network-based computing marketplace. No claim is made that this document is complete, or final. These articles may be changed or extended at any time, and additional articles can be declared, whether in response to new technologies or business models, or simply to reflect the business reality of the marketplace. It is also a community document, and others are encouraged to bend and shape it in their own venues.

Cassatt Announces Active Response 5.1 with Demand Base Policies

Ken Oestreich blogged recently about the very cool, probably landmark release of Cassatt that just became available, Cassatt Active Response 5.1. He very eloquently runs down the biggest feature–demand based policies–so I won’t repeat all of that here. What I thought I would do instead is relate my personal thoughts on monitoring based policies and how they are the key disruptive technology for data centers today.

To be sure, everyone is talking about server virtualization in the data center market today, and that’s fine. Its core short-term benefits, physical system consolidation and increased utilization, are key for cost-constrained IT departments, and features such as live motion and automatic backup are creating new opportunities that should be carefully considered. However, virtualization alone is limited in its applications, and does little to actually optimize a data center over time. (This is why VMware is emphasizing management over just virtualizing servers these days.)

The technology that will make the long term difference is resource optimization: applying automation technologies to tuning how and when physical and virtual infrastructure is used to solve specific business needs. It is the automation software that will really change the “deploy and babysit” culture of most data centers and labs today. The new description will be more like “deploy and ignore”.

To really optimize resource usage in real time, the automation software must use a combination of monitoring (aka “measure”), a policy engine or other logic system (aka “analyze”) and interfaces to the control systems of the equipment and software it is managing (aka “respond”). It turns out that the “respond” part of the equation is actually pretty straightforward–lots of work, but straightforward. Just write “driver”-like components that know how to talk to various data center equipment (e.g. Windows, DRAC, Cisco NX-OS, NetApp Data ONTAP, etc.), as well as handle error conditions by directly responding or forwarding the information to the policy engine.
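
A minimal sketch of that measure/analyze/respond loop might look like the following (Python, with hypothetical collector, policy and driver names; a real SLAuto engine would plug SNMP/JMX collectors and equipment-specific drivers into these slots):

```python
# One pass of a hypothetical measure/analyze/respond control loop.

def control_loop(collectors, policies, drivers):
    # measure: poll every monitoring source for a current reading
    readings = {name: collect() for name, collect in collectors.items()}
    # analyze: each policy turns a reading into an action, or None
    actions = [policies[name](value)
               for name, value in readings.items() if name in policies]
    # respond: route each action to the driver for its target device
    for action in actions:
        if action is not None:
            drivers[action["target"]](action["command"])
    return [a for a in actions if a is not None]
```

The separation matters: collectors and drivers are commodity plumbing, while the policies in the middle are where the end user’s service level knowledge lives.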

The other two, however, require more immediate configuration by the end user. Measure and analyze, in fact, are where the entire set of Service Level Automation (SLAuto) parameters are defined and executed on. So, this is where the key user interface between the SLAuto system and end user has to happen.

What Cassatt has announced is a new user interface to define demand based policies as the end user sees fit. For example, what defines an idle server? Some systems use very little CPU while they wait for something to happen (at which point they get much busier), so simply measuring CPU isn’t good enough in those cases. Ditto for memory in systems that are compute intensive but handle very little state.

What Cassatt did that is so brilliant (and so unique) is to allow the end user to leverage the full range of SNMP attributes for their OS, as well as JMX and even scripts running on the monitored system, to create expressions that define an idle metric that is right for that system. For example, on a test system you may say that a system is idle when the master test controller software indicates that no test is being run on that box. On another system, you may say it’s idle when no user accounts are currently active. It’s up to you to define when to attempt to shut down a box, or reduce capacity for a scale-out application.

Even when such an “idle” system is identified, Cassatt gives you the ability to go further and write “spot checks” to make sure the system is actually OK to shut down. For example, in the aforementioned test system, Cassatt may determine that it’s worth trying to power down a box, but a spot check could determine whether a given process is still running, or whether an administrator account is actively logged in, indicating to Cassatt that it should leave that system alone for now.
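
The idle-expression-plus-spot-check pattern can be sketched like this (all metric names and checks are hypothetical, standing in for the SNMP/JMX attributes and scripts an administrator would actually wire up):

```python
def is_idle(metrics):
    # A user-defined idle expression: this hypothetical test box is idle
    # when the test controller reports no running tests AND no user
    # accounts are active -- not merely when CPU is low.
    return metrics["tests_running"] == 0 and metrics["active_users"] == 0

def spot_checks_pass(checks):
    # Last-minute sanity checks; any single failure vetoes the shutdown.
    return all(check() for check in checks)

def should_power_down(metrics, checks):
    return is_idle(metrics) and spot_checks_pass(checks)
```

Note the two-stage shape: the idle expression nominates a candidate, and the spot checks get the final veto.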

I know of no one else that has this level of GUI configurable monitor/analyze/respond sophistication today. If anyone wants to challenge that, feel free. Now that I no longer work at Cassatt, I’d be happy to learn about (and write about) alternatives in the marketplace. Just remember that it has to be easy to configure and execute these policies, and scripting the policies themselves is not good enough.

It is clear from the rush to release resource optimization products for the cloud, such as RightScale, Scalr, and others, that this will be a key feature for distributed systems moving forward. In my opinion, Cassatt has launched itself into the lead spot for on premises enterprise utility computing. I can’t wait to see who responds with the next great advancement.

Disclaimer: I am a Cassatt shareholder (or soon will be).

An easy way to get started with SLAuto

September 4, 2007 4 comments

It’s been an interesting week, leading up to the Labor Day weekend, but as of this morning I get to talk more openly about one project that has been taking a great deal of my time. As I have blogged about Service Level Automation (“SLAuto”), it may have dawned on some of you that achieving nirvana here means changing a lot about your current architecture and practices.

For example, decoupling software from hardware is easy to say, but requires significant planning and execution to implement (though this can be simplified somewhat with the right platform). Building the correct monitors, policies and interfaces is also time intensive work that requires the correct platform for success. However, as noted before, the biggest barriers to implementing SLAuto and utility computing are cultural.

There is an opportunity out there right now to introduce SLAuto without all of the trappings of utility computing, especially the difficult decoupling of software from hardware. It is an opportunity that Silicon Valley is going ga-ga over, and it is a real problem with real dollar costs for every data center on the planet.

The opportunity is energy consumption management, aka the “green data center”.

Rather than pitch Cassatt’s solution directly, I prefer to talk about the technical opportunity as a whole. So let’s evaluate what is going on in the “GDC” space these days. As I see it, there are three basic technical approaches to “green” right now:

  1. More efficient equipment, e.g. more power efficient chips, server architectures, power distribution systems, etc.
  2. More efficient cooling, e.g. hot/cold aisles, liquid cooling, outside air systems, etc.
  3. Consolidation, e.g. virtualization, mainframes, etc.

Still, something obvious is missing here: no matter which of these technologies you consider, not one of them actually turns off unused capacity. In other words, while everyone is working to build a better light bulb or to design your lighting so you need fewer bulbs, no one is turning off the lights when no one is in the room.

That’s where SLAuto comes in. I contend that there are huge tracts of computing in any large enterprise where compute capacity runs idle for extended periods. Desktop systems are certainly one of the biggest offenders, as are grid computing environments that are not pushed to maximum capacity at all times. However, possibly the biggest offender in any organization that does in-house development, extensive packaged system customization or business system integration is the dev/test environment.

Imagine such a lab where capacity that goes unused each evening and weekend, or for all but two weeks of a typical development cycle, or at all times except when testing a patch to a three-year-old rev of a product, is shut down until needed. Turned off. Non-operational. Idle, but not idling.

Of course, most lab administrators probably feel extremely uncomfortable with this proposition. How are you going to do this without affecting developer/QA productivity? How do you know it’s OK to turn off a system? Why would my engineers even consider allowing their systems to be managed this way?

SLAuto addresses these concerns by simply applying intelligence to power management. A policy-based approach means a server can be scheduled for shutdown each evening (say, at 7PM), but be evaluated before shutdown against a set of policies that determine whether it is actually OK to complete the shutdown.

Some example policies might be:

  • Are certain processes running that indicate a development/build/test task is still underway?
  • Is a specific user account logged in to the system right now?
  • Has disk activity been extremely low for the last four hours?
  • Did the owner of the server or one of his/her designated colleagues “opt-out” of the scheduled shutdown for that evening?

Once these policies are evaluated, we can see if the server meets the criteria to be shut down as requested. If not, keep it running. Such a system also needs to provide interfaces for both the data center administrators and the individual server owners/users to control the power state of their systems at all times, set policies and monitor power activities for managed servers.
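To make the idea concrete, here is a minimal sketch of that veto-style policy evaluation. Everything here is hypothetical (the `ServerState` fields, the 5% disk-activity threshold); a real system would pull these values from live monitoring agents rather than a hand-built snapshot.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ServerState:
    """Snapshot of a lab server at its scheduled shutdown time (fields are illustrative)."""
    running_build: bool          # is a development/build/test task still underway?
    logged_in_users: List[str]   # accounts logged in right now
    disk_busy_pct_4h: float      # average disk activity over the last four hours
    opted_out: bool              # did the owner opt out of tonight's shutdown?

# Each policy returns True if the server must stay up (a veto).
POLICIES: List[Callable[[ServerState], bool]] = [
    lambda s: s.running_build,
    lambda s: bool(s.logged_in_users),
    lambda s: s.disk_busy_pct_4h > 5.0,
    lambda s: s.opted_out,
]

def ok_to_shut_down(state: ServerState) -> bool:
    """Complete the scheduled shutdown only if no policy vetoes it."""
    return not any(policy(state) for policy in POLICIES)

idle = ServerState(running_build=False, logged_in_users=[], disk_busy_pct_4h=0.3, opted_out=False)
busy = ServerState(running_build=True, logged_in_users=["dev1"], disk_busy_pct_4h=40.0, opted_out=False)
print(ok_to_shut_down(idle))  # True
print(ok_to_shut_down(busy))  # False
```

The veto structure matters: adding a new safeguard is just appending a function to the list, which is why configuring policies should be easier than scripting the whole shutdown decision by hand.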

I’ll talk more about this in the coming week, but I welcome your input. Would you shut down servers in your lab? Your grid environment? Your production environment? What are your concerns with this approach? What policies come to mind that would be simple and/or difficult to implement?

What do SOA and EDA have to do with SLA?

April 30, 2007 1 comment

I’ve been launching off of Todd Biske’s blog roll into the world of SOA and EDA blogging. I’m actually kind of saddened that my voyage into the world of infrastructure automation has pulled me so far from a world in which I was an early practitioner. (My career at Forte Software introduced me to service-oriented architectures and event-based systems long before even Java took off.) I love what the blogging world is doing for software architecture (and has been doing for some time now), and I feel like a kid in a candy store with all the cool ideas I’ve been running across.

One blog that has been capturing my interest is Jack van Hoof’s “SOA and EDA”. I love a blog with real patterns, term definition, and a passion for its subject matter. All put together by someone who can get an article published.

The article is actually very interesting to me from a Service Level Automation perspective. Jack captures his thoughts on the importance of building agile software architectures in the following paragraph:

Everything is moving toward on-demand business where service providers react to impulses – events – from the environment. To excel in a competitive market, a high level of autonomy is required, including the freedom to select the appropriate supporting systems. This magnified degree of separation creates a need for agility; a loose coupling between services so as to support continuous, unimpeded augmentation of business processes in response to the changing composition of the organizational structure.

(Emphasis mine.)

The only thing I would change about Jack’s statement above is replacing the words “a loose coupling between services” with “a loose coupling between services and between services and infrastructure”, and changing “composition of the organizational structure” to “composition of the organizational structure and infrastructure environment”. (Some may have issues with the latter, but I don’t mean that services should be written with specific technology in mind–just the opposite; they should be written with an eye towards technology independence.)

This is why I have been emphasizing lately the need to view the measure activity through the lens of both business and technical measures. Some of the business events thrown by an EDA may very well indicate the need to change the infrastructure configuration (e.g. if the stock market sees a 20% rise in volume in a matter of three minutes, someone may want to add capacity to those trading systems). However, the technical events from a software system (e.g. thread counts or I/O latency) may also indicate the need to change infrastructure configuration on the fly.
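The dual lens can be sketched as a simple event dispatcher. The topic names, thresholds, and response strings below are all hypothetical; the point is only that business events and technical events flow through the same mechanism and can both trigger infrastructure change.

```python
from typing import Callable, Dict

def on_volume_spike(event: Dict) -> str:
    """Business event: a sharp rise in trading volume over a short window."""
    if event["volume_change_pct"] >= 20 and event["window_minutes"] <= 3:
        return "add-capacity:trading-tier"
    return "no-op"

def on_io_latency(event: Dict) -> str:
    """Technical event: storage I/O latency crossing a threshold."""
    if event["latency_ms"] > 250:
        return "add-capacity:storage-tier"
    return "no-op"

# Both kinds of event are routed through one table of handlers.
HANDLERS: Dict[str, Callable[[Dict], str]] = {
    "business.market.volume": on_volume_spike,
    "technical.io.latency": on_io_latency,
}

def dispatch(topic: str, event: Dict) -> str:
    """Route an event to its handler; unknown topics are ignored."""
    handler = HANDLERS.get(topic)
    return handler(event) if handler else "no-op"

print(dispatch("business.market.volume", {"volume_change_pct": 22, "window_minutes": 3}))
# add-capacity:trading-tier
```

In a real EDA these responses would be events themselves, published for the infrastructure layer to act on, which keeps services and infrastructure loosely coupled in exactly the sense discussed above.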

I wish I could spend more time collaborating with SOA architects and “tacticians”. In fact, I have been speaking with Ken Oestreich about exactly this. If you are in the SOA space, and interested in talking about how SOA, EDA and SLA interconnect, let me know by commenting below. (Be sure to let me know how to contact you.) At the very least, think about how infrastructure will measure the performance of your software systems as you start your next development iteration.

Service Level Automation Deconstructed: Measuring Service Levels

This is the first of three in my series analyzing the key assumptions behind Service Level Automation. Specifically, today I want to focus on the measurement of business systems, and the concepts behind translating those measurements into service level metrics.

Rather than trying to do exhaustive coverage of this topic (and the other topics in this series) in a single post, what I am going to do is provide a “first look” post now, then use labels to tag follow-up posts as they add relevant information. The label for this topic will be “measure”.

In my next installment, I’ll introduce analysis of those metrics against service level objectives (SLO) the business requires. That post, and future related posts will be labeled with “analyze”.

In the final installment of the series, I’ll describe the techniques and technologies available to digitally manipulate these systems so that they run within SLO parameters. Posts related to that topic will be labeled “respond”.

As noted earlier, my objective is to survey the technologies, academics, etc., of each of these topics in an attempt to enlighten you about the science and technology that enables service level automation.

How do we measure quality of service?

Measuring quality of service is a complex problem, not so much because it is hard to measure information systems and business functionality. I (and I bet you) could list dozens of technical measurements that can be made on an application or service that would reflect some aspect of its current health. For example:

  • System statistics such as CPU utilization or free disk space, as reported by SNMP
  • Response to a ping or HTTP request
  • Checksum processing on network data transfers
  • Any of dozens of Web Services standards

The real problem is that human perception of quality of service isn’t (typically) based on any one of these measurements, but on a combination of measurements, where the specific combination may change based on when and how a given business function is being used.

For example, how do you measure the utilization of a Citrix environment? Measuring sessions/instances is a good start, but–as noted before with WTS–what happens when all sessions consume a large amount of CPU at once? CPU utilization, in turn, could fluctuate wildly as sessions are more or less active. Then again, what about memory utilization or I/O throughput? These could become critical completely independently from the others already mentioned.

No, what is needed is something more mathematical: one index (or a few) generated from a combination of the base metrics retrieved from the managed system.

There are tools that do this. They range from the basic capabilities available in a good automation tool, to the sophisticated evaluation and computation available in a more specialized monitoring tool.

What I am still searching for are standard metrics being collected by these tools, especially industry standard metrics and/or indexes that demonstrate the health of a datacenter or its individual components. I’ll talk more about what I find in the future, but welcome you to contribute here with links, comments, etc. to point me in the right direction.