Get Innovative About Performance
One aspect of IT management automation that has frustrated me for years is the relatively lackluster progress we’ve made in the area of performance. With all the other wonderful innovations we’ve made, I’m confounded that performance remains largely in the dark ages. This need not be. This must not be. We are finally beginning to see welcome changes in both available software solutions and, more importantly, attitudes toward performance. If we hope to attain true service management, this must be a priority.
The world’s requirements have shifted to a point where performance, done right, has become a necessity. The operative phrase here is done right. Many IT operations are stuck in the obsolete mode of approaching performance from a myopic perspective of the infrastructure and often based on arbitrary static thresholds.
I wrote a META Group paper in June 2003 entitled Performance Monitoring versus Performance Analysis that echoes much of what I state here. Despite the nearly 5 years of elapsed time since, little has changed. Many sit here in the genesis of 2008 and unfortunately still view performance through a 1998 lens. Vendors have been part of the problem, shipping products that do little more than collect raw metrics and generate reports. It is time to finally make the change … and we are. With more attention on best practices such as ITIL, the philosophical shift is finally happening. It is also coincident with a similar evolutionary step in automated management technologies.
Let us first examine the available reporting products and how they are changing. Performance tools fall into one or both of two categories: reporting and analysis. Reporting is the fairly simplistic category of tools to which I just alluded. Collection and reporting proved valuable in the early days (1990s) when we had no visibility at all. Our capacity challenges were centered around infrastructure (e.g., network links, servers) and our best alternatives were streams of raw textual data. The pretty graphs generated by performance reporters were a big step forward. Today, however, this type of reporting is so commoditized that some of the most popular tools are from the open source community (e.g., MRTG, Nagios).
Simple reporting is also very misleading. When thousands (maybe millions) of metrics are imposing varying degrees of influence on service quality, how can any level of manual review be possible? Indeed it is not possible, so the only alternative is to collect all the metrics and analyze only a few. This analysis is often only performed when trouble arises. Although such situations occur more often than they should, the likelihood that collected data is viewed, used, or otherwise accessed is very low. It turns out that over 95% of collected data is never touched. The remaining <5% gives a limited perspective of service health, but this subset drives a lot of decisions. When decisions are based on such a narrow view of the world, trouble is inevitable.
Automated performance analysis is where the real action lies. The best tools should encompass a wide swath of data, determine the relationships between the data elements, and process the data based on some admittedly esoteric mathematics. This is not easy, but it is necessary and a few vendors are stepping up to the challenge.
Two of the most promising vendors in the performance analysis market are Netuitive and NetQoS. Visit their web sites for much more than I can describe here, but the excitement I feel from both is because they are tackling the thorny issues of analysis with some genuinely cool technology based on some deep cerebral concepts that will make most of our heads spin. Yes, this is complex, but it is the responsibility of vendors to insulate the complexity and make it all seem easy. This is what Netuitive, NetQoS, and a small number of other innovators are doing. So far, the established vendors are mostly languishing in the old world view. It’s time for them to wake up and either innovate themselves, or go hunting for one of these smaller mavericks. BMC recently took the bold step of acquiring ProactiveNet. This was a good move and indicative of the future.
One of the simplest mechanisms, single-variable statistical baselining, is finally becoming commonplace, even among the stodgier vendors. This form of baselining establishes a statistical pattern of normal behavior and determines when performance is straying outside the bounds of this normality. Being single-variable, it only focuses on one metric at a time, so it can monitor CPU utilization for a pattern, but it ignores the impact of memory swapping or network I/O.
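To make the idea concrete, here is a minimal sketch of single-variable baselining in Python. It is only an illustration under simple assumptions, not how any particular vendor does it: the baseline is a mean and standard deviation per hour of day, "normal" is taken to mean within k standard deviations of that mean, and the function names and sample readings are hypothetical.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Group (hour_of_day, value) samples by hour and return
    {hour: (mean, standard deviation)} for hours with at least 2 samples."""
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def is_anomalous(baseline, hour, value, k=3.0):
    """True when value lies more than k standard deviations from the
    mean learned for that hour of day (k=3 is an assumed default)."""
    mu, sigma = baseline[hour]
    return abs(value - mu) > k * sigma


# Hypothetical CPU utilization history: busy at 14:00, quiet at 03:00.
history = (
    [(14, v) for v in (82, 88, 91, 85, 94, 80, 90)]
    + [(3, v) for v in (12, 15, 10, 14, 11, 13, 12)]
)
baseline = build_baseline(history)
```

With this baseline, a reading of 85% is normal at 14:00 but anomalous at 03:00, which is exactly the kind of time-dependent judgment a single fixed number cannot make. Note that the sketch still looks at one metric only, so it shares the blind spot described above.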
Still, this is a huge improvement over static thresholds. Static thresholds are useful for a very limited set of metrics (e.g., application response time as defined in an
thresholds can generate lots of false alarms and ignore valid performance anomalies that lie below the threshold (or above, depending on the nature of the threshold). A good example is CPU utilization. If I set a threshold to alarm above 75%, it will alarm at 85% during a time when normal behavior is in the range of 80 to 95%. Conversely, if the CPU is at 20% during this time, it is an anomaly, yet a static threshold will ignore this situation and fail to notify.
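The CPU example can be written out in a few lines of Python. This is a sketch using only the numbers from the example (a 75% static threshold and a normal range of 80 to 95%); the function names are hypothetical, and the normal range is hard-coded here rather than learned from history.

```python
def static_alarm(cpu, threshold=75.0):
    """Classic static threshold: alarm whenever CPU exceeds a fixed value."""
    return cpu > threshold


def band_alarm(cpu, low=80.0, high=95.0):
    """Baseline-style check: alarm whenever CPU leaves the normal range."""
    return cpu < low or cpu > high


for cpu in (85.0, 20.0):
    print(f"{cpu:5.1f}%  static={static_alarm(cpu)}  band={band_alarm(cpu)}")
```

At 85% the static threshold raises a false alarm while the band check stays quiet; at 20% the static threshold is silent while the band check flags the anomaly.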
Where analysis gets truly exciting is when we can extend beyond the single variable to determine the interrelationships between multiple variables. This is hideously difficult, but the innovators I mentioned earlier are now delivering such capabilities. These technologies will continue to evolve, as we are only now experiencing the infancy of automated analysis.
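As a toy illustration of what looking at two variables together buys you, the Python sketch below learns how CPU utilization normally tracks request rate and flags readings that break that relationship. This is a deliberately simple stand-in (an ordinary least-squares line plus a residual check), not the mathematics the vendors above use; the metric pairing, function names, and sample data are all assumptions.

```python
from statistics import mean, stdev


def fit_relationship(xs, ys):
    """Least-squares line y = intercept + slope * x, plus the standard
    deviation of the residuals around that line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return slope, intercept, stdev(residuals)


def breaks_relationship(model, x, y, k=3.0):
    """True when y is more than k residual standard deviations away
    from what the fitted line predicts for x."""
    slope, intercept, sigma = model
    return abs(y - (intercept + slope * x)) > k * sigma


# Hypothetical history: requests per second and the CPU % observed with them.
requests = [100, 200, 300, 400, 500, 600, 700, 800]
cpu = [12, 21, 33, 39, 52, 61, 68, 81]
model = fit_relationship(requests, cpu)
```

A reading of 60% CPU sits comfortably inside the historical range of 12 to 81%, so a check on CPU alone has little reason to complain. Paired with only 150 requests per second, though, it is far from what the learned relationship predicts, and `breaks_relationship(model, 150, 60)` flags it. Real environments involve many more than two metrics and relationships that are not linear, which is where the hard part begins.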
It doesn’t make sense to struggle with manual analysis based upon simple reporting. The world is getting far too complex for this mode. Let automation do the hard work for you. The technology is finally maturing to a point where it can offer great relief to the increasingly painful performance problems that plague most IT organizations.
Analysis has proven effective for fault management (evaluation of up/down conditions), but performance is a different animal. Whereas fault management deals with binary conditions of black and white, performance involves the full palette of colors and shades of gray. Of course, dealing in colors is much more difficult than black and white, but help is now here.
In this age of HDTV, it’s a bit absurd that we still view performance in black and white!
January 24th, 2008 at 10:15
Great summary of the innovation in performance management and the desire for that to continue. I couldn’t have said it better myself!
January 24th, 2008 at 21:41
Thanks, Ryan! We performance bigots need to stick together!
January 25th, 2008 at 22:16
Great thoughts, Glenn. How about we all do a review of the current players in this space? When I was evaluating a replacement for an ailing and overworked Concord eHealth system at ELNK, we put all the vendors through their paces. Leaders in this space have many of the capabilities you outlined above and more. Our leading candidates at the time were InfoVista and Quallaby (acquired by Micromuse then IBM) and we ultimately recommended InfoVista due to the front end look and feel our clients desired. Multi-variable, complex analytic based scenarios could easily be modeled within these tools with composite events generated.
I see two silos in this area still - the service providers and the enterprise. Partly a matter of sheer scale, but automation, analytics, reporting and visualization advances are happening at a faster pace on the service provider side, best as I can tell. I just see too many tools doing their own graphing, charting and performance-management-like things, which continue to chip away at the goals of a consolidated performance management solution. We have no fewer than four or five (maybe more) tools in our IBM Tivoli portfolio that play in these areas on both sides of the fence. What we lack is an OOB capability as you’ve described above, similar to what ProactiveNet, Netuitive, and Integrien offer. This is often left to sophisticated programmatic rules engine development, suppression and correlation engines within our suite.
I think that plain old performance management is tainted as you’ve described. Being the BSM guy that I am, I think the real innovation may come when the performance of key business services, applications and systems can be looked at from end-to-end and directly tied back into the business goals and objectives. The blending of the IT management side, Business Service Management and Business Intelligence (specifically Operational BI). Innovation here is needed and may be one way to put the groove back in performance management and monitoring.
I look forward to following your new voice outside EMC!
Doug
January 31st, 2008 at 09:47
Thanks Doug! Your observations are right on the money, as usual!
Your point on the dichotomy of the enterprise and the service provider sides is particularly notable. I’ve always found this separation a bit distressing, but as you note, the differences in scale and maturity are the main reasons. Traditionally, service providers were far ahead of enterprises in their operational sophistication (hmm, I smell another blog posting here!), which forced them to take a different approach.
Enterprises were more enamored of “eye candy” than capability and automation. This is thankfully beginning to change because it must. The complexity and scale of many enterprises rival those of service providers, and the stronger focus on operational efficiency (a la ITIL) is accelerating their transformation into internal service providers.
This is great news, albeit a shift that is painful for many of those involved. No pain, no gain! Besides, it’s far less painful than punitive outsourcing.
February 6th, 2008 at 19:25
Excellent post, Glenn. It is really great to see thoughts like these coming out now. A year ago, this was a real education challenge for Integrien when talking with prospects for our technology, but the tide is turning in a big way. Companies are seeking us out now looking for help and are much better educated on what real time, analytics-based, performance management solutions can provide.
As you clearly articulate, the performance management problem can no longer be handled by throwing more people at it and relying on static threshold-based alerting and tribal knowledge-based correlation. Of course, new technologies like virtualization and SOAs, which provide undeniable benefits, compound the existing management problem tremendously. A new approach to performance management is the only way to scale in the face of this growing complexity. You’ve really captured the key elements required to address the problem here: dynamic threshold-based alerting and real time correlation of alert and metric behavior.
Too many IT Operations teams are mired in monitoring events and spend tremendous amounts of manual effort figuring out what alerts are important and which they can ignore. Correlation is limited to simple rules based on tribal knowledge. I was talking to a VP, IT Operations at a large online financial who told me that he had 10 highly paid individuals on his team who do nothing but look at graphs all day. They are comparing week over week metric behavior to identify abnormalities and manually correlating them to solve performance issues. I’ve since seen them doing this, printing out graphs of metric data and holding them up to the light to compare them. It is a slow process that seems archaic, but it is common practice in many Operations teams I’ve talked with. This particular team needs to scale their application infrastructure by a factor of three over the next two years while keeping budget relatively flat. Without automation of these manual tasks, there is no way that will be possible. Of course, this is why they are looking at Integrien’s solution.
I look forward to continued discussion on the differing approaches to analytics-based performance management and how adoption and operationalization are proceeding.
Steve
March 12th, 2008 at 09:42
Thanks, Steve (and sorry for the long delay!),
I have also noticed the changing tide toward a smarter way to understand and address performance problems. The trend has been slow, but it is accelerating because people have no choice. Traditional methods and tools cannot possibly fulfill the need.
It is good to see new technology solutions - and new thinking - beginning to take over!
– Glenn –