
Posted by Susan at 5:44 PM on Thursday, September 18, 2008

You arrive early one morning only to discover that a mission-critical system, the one that supports inventory fulfillment for your retail operations, is DOA - again.

And, as the system owner, it's your problem to solve.

It's a sobering reality that many companies are at major risk due to the fragile technologies that support their operations. System downtime or degradation exacts a huge financial toll: up to 3.6% of revenue, according to one 2004 study.

And no one is to blame. Every system requires modifications over time. Because change happens gradually, new functionality and technologies get bolted on and, eventually, the system becomes more and more complex and unstable.

You've convened meetings with various overworked IT specialists to discuss how to improve the health of the system. Like physicians diagnosing a rare condition, they do lots of hemming and hawing and milling around smartly. Justifiably, the specialists raise serious concerns about jumping in and trying to fix the system, since changes to one part of the system can have cascading, unanticipated impacts on others.

The conversations with the IT specialists swirl. There are prescriptions for design modifications, rewrites, service level objectives, application monitoring tools, and process improvements, but the ideas never settle down into a logical treatment plan.

And none of these prescriptions addresses the need to decrease outages to a tolerable level.

Avoid this scenario. You don't want to get lulled into buying a long-term wellness program without first stabilizing the existing system, just as you wouldn't start a running regimen after learning of a heart condition.

You don't need meetings - you need a mandate and a team of experts focused on your case.
You need the organizational equivalent of an IT emergency room.

Here's what to do.

Appeal to the powers-that-be to organize a dedicated cross-functional team of IT and business specialists.

Assign business experts to work side-by-side with IT experts, hand-selected from the various IT specialties, including architecture, development, infrastructure engineering, and the help desk.

Ensure that this team reports to a seasoned IT executive who, in turn, reports directly to you and the CIO.

Once this team is in place, make sure they focus on the following four imperatives:

1. Start evaluating the symptoms (aka outages), documented in incident reports available to everyone on the team. Ideally, the incidents should be funneled through the help desk for initial diagnosis and escalation. However, since many organizations don't have trained help desk personnel and disciplined incident management processes, ensure that these calls go to the team and that they document the issues and the band-aids they applied to get the system up and running.

2. Document the business process, applications, data, and infrastructure architectures. The team needs to have a common, big-picture view of what the system does and how it's built. This entails analyzing the business process, mapping the process to the underlying applications and data, and then mapping the applications and data to the underlying technologies. Without this perspective, it's impossible to diagnose and rectify issues. You'll quickly discover how remarkably (and scarily) little is known about systems that have been around for years, doing really important things.

3. Conduct differential diagnosis. Analyze the incidents to identify problems and prioritize fixes based on business impact. Over time, the team will see patterns emerge (e.g., problems occur at month end, when volume reaches certain levels, when data contains certain values, etc.) and root cause analysis will focus on identifying changes that will reduce the frequency and duration of the outages.

4. Implement the changes. Guard against introducing more problems and instability by testing the changes in an environment that mirrors the one driving the production system.

Continue steps 2 through 4 until the outages reach an "acceptable" level.
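For teams that keep their incident reports in a spreadsheet or a simple database, the prioritization in step 3 can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the `Incident` fields, the cost-per-minute figure, and the three-day "month end" window are all hypothetical, and a real team would substitute whatever its own incident reports capture.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Incident:
    # All fields are illustrative assumptions about what an incident report holds.
    day: date                # date the outage occurred
    cause: str               # suspected cause recorded by the team
    downtime_minutes: int    # how long the system was unavailable
    cost_per_minute: float   # assumed business cost of one minute of downtime


def prioritize(incidents):
    """Group incidents by suspected cause and rank causes by total business impact."""
    totals = defaultdict(lambda: {"count": 0, "minutes": 0, "impact": 0.0})
    for inc in incidents:
        entry = totals[inc.cause]
        entry["count"] += 1
        entry["minutes"] += inc.downtime_minutes
        entry["impact"] += inc.downtime_minutes * inc.cost_per_minute
    # Highest business impact first, so the costliest problem gets fixed first.
    return sorted(totals.items(), key=lambda item: item[1]["impact"], reverse=True)


def is_month_end(day, window=3):
    """Return True if the date falls within the last `window` days of its month."""
    first_of_next_month = date(day.year + (day.month == 12), day.month % 12 + 1, 1)
    return (first_of_next_month - day).days <= window


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        Incident(date(2008, 9, 30), "batch overload", 120, 50.0),
        Incident(date(2008, 8, 29), "batch overload", 60, 50.0),
        Incident(date(2008, 9, 10), "bad data value", 30, 50.0),
    ]
    for cause, stats in prioritize(sample):
        print(cause, stats)
    month_end_count = sum(is_month_end(inc.day) for inc in sample)
    print("incidents near month end:", month_end_count)
```

Ranking by total impact rather than by incident count keeps the team focused on what hurts the business most, and a simple flag like the month-end check is one way to surface the timing patterns described in step 3.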

Disband the dedicated team once an ongoing "wellness" program is defined that ensures regular monitoring of system performance so that issues can be escalated and quickly addressed.

Finally, start developing a business case to justify new systems to replace the problematic ones. Keep in mind that the goal is not to replace the existing system in kind, but to identify business process changes that will provide fundamental improvements to business, as well as systems, performance. 

