IT Operational Excellence: Lone Ranger to NFL to CSI or is it marching band

In 1993, Frederik Wiersema et al. wrote a Harvard Business Review piece on Customer Intimacy, Operational Excellence, and Product Leadership.  IT Operations departments commonly focus on Operational Excellence, and Change Management tends to be the common thread for avoiding operational issues that arise during maintenance windows.

My intention was to quote statistics on human error during maintenance windows, but I found the statistics to be too specific to individual disciplines (e.g. telephony, data center).  So, trust me when I say that it is easy to envision that managers would prefer less human error than average during maintenance windows or other types of change.  Certainly, they would wish to avoid downtime.  Microsoft did a good job explaining types of downtime.

I used to hear stories of C-level execs saying after an outage, “We have the navy training cadets to operate nuclear submarines, so why can’t we get IT professionals not to cause outages?”

Let me start with how bureaucracies are formed.  Organizational maturity requires different skill sets.  Until there is enough organizational size, there is unique knowledge and thus the Lone Rangers emerge (forgive the oxymoron of a plural Lone Rangers).

Starting off, there needs to be an expert, a Lone Ranger, who still might be a jack-of-all-trades.  “Hey, we need someone to do <blank>”.  At this point, there isn’t much operational rigor, as the organization probably is not too sophisticated.  It is possible that the person who is responsible doesn’t even write anything down; they just execute when need be.  They evaluate the risk, evaluate the solution, and decide.

Next, another person is added to share responsibility for the technology.  At this point, coordination may just be yelling over the cubicle wall – “Hey, I’m going to change this.”

As more people are added, the change management becomes a bit more sophisticated, as multiple people need to be notified.

Then the enterprise becomes more complex with more users, more dependencies, and/or more interactions.  So, change control now comes into place.  The Lone Ranger mentality no longer works.  “Is risk assessed properly?”  “Who is responsible and is that up to their pay grade?”

Enter the CSI Lab Technician.

It could be after the environment has grown, or it could be after the organization has entered a new audit scope, that significant operational rigor is added.  When a company falls under audit scope, for instance Sarbanes-Oxley (SOX), Payment Card Industry (PCI), or the Health Insurance Portability and Accountability Act (HIPAA), more rigor must be applied.  Another body (usually the auditor) is trying to ensure that all the requirements are being performed to a certain standard.

In “CSI: Crime Scene Investigation”, one sees the scientists in the lab analyzing trace evidence, usually under some pressure because the sample is from the suspect in the interrogation room that they’ve been chasing all day.  Well, in real life, I doubt that the lab techs know the names of the people whose samples they are analyzing, because they need to maintain neutrality and not be biased.  Bias tends to get things thrown out in court, because there are legal standards.  Also, for legal scrutiny, there are standard procedures for handling evidence.  For the chemist, there are standard procedures for how samples are placed under the microscope, so that they aren’t dropped or contaminated.

I worked with a former chemist who transferred into IT.  I’d want him to switch between Excel and Word.  Rather than have them up simultaneously and task switch between them, he would go through the same routine:  File/Save.  File/Close.  File/Exit.  Then open the next program.  I could accept his concerns about a RAM shortage given the vintage of his hardware – but I struggled to be patient.  “You could just click the ‘x’ and it’ll prompt you to save, then it will close it out”.  “Yes, but I feel more comfortable doing it this way.”  An adherence to procedure provided comfort.

Prior to this, I mentored two student workers.  One was a Computer Science major, the other a Biology major.  They were both very good.  I was always entertained by handing them the same hard problem to solve.  The computer science major was very intuitive in his problem solving – trying different solutions based upon hunches and feel.  The biology major would attack problems very sequentially – trying the most frequent solution to similar problems first, then the next, and so on.

In my experience, computer programmers and engineers are drawn to their careers by the problem-solving aspect of the jobs.  What has made them successful through college and the early part of their careers has been the Lone Ranger aspect:  Identify the problem quickly and solve the problem.  But now, with rigorous change control, the organization is looking for methodical, repeatable, standardized solutions.  There ends up being an incongruity between the personality of the typical IT worker and the job to be performed.

In The Leadership Pipeline: How to Build the Leadership-Powered Company, Ram Charan, Stephen Drotter, and James Noel discuss how, when individuals move from leadership tier to leadership tier (individual contributor to manager to director, then higher), the person needs to utilize different skills at each tier, and not rely on the skills that helped them succeed at the last one.  In a similar vein, I posit that when significant changes come to an operating environment, IT workers and IT teams need to modify their skill sets to provide Operational Excellence.

When such changes are mandated, of course, it is important that teams be supplied with the resources necessary to be successful whether that be training or equipment.  And managers need to identify that the responsibilities have changed and communicate that to their staff accordingly.

Enter the football game

When one watches the NFL, even though these professionals are paid 6 or 7 or 8 figures a year, one still sees dumb penalties.  These players have probably played football since Pop Warner as youths, yet you still see the occasional 12-men-on-the-field penalty by the defense prior to a field goal attempt.  How hard is it to get the right personnel on the field?  Or how hard is it for the offensive line not to false start – they know the signal for the ball snap.  So, mental errors by professionals still occur.  [I drafted this before the last football game of the AFC-leading Broncos, where they were caught with 12 men on the field 3 times!  Once they avoided the penalty by calling a timeout before getting penalized.]

An NFL football game has changes on every play:  Different formations, different routes, and different yardage goals.  And during the snap count, maybe the quarterback changes the play because he doesn’t like the defense that he sees.  When things go bad after the snap, receivers may have to break off routes.  Lots of change – every single play.  And it doesn’t always go right.

Alternatively, there are the halftime routines.  For high school & college, there are the marching bands.  Everyone has their own place and may have unique music.  Zero improvising is required, as all of this is planned out ahead of time.  See this video for an example of the coordination required:

Both the football game and the halftime routines require much practice.  The difference is where improvising is required.  The trick for Operational Excellence in IT is to ensure that maintenance windows have more rehearsal and less improvising, and that there is time to practice.  That rehearsal and discipline may be contrary to the methodologies of some IT workers.

I also recognize that the discipline to rehearse and to duplicate environments is easier said than done – lab environments struggle to perfectly match production, simulated workloads are difficult to match as well, and testing time is hard to come by.  However, those organizations that strive to drive human error out of their maintenance events decide it is better to spend the resources ahead of time, as opposed to reacting after the fact and spending potentially just as many resources post mortem.

Jim – 12/16/13


View Jim Surlow's profile on LinkedIn (I don’t accept general LinkedIn invites – but if you say you read my blog, it will change my mind)