IT Operational Excellence: Lone Ranger to NFL to CSI or is it marching band


In 1993, Frederik Wiersema et al. wrote a piece in Harvard Business Review on Customer Intimacy, Operational Excellence, and Product Leadership.  IT Operations departments commonly focus on Operational Excellence, and Change Management tends to be the common thread for avoiding operational issues that arise during maintenance windows.

My intention was to quote statistics on human error during maintenance windows, but I found the statistics too specific to individual disciplines (e.g. telephony, data center).  So, trust me when I say that it is easy to envision that managers would prefer less human error than average during maintenance windows or other types of change.  Certainly, downtime is something to be avoided.  Microsoft did a good job explaining types of downtime.

I used to hear stories of C-level execs saying after an outage, “We have the navy training cadets to operate nuclear submarines, so why can’t we get IT professionals not to cause outages?”

Let me start with how bureaucracies are formed.  Organizational maturity requires different skill sets.  Until there is enough organizational size, there is unique knowledge and thus the Lone Rangers emerge (forgive the oxymoron of a plural Lone Rangers).

Starting off, there needs to be an expert, Lone Ranger, who still might be a jack-of-all-trades.  “Hey, we need someone to do <blank>”.  At this point, there isn’t much operational rigor as that organization probably is not too sophisticated.  It is possible that the person who is responsible doesn’t even write anything down, they just execute when need be.  They evaluate risk, evaluate the solution, and decide.

Next, another person is added to the responsibility of the technology.  At this point, coordination may just be yelling over the cubicle wall – “Hey, I’m going to change this.”

As more people are added, the change management becomes a bit more sophisticated, as multiple people need to be notified.

Then the enterprise becomes more complex with more users, more dependencies, and/or more interactions.  So, change control now comes into place.  The Lone Ranger mentality no longer works.  “Is risk assessed properly?”  “Who is responsible and is that up to their pay grade?”

Enter the CSI Lab Technician.

It could be after the environment has grown, or it could be after the organization has entered a new audit scope, that significant operational rigor is added.  When a company falls under an audit scope such as Sarbanes-Oxley (SOX), Payment Card Industry (PCI), or the Health Insurance Portability and Accountability Act (HIPAA), more rigor must be applied.  Another body (usually the auditor) is trying to ensure that all the requirements are being performed to a certain standard.

In “CSI: Crime Scene Investigation”, one sees the scientists in the lab analyzing trace evidence, usually under some pressure because the sample is from the suspect in the interrogation room whom they’ve been chasing all day.  Well, in real life, I doubt that the lab techs know the names of the people whose samples they are analyzing, because they need to maintain neutrality and not be biased.  Bias tends to get things thrown out in court, because there are legal standards.  Also, for legal scrutiny, there are standard procedures for handling evidence.  For the chemist, there are standard procedures on how samples are placed under the microscope, so that they aren’t dropped or contaminated.

I worked with a former chemist who transferred into IT.  I’d want him to switch between Excel and Word.  Rather than have them up simultaneously and task switch between them, he would go through the same routine:  File/Save.  File/Close.  File/Exit.  Then open the next program.  I could accept his concerns about a RAM shortage given the vintage of his hardware, but I struggled to be patient.  “You could just click the ‘x’ and it’ll prompt you to save, then it will close it out”.  “Yes, but I feel more comfortable doing it this way.”  Adherence to procedure provided comfort.

Prior to this, I mentored two student workers.  One was a Computer Science major, the other a Biology major.  They were both very good.  I was always entertained with handing them the same hard problem to solve.  The computer science major was very intuitive in his problem solving — randomly trying different solutions based upon hunches and feel.  The biology major would attack problems very sequentially – trying the most frequent solution to similar problems first, then the next, and so on.

In my experience, computer programmers and engineers are much more geared to their careers because of the problem solving aspect of the jobs.  What has made them successful through college and early part of their careers has been the Lone Ranger aspect:  Identify the problem quickly and solve the problem.  But, now with rigorous change control, the organization is looking for methodical, repeatable, standardized solutions.  There ends up being an incongruity between the personality of the normal IT worker and the job to be performed.

In The Leadership Pipeline: How to Build the Leadership Powered Company, Ram Charan, Steve Drotter, and Jim Noel discuss that when individuals move from leadership tier to leadership tier (individual contributor to manager to director, then higher), the person needs to use different skills at each tier, and not the skills that helped them succeed at the last one.  In a similar vein, I posit that when significant changes come to an operating environment, IT workers and IT teams need to modify their skill sets to provide Operational Excellence.

When such changes are mandated, of course, it is important that teams be supplied with the resources necessary to be successful whether that be training or equipment.  And managers need to identify that the responsibilities have changed and communicate that to their staff accordingly.

Enter the football game

When one watches the NFL, even though these professionals are paid 6 or 7 or 8 figures a year, one still sees dumb penalties.  These players have probably played football since Pop Warner as youths, yet you still see the occasional 12-men-on-the-field penalty by the defense prior to a field goal attempt.  How hard is it to get the right personnel on the field?  Or how hard is it for the offensive line not to false start, when they know the signal for the ball snap?  So, mental errors by professionals still occur.  [I drafted this before the last game of the AFC-leading Broncos, where they were caught with 12 men on the field 3 times!  Once they avoided the penalty by calling a timeout before getting penalized.]

An NFL football game has changes on every play:  Different formations, different routes, and different yardage goals.  And during the snap count, maybe the quarterback changes the play because he doesn’t like the defense that he sees.  When things go bad after the snap, receivers may have to break off routes.  Lots of change – every single play.  And it doesn’t always go right.

Alternatively, there are the halftime routines.  For high school & college, there are the marching bands.  Everyone has their own place and may have unique music.  Zero improvising is required, as all of this is planned out ahead of time.  See this video for an example of the coordination required:

Both the football game and the halftime routines require much practice.  The difference is where improvising is required.  The trick for Operational Excellence in IT is to ensure that maintenance windows have more rehearsal and less improvising, and that there is time to practice.  That rehearsal and discipline may be contrary to the methodologies of some IT workers.

I also recognize that the discipline to rehearse and to duplicate environments is easier said than done: lab environments struggle to perfectly match production, simulated workloads are difficult to match as well, and testing time is hard to come by.  However, those organizations that strive to drive human error out of their maintenance events decide it is better to spend the resources ahead of time, as opposed to reacting after the fact and spending potentially just as many resources post mortem.

Jim – 12/16/13


View Jim Surlow's profile on LinkedIn (I don’t accept general LinkedIn invites – but if you say you read my blog, it will change my mind)


Migrating to the Cloud – Technical Concerns of migrating to an IaaS Cloud

The thought of migrating to the Cloud can be treated flippantly or seem daunting, depending on where you sit on the optimist/pessimist scale. In reality, this is a matter of proportion to your environment.

In this week’s post, I’m talking specifically about Infrastructure-as-a-Service Cloud — rather than having a physical presence, your goal is to move to the cloud, so you don’t have to care for that hardware stuff.

My recommendation on where to begin is to ask the question: How would I migrate somewhere else?

It starts with: what services do I move 1st? When I worked at UC Irvine’s School of Humanities in the early 90s, we had to move into a new building and the finance staff needed to move 1st since they didn’t want to get caught closing the books at the very time we had to evacuate the old modular building. So, a server had to go over there to provide the Netware routing that we were doing between a classroom network and the office network (it was summer, so I didn’t have to worry about student congestion on the network, though the empty classroom I put the server in fell victim to the painters unplugging it). After the office staff could move, then I could bring the office Netware server over one evening. The important part of this story is that I needed networking to handle the people 1st.

Another move that I performed was similar. I had to move servers from downtown Denver to a new data center in south Denver. The users of those machines couldn’t deal with the network latency as our route went from downtown Denver to Massachusetts to south Denver. Those users had to move, then they had to get new AD (Microsoft Active Directory) credentials and new security tokens. So, the important part here is that the users needed an authentication infrastructure 1st.

While moving some servers from Denver to Aurora, into a new facility for us – we again were concerned about latency, so we needed to have authentication and name services also stood up in Aurora, so that authentication wouldn’t have to cross the WAN.

My point from these anecdotes is that it is not as simple as moving one OS instance. There are dependencies. Typically, those dependencies are infrastructure dependencies, and they typically exist so that latency can be avoided. [I haven’t defined network latency, but for those who need an example, think of TV interviews that occur when the anchor is in New York and the reporter is in the Middle East. The anchor asks a question and the reporter has to wait for all the audio to get to him while the viewer sees a pause in conversation. That is network latency: the amount of time it takes to travel the “wire”.]
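To put rough numbers on that pause, here is a minimal sketch of propagation delay alone. It assumes light travels through optical fiber at roughly 200,000 km per second (about two-thirds of its speed in a vacuum); the function names and the 5,600 km path length are hypothetical round numbers, not measurements from any of the moves described here.

```python
# Assumption: signal speed in optical fiber is roughly two-thirds of c.
FIBER_KM_PER_S = 200_000

def one_way_ms(path_km):
    """Milliseconds for a signal to travel path_km of fiber (propagation only)."""
    return path_km / FIBER_KM_PER_S * 1000

def round_trip_ms(path_km):
    """A request and its reply each cross the path once."""
    return 2 * one_way_ms(path_km)

# Hypothetical path: traffic that detours across the country and back.
detour_km = 5_600
print(f"one way: {one_way_ms(detour_km):.1f} ms, round trip: {round_trip_ms(detour_km):.1f} ms")
```

Real latency is higher, since routers, queuing, and protocol chatter all add to it, and a chatty application pays the round trip many times per user action.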

Back to dependencies – I may need DNS (domain name service) at the new site, so that every time I look up a name, the server doesn’t have to talk back to my local network to get that information. I may need authentication services (e.g. AD). I may need a network route outbound. I may need a database server. Now, these start to add up.

In my experience, there is a 1st wave – infrastructure services.

Then there is a 2nd wave – actual systems used by users. Typically, these are some guinea pigs which can endure the kinks being smoothed out.

Eventually, there are a bunch of systems that are all interrelated. This wave ends up being quite an undertaking, as this bulk of systems takes time to move and users are going to want minimal downtime.

Then after the final user wave, the final clean-up occurs – decommissioning the old infrastructure servers.
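The waves above fall out of the dependency list almost mechanically. As a minimal sketch, assuming a hypothetical inventory (the system names and their dependencies below are invented for illustration), Python's standard `graphlib` module can group systems so that nothing moves before the things it depends on:

```python
from graphlib import TopologicalSorter

# Hypothetical inventory: each system maps to the systems it depends on.
DEPENDS_ON = {
    "dns": set(),
    "active_directory": {"dns"},
    "outbound_route": set(),
    "database": {"dns", "active_directory"},
    "pilot_app": {"database", "active_directory"},
    "finance_app": {"database", "active_directory", "outbound_route"},
}

def migration_waves(depends_on):
    """Group systems into waves; a system moves only after all its dependencies."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything unblocked right now
        waves.append(ready)
        sorter.done(*ready)                 # mark the wave as moved
    return waves

waves = migration_waves(DEPENDS_ON)
for number, wave in enumerate(waves, start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

This only orders the moves. It says nothing about which user systems make tolerable guinea pigs, and the clean-up of the old infrastructure servers still comes last, after the final wave has landed.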


What I’ve presented is more about how to do a migration than a migration into the cloud. For the cloud, there may be additional steps depending on your provider – maybe you have to convert VMware VMs using OVFtool and then import.

VM portability eases the task. The underlying hardware tends to be irrelevant – as opposed to moving physical servers where there may be different driver stacks, different devices, etc. Obviously, one has to be cognizant of compatibility. If one is running IBM AIX, then one must find a cloud provider that supports this.

My point is that it is still a migration of how to get from A to B, and high level requirements remain the same (How is my data going to move – over the wire or by truck? What can I live with? What systems depend on what other systems?). The big difference between an IaaS Cloud migration and a physical migration is that servers aren’t moving from site A to site B – so there isn’t the “swing gear” conversation or the “physical move” conversation. This is a migration of landing on pre-staged gear. The destination is ready. Figure out the transport requirements of your destination cloud and get going!
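The "over the wire or by truck" question is mostly arithmetic. Here is a minimal sketch, assuming decimal terabytes and a link that delivers only a fraction of its rated speed; the function names and the 0.7 efficiency default are illustrative assumptions, not figures from any provider.

```python
def transfer_hours(data_terabytes, link_megabits_per_s, efficiency=0.7):
    """Hours to push the data over a network link at the given usable fraction."""
    bits = data_terabytes * 1e12 * 8                      # decimal terabytes to bits
    usable_bits_per_s = link_megabits_per_s * 1e6 * efficiency
    return bits / usable_bits_per_s / 3600

def faster_option(data_terabytes, link_megabits_per_s, truck_hours):
    """Compare the wire against a shipment that takes truck_hours door to door."""
    wire_hours = transfer_hours(data_terabytes, link_megabits_per_s)
    return "wire" if wire_hours <= truck_hours else "truck"

# Hypothetical cases: a small data set on a fast link, a large one on a slow link.
print(faster_option(1, 1000, truck_hours=48))
print(faster_option(50, 100, truck_hours=48))
```

The truck estimate should include the time to write and read the media at each end, which is easy to forget.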

Business considerations for the move to the cloud

Migration implies change and change implies risk. So, what are the hurdles that the decision maker has to clear before committing to a migration to the cloud?

First, what type of migration is it? Is it a migration to Infrastructure as a Service (IaaS), Platform as a Service (PaaS), or Software as a Service (SaaS) … or any of the other “fill in the blank as a Service” (XaaS)? Wikipedia can provide sufficient definitions for IaaS, PaaS, and SaaS, but just to quickly provide examples: IaaS allows one to hotel their computing environment – e.g. run Microsoft Server on someone else’s gear by renting it out. PaaS allows for a development environment to produce software on someone else’s gear and use their software development tools. SaaS allows one to run a specific software app in someone else’s environment; “webmail” was SaaS before there was a term for it. Now, it could be online learning, etc.

IaaS, PaaS, SaaS


Second, what are the risks? One trades Capital Expenses and some Operational Expenses for a different set of Operational Expenses. This also means that some control is turned over to the service. When I lose power to my house, since I haven’t built my own power plant, I’m at the mercy of the utility company. Power comes back when it comes back. I can’t re-prioritize tasks that the power company has set (e.g. bring my neighborhood back before the other neighborhood). The SLAs (Service Level Agreements) are where expectations for uptime, performance, etc. are set.

I’ve worked with some users who, when presented with the SLAs of internal systems, wanted to drive costs down. “Oh, I don’t need redundancy or highly available systems – these are test & development servers… except right before we do a code release, then the systems have to be up 24×7.” “Um, you don’t get to pick the time of your disaster or failure, so it sounds like you need to buy an HA system.”
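It helps to translate an uptime percentage into hours before having that conversation. A minimal sketch, assuming a 365-day year and that the percentage is measured over the whole year (actual SLAs define their own measurement periods and exclusions):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored

def allowed_downtime_hours(availability_percent):
    """Hours per year a service may be down and still meet the stated uptime."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)

for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% uptime allows about {allowed_downtime_hours(sla):.2f} hours down per year")
```

The figure is an annual budget, and the provider or the failure picks when it gets spent, which is exactly the problem with needing 24×7 only right before a code release.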

As systems become more complex, firms struggle with: “how is the expertise maintained?” Acquisition cost of gear is about 1/3 the total cost of gear. There is maintenance and then the administration. Unless one runs a tech company, the tech administration is not the company’s core competency. So, why would a company want to run that in their business?

This is the classic buy v. build decision. Of course, with IT, the problem is that after one builds, they still have to administer. And, after one buys, they still have to handle the vendor relations.

In addition to vendor relations, one has the concern about vendor longevity. Is the vendor going to be there for as long as your company needs it to be? What happens when the vendor goes out of business or ends the line of business?

Of course, on the build side, what happens when the expert you hired finds a new job, or you wish to promote him to a different position?

Non-profits have a different problem: funds may not be regular, and OpEx costs ad infinitum might not be serviceable. But hardware/software maintenance costs and training fall in the same boat.

A third consideration is security. How secure is your data in the cloud? Returning to the SaaS e-mail, it is fair to assume given recent revelations that the NSA is mining your e-mail off Gmail, Yahoo Mail, Hotmail, and others, just to name a few. One would hope that the systems are secure from hackers and this info is only leaking to the government lawfully. But, if you are concerned about hackers, how secure is your data in-house? So, there is a cost consideration for the build solution and there is a trust consideration given one’s provider.

The build v. buy decision is admittedly harder with technology given the high rate of change. This is especially true as it ties to security. Feature implementation is based upon service provider timetables and evaluation of risk. All this again returns to priorities and that in the build solution, one gets to make their own calls and evaluations.

In summary, one can select at what level they wish to move to the cloud. One needs to be concerned about the build v. buy decisions, but the cloud allows for granular moves (we put this out there, we don’t put that). Security, Vendor Longevity, Vendor Relations, etc. are big factors. Time & Labor need to be accounted for, whether doing it in-house or working to out-source. And, of course, there are the decisions about CapEx & OpEx.





Replication Methodologies

Frequently, when one tries to determine a disaster recovery transfer methodology, one gets stuck on the different methods available as they each have their strengths and weaknesses.  I’ll ignore backups for the time being – backup images can be tapes shipped by truck from primary site to recovery site, or backups replicated by backup appliances.  The focus will be on replication of “immediately” usable data sent over the network from primary site to recovery site.

Types of replication:

  • Application Replication – Using an application to move data.  This could be Active Directory, Microsoft SQL Server, Oracle DataGuard, or other application specific methods
  • OS Replication – This typically uses some application within the operating system to ship files … this could be something as simple as rsync
  • Hypervisor Replication – For those that run virtualized environments, replication occurs at the hypervisor layer.  This could be VMware vSphere Replication
  • Storage Replication – This is when an array replicates the data.


With Application Replication – one can be confident that the application is replicated in a usable form.  Without quiescing the applications, applications may not be entirely recoverable.  (quiescing is the process of placing the application in a state that is usable – usually, this consists of flushing all data in RAM to disk).  However, the question that inevitably comes up is:  “What about all the other systems I will need at the recovery site?”
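At the scale of a single file, "flushing all data in RAM to disk" looks like the sketch below. It is only an illustration of the idea using standard Python calls; the function name and file name are hypothetical, and real applications quiesce through their own mechanisms, which also have to settle in-flight transactions.

```python
import os
import tempfile

def quiesced_write(path, payload: bytes):
    """Write payload and return only after the OS has been asked to persist it."""
    with open(path, "wb") as handle:
        handle.write(payload)      # lands in Python's userspace buffer first
        handle.flush()             # push that buffer to the operating system
        os.fsync(handle.fileno())  # ask the OS to flush its cache to the device

with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, "ledger.dat")
    quiesced_write(target, b"committed transaction\n")
    with open(target, "rb") as check:
        restored = check.read()
```

Until the flush and sync happen, a copy taken from the disk can be missing data the application believes it has written.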

With OS Replication – the application that moves the data tends to be fairly simple to understand:  copy this directory structure (or folder) to the receiving system and place it in such and such folder.  However, with this methodology, the question that arises is:  “What about the registry and open files?”  If the process just sweeps the filesystem for changes, what happens to files that are open by some other application (i.e. not quiesced, which could be a problem with database servers)?
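A filesystem sweep can be sketched in a few lines. This is a simplified illustration that compares file size and modification time between two passes; it is not how rsync or any particular product decides what to send, and the function names are hypothetical.

```python
import os
import tempfile

def snapshot(root):
    """Map each relative file path under root to (size, modification time in ns)."""
    state = {}
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            info = os.stat(full)
            state[os.path.relpath(full, root)] = (info.st_size, info.st_mtime_ns)
    return state

def changed_files(before, after):
    """Files that are new, or whose size or mtime differ, since the last sweep."""
    return sorted(path for path, meta in after.items() if before.get(path) != meta)

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "a.txt"), "w") as f:
        f.write("one")
    first = snapshot(root)
    with open(os.path.join(root, "b.txt"), "w") as f:
        f.write("two")                # a new file appears
    with open(os.path.join(root, "a.txt"), "w") as f:
        f.write("one plus more")      # an existing file grows
    to_copy = changed_files(first, snapshot(root))
    print(to_copy)
```

The sweep only sees what has reached the disk. A file held open by a database, with changes still in memory, gets copied as it sits, which is where the open-files question comes from.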

With Hypervisor Replication (for virtualized environments) – The hypervisor sits underneath the guest operating system(s).  This has the advantage of catching the writes from the guest.  To write to disk, the guest has to write to the hypervisor and then the hypervisor writes to disk.  The data can then be duplicated at this level.  For this method, the inevitable question is:  “What about my physical machines that haven’t been virtualized?”  One can still have issues with non-quiesced apps.

With Storage Array Replication – The array sends to its peer at the remote site.  The advantage of this method is that it can handle virtual or physical environments.  The disadvantage is that an array on the remote side must match the protected site.  This can be problematic, as frequently, sophisticated arrays with this technology tend to be costly.

Regardless of the methodology, there has to be coordination between the two sides.  That coordination tends to be handled by having the facility at each end provided by the same vendor:  e.g. SQL Server to SQL Server, VMware vSphere Replication appliances to vSphere Replication appliances, or a storage vendor like NetApp to NetApp.  [I’m ignoring Storage Hypervisors and just bundling those into the Storage Array layer.]

Every method has its own set of rules.  Imagine I had a document where I wanted a copy across town.  I could:

  1. walk it over myself – but I would have to follow the rules of the road and not cross against red lights
  2. e-mail it – but I would need to know the e-mail address and the recipient would need to have e-mail to receive it.
  3. fax it – but I would need a fax machine and the recipient would need a fax machine and I would need to dial the number of the recipient’s fax machine.
  4. send it with US postal – but I would need to fill out the envelope appropriately, with name, address, city, state.
  5. call the recipient and dictate it – but I would need the phone number and the recipient needs a phone.  I would also need to speak the same language as the person on the other end of the phone.

These simplistic examples, which we use every day, illustrate that there are certain norms we are so familiar with that we forget they are norms.  But, without these norms, the communication could not be done.  Likewise, users should not be surprised that each data replication method has its own requirements and that the vendors are picky as they wish to define the method of communication.


Rev 1.0 06/25/13