Bandwidth v. Latency or Highway Lanes v. Distance

Green Ball
Today, the Fibre Channel Industry Association announced its Gen 6 spec, which runs at 128 gigabits per second. For many, that sentence may not make any sense. For the benefit of my readers, I really do try to simplify things.

So, let me start over. Connectivity to disk or connectivity over the network is measured in gigabits per second (Gbps); it used to be megabits per second (Mbps). An important note: these are "bits", not "bytes", which are 8 times larger. So, when you hear a network is 1Gb, that is 1 gigabit per second, which works out to 125 megabytes per second. Networks tend to be measured in bits, as opposed to files, which are measured in bytes.
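
Here is a minimal sketch of that conversion in Python (assuming the usual 8 bits per byte and decimal network units, and ignoring protocol overhead):

    # Convert a link speed in gigabits per second to megabytes per second.
    def gbps_to_megabytes_per_sec(gbps: float) -> float:
        bits_per_second = gbps * 1_000_000_000   # network "giga" is decimal
        bytes_per_second = bits_per_second / 8   # 8 bits per byte
        return bytes_per_second / 1_000_000      # megabytes per second

    print(gbps_to_megabytes_per_sec(1))    # 1Gb Ethernet  -> 125.0 MB/s
    print(gbps_to_megabytes_per_sec(10))   # 10Gb Ethernet -> 1250.0 MB/s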

Now that I've gone on a tangent, let me come back. So, when you hear that you have a 1Gb Ethernet port on your PC or laptop, or that your wireless router does 54Mb, that is the size of the pipe. It does not mean that you can actually push that much data. And the rated speed of the network isn't the only thing that matters.

As an example, I was working with someone recently who was trying to compare 10Gb Ethernet to 8Gb Fibre Channel (both Ethernet and Fibre Channel are networking protocols; the latter is used exclusively for disk access). "Well, 10 is greater than 8, so that is better." I responded, "Well, I don't know if you are capable of pushing data that fast." (I didn't even want to get into the discussion of potential redundant pipes.)

Bandwidth matters less when the network is dedicated to a server than when the network is shared. "Oh, we have <blank> Gb for our network, why are things so slow?" The best way to explain a shared network is to think in terms of lanes on a highway. Just because the highway has more lanes doesn't mean I get to drive faster. On slower networks with multiple users, there is more possibility of congestion, and in that case greater bandwidth will make things go faster.

But if there is no congestion, there are still limits to how fast packets go from point A to point B. That is "latency". If I want to go from one part of town to another, it doesn't matter whether there is light traffic or no traffic; it still takes time to get from A to B. If there is lots of traffic, then yes, the lanes do matter. That is when, and why, more bandwidth feels better.
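
To see how the two interact, here is a back-of-the-envelope model in Python (a sketch, assuming a single stream, one round trip of latency, and no congestion or protocol overhead): total transfer time is roughly the latency plus the data size divided by the bandwidth.

    # Rough transfer-time model: latency + size / bandwidth (single stream, no overhead).
    def transfer_time_seconds(size_bytes: float, bandwidth_gbps: float, latency_ms: float) -> float:
        bandwidth_bytes_per_sec = bandwidth_gbps * 1e9 / 8
        return latency_ms / 1000 + size_bytes / bandwidth_bytes_per_sec

    # A small 4KB read is dominated by latency; a 10GB copy is dominated by bandwidth.
    print(transfer_time_seconds(4 * 1024, 10, 5))     # ~0.005 s (latency-bound)
    print(transfer_time_seconds(10 * 1e9, 10, 5))     # ~8 s     (bandwidth-bound)

A faster pipe barely changes the first case, which is why upgrading bandwidth doesn't fix latency-bound workloads.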


The lesson is not to confuse bandwidth with latency. It also means that one shouldn't get too excited upon hearing about new network speeds.

Jim – 02/12/14

@itbycrayon

View Jim Surlow's profile on LinkedIn (I don’t accept general LinkedIn invites – but if you say you read my blog, it will change my mind)

IT Operational Excellence: Lone Ranger to NFL to CSI or is it marching band

Blue Square

In 1993, Frederik Wiersema, et al., wrote their Harvard Business Review piece on Customer Intimacy, Operational Excellence, and Product Leadership. IT Operations departments commonly focus on Operational Excellence, and Change Management tends to be a common thread for avoiding operational issues that arise during maintenance windows.

My intention was to quote statistics on human error during maintenance windows, but I found the statistics to be too specific to particular disciplines (e.g. telephony, data center). So, trust me when I say that it is easy to envision that managers would prefer less human error than average during maintenance windows or other types of change. Certainly, everyone wishes to avoid downtime. Microsoft did a good job explaining the types of downtime.

I used to hear stories of C-level execs saying after an outage, "The Navy trains cadets to operate nuclear submarines, so why can't we get IT professionals not to cause outages?"

Let me start with how bureaucracies are formed. Organizational maturity requires different skill sets. Until there is enough organizational size, knowledge is concentrated in a few individuals, and thus the Lone Rangers emerge (forgive the oxymoron of plural Lone Rangers).

Starting off, there needs to be an expert, a Lone Ranger, who might still be a jack-of-all-trades. "Hey, we need someone to do <blank>." At this point there isn't much operational rigor, as the organization probably is not too sophisticated. It is possible that the person responsible doesn't even write anything down; they just execute when need be. They evaluate the risk, evaluate the solution, and decide.

Next, another person is added to share responsibility for the technology. At this point, coordination may just be yelling over the cubicle wall: "Hey, I'm going to change this."

As more people are added, the change management becomes a bit more sophisticated, as multiple people need to be notified.

Then the enterprise becomes more complex with more users, more dependencies, and/or more interactions. So, change control now comes into play. The Lone Ranger mentality no longer works. "Is risk assessed properly?" "Who is responsible, and is that appropriate to their pay grade?"

Enter the CSI Lab Technician.

It could be after the environment has grown, or after the organization has entered a new audit scope, that significant operational rigor is added. When a company falls under audit scope, for instance Sarbanes-Oxley (SOX), the Payment Card Industry (PCI) standard, or the Health Insurance Portability and Accountability Act (HIPAA), then more rigor must be applied. Another body (usually the auditor) is trying to ensure that all the requirements are being performed to a certain standard.

In "CSI: Crime Scene Investigation", one sees the scientists in the lab analyzing trace evidence, usually under some pressure to analyze the sample because it is from the suspect in the interrogation room that they've been chasing all day. Well, in real life, I doubt that the lab techs know the names of the people they are sampling, because they need to maintain neutrality and not be biased; bias tends to get evidence thrown out in court, because there are legal standards. Also, to withstand legal scrutiny, there are standard procedures for handling evidence. For the chemist, there are standard procedures for how samples are placed under the microscope, so that they aren't dropped or contaminated.

I worked with a former chemist who transferred into IT. I'd want him to switch between Excel and Word. Rather than have them up simultaneously and task-switch between them, he would go through the same routine: File/Save. File/Close. File/Exit. Then open the next program. I could accept his concerns about RAM shortage given the vintage of his hardware, but I struggled to be patient. "You could just click the 'x' and it'll prompt you to save, then it will close it out." "Yes, but I feel more comfortable doing it this way." Adherence to procedure provided comfort.

Prior to this, I mentored two student workers. One was a Computer Science major, the other a Biology major. They were both very good. I was always entertained by handing them the same hard problem to solve. The computer science major was very intuitive in his problem solving, randomly trying different solutions based upon hunches and feel. The biology major would attack problems very sequentially, trying the most frequent solution to similar problems first, then the next, and so on.

In my experience, computer programmers and engineers are drawn to their careers because of the problem-solving aspect of the job. What has made them successful through college and the early part of their careers has been the Lone Ranger aspect: identify the problem quickly and solve it. But now, with rigorous change control, the organization is looking for methodical, repeatable, standardized solutions. There ends up being an incongruity between the personality of the typical IT worker and the job to be performed.

In The Leadership Pipeline: How to Build the Leadership Powered Company, Ram Charan, Steve Drotter, and Jim Noel discuss how, when individuals move from leadership tier to leadership tier (individual contributor to manager to director and higher), they need to utilize different skills at each tier and not rely on the skills that helped them succeed at the last one. In a similar vein, I posit that when significant changes come to an operating environment, IT workers and IT teams need to modify their skill sets to provide Operational Excellence.

When such changes are mandated, of course, it is important that teams be supplied with the resources necessary to be successful whether that be training or equipment.  And managers need to identify that the responsibilities have changed and communicate that to their staff accordingly.

Enter the football game

When one watches the NFL, it seems that even though these professionals are paid 6, 7, or 8 figures a year, you will still see dumb penalties. These players have probably played football since Pop Warner as youths, yet you still see the occasional 12-men-on-the-field penalty by the defense prior to a field goal attempt. How hard is it to get the right personnel on the field? Or how hard is it for the offensive line not to false start? They know the signal for the ball snap. So, mental errors by professionals still occur. [I drafted this before the last game by the AFC-leading Broncos, where they were caught with 12 men on the field 3 times! Once they avoided the penalty by calling a timeout before getting flagged.]

An NFL football game has changes on every play:  Different formations, different routes, and different yardage goals.  And during the snap count, maybe the quarterback changes the play because he doesn’t like the defense that he sees.  When things go bad after the snap, receivers may have to break off routes.  Lots of change – every single play.  And it doesn’t always go right.

Alternatively, there are the halftime routines.  For high school & college, there are the marching bands.  Everyone has their own place and may have unique music.  Zero improvising is required, as all of this is planned out ahead of time.  See this video for an example of the coordination required:  http://www.youtube.com/watch?v=DNe0ZUD19EE

Both the football game and the halftime routine require much practice. The difference is where improvising is required. The trick for Operational Excellence in IT is to ensure that maintenance windows have more rehearsal and less improvising, and that there is time to practice. That rehearsal and discipline may be contrary to the methodologies of some IT workers.

I also recognize that the discipline to rehearse and to duplicate environments is easier said than done: lab environments struggle to perfectly match production, simulated workloads are difficult to match as well, and testing time is hard to come by. However, those organizations that strive to drive human error out of their maintenance events decide it is better to spend on the resources ahead of time, as opposed to reacting after the fact and spending potentially just as many resources post mortem.

Jim – 12/16/13

@itbycrayon

View Jim Surlow's profile on LinkedIn (I don’t accept general LinkedIn invites – but if you say you read my blog, it will change my mind)

NetApp 7-mode ssh key config via CLI w/o NFS or CIFS

Double Black Diamond
Configuring NetApp to use SSH with keys without having the root volume holding /etc NFS exported or CIFS shared can be convoluted.

Before I get to the steps, let me list the assumptions:

  1. The steps below will be for a non-root user
  2. Root/Administrator privs are available to the user who is setting this up.
  3. The SSH key for the non-root user has already been generated on the client system.
  4. The SSH public key can be copied and pasted from something displaying the file (e.g. xterm or Notepad) into a shell window with a CLI login to the filer (e.g. xterm or PuTTY).

Basically, the trick is to set up the empty user directories, since there isn't a filer command to create directories. Obviously, with NFS or CIFS, the directories could be made fairly easily.

  1. Login into filer via CLI with appropriate privileges.
  2. # go into advanced mode
    • priv set advanced
  3. # find an empty directory using ls – in some cases, /home/http may be empty.
    • ls /home/http
  4. # check ndmpd status
    • ndmpd status
  5. # if ndmp is not on, turn it on.
    • ndmpd on
  6. # When using ndmpcopy, the shortcut of dropping /vol/<root volume> does not work for the destination
    • ndmpcopy /home/http /vol/<root volume>/etc/sshd/<username>
      ndmpcopy /home/http /vol/<root volume>/etc/sshd/<username>/.ssh
  7. # Create the text file with wrfile, paste the key(s) from your other window, and then press Ctrl-C
    • wrfile /vol/<root volume>/etc/sshd/<username>/.ssh/authorized_keys
  8. # if ndmpd was off, turn it off.
    • ndmpd off
  9. # ndmpd creates a restore_symboltable file.  For cleanliness, need to remove that.
    • rm /vol/<root volume>/etc/sshd/<username>/restore_symboltable
    • rm /vol/<root volume>/etc/sshd/<username>/.ssh/restore_symboltable

Shortcut (if a user has already been set up, then their ssh keys and directory structure can be copied, which saves some steps):
Warning: Technically, the permissions (unix or Windows ACLs) are going to follow with the ndmpcopy, so there is a security risk here, if /etc is NFS mounted or CIFS shared. Keep that in mind.

  1. # check ndmpd status
    • ndmpd status
  2. # if ndmp is not on, turn it on.
    • ndmpd on
  3. # When using ndmpcopy, the shortcut of dropping /vol/<root volume> does not work for the destination
    • ndmpcopy /vol/<root volume>/etc/sshd/<existing user with ssh keys> /vol/<root volume>/etc/sshd/<new ssh user>
  4. # Create the text file with wrfile, paste the key(s) from your other window, and then press Ctrl-C
    • wrfile /vol/<root volume>/etc/sshd/<new ssh username>/.ssh/authorized_keys
  5. # if ndmpd was off, turn it off.
    • ndmpd off
  6. # ndmpd creates a restore_symboltable file.  For cleanliness, need to remove that.
    • rm /vol/<root volume>/etc/sshd/<new ssh username>/restore_symboltable

Jim – 11/18/13

@itbycrayon

View Jim Surlow's profile on LinkedIn (I don’t accept general LinkedIn invites – but if you say you read my blog, it will change my mind)

3 Reasons why military veterans make good employees

Green Ball

Over the years, I have had the pleasure of working with numerous veterans of our armed forces. There are many experiences which I believe are common to the military that transfer over to civilian employment. I have not served in the military, but I believe my experience with former military colleagues and screening many job applicants over the past 20 years allows me to offer an opinion.
Frequently, employers look for experience in certain industries or environments: a software development environment, a service provider environment, a manufacturing environment, a sales environment, a financial services environment, an academic environment, etc.
So, what does a former soldier have to offer a civilian firm? [ Given that I’ve spent a majority of my years with IT firms, my explanation will be IT slanted ]
#3 Veterans have experience dealing with difficult people. Soldiers are trained to maintain composure in the face of drill sergeants and other superiors. That training is supposed to translate into the field, where the enemies are trying to instigate conflict. Do you think they are going to lose composure when an angry customer is yelling? Do you think that they are going to escalate conflict in the workplace?
I've seen two incidents where a manager was yelling at an employee and a response would have been justified. But in both cases, one involving a former Army private and the other an Army officer in the reserves, neither spoke a strong word that would have escalated the situation.
#2 Veterans show a loyalty to the team. In a sense, this is related to the former. The teammates make up the unit. As the saying goes, “there is no ‘I’ in team”. So, team success is important. In the military, if the guy who has your back isn’t there, your future won’t be so bright. Employers want employees to care about the company’s success. Directors and VPs want to see teams that are successful, not just individuals. Heroes are good, but companies want to know that they can execute without them. In addition, managers are concerned about team chemistry. Guys who aren’t interested in team success tend to work against team chemistry.
I worked with a manager who had previously come from the Air Force (if memory serves that was the branch). He was loyal to the staff he inherited. He backed them up. He assumed responsibility for the team’s performance and was intent on getting the team to function together.
#1 Veterans are resilient in the face of difficult times. In the workplace, change is frequent. In business, if you don't change, you will be out of business: refine the organization, race to market, respond to competitors, personnel changes, new regulations, buyouts, spinoffs, etc. Some of the changes, or even rumors of changes, can be overwhelming to staff. In business, projects can be started, stopped, and then restarted, or direction switched and switched back. Military staff are trained to prepare for change. Just as complete information may not be available to civilian staff, military staffers are used to having incomplete information and knowing that those above them may have more information on which to base decisions than what is made public. In addition, the conditions that soldiers are placed under are more stressful and more life-impacting than what happens in the typical civilian job.
I worked with a former Marine who was under a great deal of pressure to deliver. The project was important, it was behind on its timelines, and it had a fair amount of attention. He said something along the lines of: "Hey, compared to rolling in a tank through Fallujah (Iraq) and being shot at, this is pretty easy."
The net result is that veterans bring commitment without anxiety. That is a value to any organization.

Jim – 11/10/13
@itbycrayon

View Jim Surlow's profile on LinkedIn (I don’t accept general LinkedIn invites – but if you say you read my blog, it will change my mind)

How to deal with exponential growth rates? And how does this relate to cloud computing?

Double Black Diamond
What happens when demand exceeds the resources? Ah, raise prices. But sometimes that is not available as a solution. And sometimes demand spikes far more than expected.

Example: Back in the early 2000s, Netflix allowed renters to have 3 DVDs at a time, but some customers churned those 3 DVDs more frequently than average and more frequently than Netflix expected. So, they throttled those customers and put them at the back of the line (I dug up this reference). This also appears to have happened in their streaming business.

Another example: Your web site gets linked on a site that generates a ton of traffic (I should be so lucky). This piece says that the Drudge Report sent 30,000-50,000 hits per hour, bringing down the US Senate's web site. At 36,000 hits per hour, that is an average of 10 per second.

Network bandwidth tends to be the constrained resource. Another example, from AT&T: as a service provider, this piece says that 2% of their customers consume 20% of their network.

There are non-technical examples as well. The all-you-can-eat buffet is one. Some customers will consume significantly more than the average. (Unfortunately, I can’t find a youtube link to a commercial that VISA ran during the Olympics in the 80s or 90s where a sumo wrestler walks into a buffet – if you can find it for me, please reply).

Insurance companies deal with this as well. They try to spread out the risk: if an event were to occur (e.g. a hurricane), they don't want all their customers to be in a single area. Economists call this "adverse selection". "How do we diversify the risk so that those who file claims aren't the only ones paying in?"

How does this relate to computing? Well, quotas are an example. I used to run systems with home directory quotas. If I had 100GB but 1,000 users, I couldn't just divide it up evenly. I had about 500 users who didn't even need 1MB, but I had 5 who needed 10GB each. The remaining users who did need more than 1MB needed more than an even slice.

So, the disk space had to be “oversubscribed”. I then could have a situation where everyone stayed under quota, but I could still run out of disk space.
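
Here is a minimal sketch of that oversubscription math (the numbers are hypothetical, just to illustrate the gamble): the quotas handed out can sum to far more than the physical capacity, and the bet is that actual usage stays below it.

    # Oversubscription sketch: quota granted vs. physical capacity (illustrative numbers).
    capacity_gb = 100

    quotas_gb = [0.1] * 500 + [1] * 495 + [10] * 5    # hypothetical per-user quotas
    granted_gb = sum(quotas_gb)                        # 595 GB promised
    print(f"Granted {granted_gb:.0f} GB against {capacity_gb} GB of physical disk")
    print(f"Oversubscription ratio: {granted_gb / capacity_gb:.1f}x")

    # The system only breaks when actual usage, not granted quota, exceeds capacity.
    usage_gb = [0.001] * 500 + [0.05] * 495 + [10] * 5
    print(f"Actual usage: {sum(usage_gb):.0f} GB")     # ~75 GB, still fits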

Banks do this all the time. They have far less cash on hand in the bank than they have deposits. Banks compensate by having deposit insurance through the FDIC, which should prevent a run on the bank.

In computing, this happens on network bandwidth, disk space, and compute power. At deeper levels, this deals with IO. As CPUs get faster, disks become the bottleneck and not everyone can afford solid state disks to keep up with the IO demand.

The demand in a cloud computing environment would hopefully follow a normal distribution (a bell curve). But that is not always what occurs. Demand tends to follow an exponential curve.


As a result, if the demand cannot be quelled by price increases, then throttling must be implemented to prevent full consumption of the resources. There are many algorithms to choose from on the network side; likewise, there are algorithms for compute.

Given a cloud architecture where a VM runs on a host connected to a switch connected to storage backed by a disk pool of some sort, there are many places to introduce throttles. The diagram below uses a VMware & NetApp vFiler environment (it could be an SVM, aka vServer, as well): a VM on an ESX host, connected to an Ethernet switch, connected to the filer, which is split between the disk aggregate and a vFiler that pulls from the volume sitting on the aggregate, which in turn holds the file.

[Diagram: VM on ESX host → Ethernet switch → filer → vFiler → volume → aggregate → file]

Throttling at the switch may not do much good, as this would throttle all VMs on an ESX host or, if not filtering by IP, all ESX hosts. Throttling at the ESX server layer again affects multiple VMs. Imagine a single customer on one or many VMs. Likewise, throttling at the storage layer, specifically the vFiler, may impact multiple VMs. The logical thing to do for the greatest granularity would be to throttle at the VM or vmdk level; basically, throttle at the endpoints. Since a VM could have multiple vmdks, it is probably best to throttle at the VM level. (NetApp Clustered ONTAP 8.2 allows throttles at the file level.) Not to favor NetApp: other vendors (e.g. EMC, SolidFire) who are introducing QoS are doing it at the LUN layer (they tend to be block vendors).

For manual throttling, some isolate the workloads to specific equipment, whether compute, network, or disk. When I worked at the University of California, Irvine, and we saw the dorms coming online with Ethernet to the rooms, I joked that we should drive their traffic through our slowest routers, as we feared they would bury the core network.

The question is what type of throttle algorithm would be best. Since starving the main consumers down to zero throughput is not acceptable, following a network model may be preferred. Something like a weighted fair queueing algorithm may be the most reasonable, though a simpler proposition would be to revert to the quota model used for disk space: just set higher thresholds for most users, which will not eliminate every problem, but will cover the majority. For extra credit (and maybe a headache), read this option, which was a network solution that also maximizes throughput.
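
As a rough illustration of the fairness idea, here is a weighted max-min fair share sketch in Python (this is not NetApp's or VMware's actual QoS implementation; the VM names, weights, and IOPS numbers are made up). Each VM is entitled to a share of the capacity in proportion to its weight; VMs that ask for less than their entitlement return the surplus to the pool, so heavy consumers get capped rather than starved.

    # Weighted max-min fair share of IOPS across VMs (illustrative sketch).
    def fair_share(capacity, demands, weights):
        """demands/weights are dicts keyed by VM name; returns allocated IOPS per VM."""
        alloc = {}
        remaining = dict(demands)
        while remaining:
            total_weight = sum(weights[vm] for vm in remaining)
            share = {vm: capacity * weights[vm] / total_weight for vm in remaining}
            # VMs whose demand fits within their weighted share get it in full.
            satisfied = [vm for vm in remaining if remaining[vm] <= share[vm]]
            if not satisfied:
                alloc.update(share)            # everyone left is capped at their share
                return alloc
            for vm in satisfied:
                alloc[vm] = remaining[vm]
                capacity -= remaining[vm]
                del remaining[vm]
        return alloc

    demands = {"vm-small": 500, "vm-medium": 3000, "vm-hog": 20000}   # requested IOPS
    weights = {"vm-small": 1, "vm-medium": 1, "vm-hog": 2}
    print(fair_share(10000, demands, weights))
    # vm-small and vm-medium get their full demand; vm-hog is capped at the remaining 6500.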

Jim – 11/03/13
@itbycrayon

View Jim Surlow's profile on LinkedIn (I don’t accept general LinkedIn invites – but if you say you read my blog, I’ll accept)

3 Reasons why techies hate being summoned to meetings run by non-technical staff

Green Ball

Techies tend not to enjoy meetings, especially those run by non-technical staff, and even more so when they are summoned to them: "Oh, we need you at that meeting." Here are 3 explanations as to why:

#3 – Communication Gap – Lingo, jargon, idioms, whatever, are somewhat localized to the technical staff. Frequently, the non-technical person calling the meeting doesn't understand the lingo. So, there is a communication gap. I've been on conference calls with peers in other countries where there were language barriers. I've been in meetings with people on the same team and there have been language barriers. System Administrators, Engineers & Architects speak different languages than Salespeople, Project Managers, and Execs. This sometimes makes meetings painful, as people talk past each other, and after lengthy dialogue the meetings run longer.

#2 – Negotiation v. Conversation – Questions come in that say, "Isn't it possible to do <blank>?" And the answer is "Yes, but…" Unless the conditional phrase is put into terms that the audience can really grasp, the condition isn't really heard. If the solution proceeds and negative consequences result, then the assumption is that the warnings were ignored. Meanwhile, it is noted that the "expert" was in the room. So, blame ends up being the result. The "can't we do" question is a negotiation from the ones who need the solution; the "yes, but" answer tends to be a conversation. Since there usually is a technical solution to most problems, and the question typically is interpreted as asking what is possible, the answer is almost always "yes, but". The answer should tend toward "No, unless you have more dollars in the budget" or "No, unless you have more labor to provide me."

#1 – Information Direction – Technical staff either have to research solutions or execute those solutions. They are information producers. “The solution will look like this … ” or “It will take this long to run…” While non-technical staff tend toward being information consumers – maybe they are decision makers (managers or executives) or maybe they are project managers needing to setup schedules. So, they need the technical staff to provide the information to the other stakeholders. While they are in these meetings waiting to supply information, they can’t be off “doing their job” of researching solutions or executing on those solutions. It is especially painful when they are in the meetings, waiting to contribute, and the question arises, “why are you so far behind?” or “when will it be done?” Reminds me of a Dilbert comic where the Pointy-Haired-Boss asks Dilbert for daily status updates as to why he is so far behind. Apparently, Scott Adams has two on the topic.

I focused this post on technical staff at non-technical meetings. When technical staff are at technical meetings, there tend not to be communication gaps nor negotiations and the information direction changes where they can also be information consumers rather than sole providers.

How do you make the meetings more effective?

#1 – Translate the information – Try to drive the information toward the stakeholder's concern; try to get a translation. Move the conversation from "if this happens, the port is down" to "if this happens, the customers can't get data", or from "if this solution doesn't work, I'll need to go back to the drawing board" to "if the solution doesn't work, I'll probably need another 2 months to find another way."

#2 – Detail the requirements and/or assumptions – Instead of "can't we do this?", it should be rephrased as "can't we do this with the existing budget, the existing schedule, and the existing staff?" (or whatever adjustments to one or all of the 3). Detail the meeting assumptions: at the meeting, I'm looking for "information to make a decision", "information so that all the attendees have the same base of information", "timelines of execution", or "proof of the information that is presented by <blank>".

Jim – 10/21/13
@itbycrayon

View Jim Surlow's profile on LinkedIn (I don’t accept general LinkedIn invites – but if you say you read my blog, it will change my mind)

Communicating Company Vision

Green Ball
One of my top takeaways from business school was not from one of my MBA classes specifically, but from an article I read related to a study of Apple. Former CEO John Sculley clearly and succinctly articulated what type of company Apple was/is:
“Apple is a designers company, not an engineers company.”

It is important that the executives of a company know who they are and that this identity can be easily consumed by the employees. That way, everyone knows who they are and can be aligned on the company direction.

This week, listening to NetApp CEO Tom Georgens at their partner conference, I heard a fantastic speech. (I tend not to be enamored with CEO speeches, but this one was impressive enough to justify a blog post.)

He stated what NetApp is as a company. He continued with what is important to the company in order to drive the stock price. On top of that, he stated what is important to NetApp's customers. Then, he identified what NetApp must do to meet those needs.

What was great about this speech is that employees and partners heard who they are, what their customers need, how they are going to meet those needs, and how that will drive the stock price. The speech was heavy with detail, which reinforced the points above and potentially overshadowed them, but the simplicity of those key points was what mattered.

Jim – 10/08/13
@itbycrayon

View Jim Surlow's profile on LinkedIn (I don’t accept general LinkedIn invites – but if you say you read my blog, it will change my mind)

NetApp de-dupe, flexclone, fragmentation, and reallocation

Double Black Diamond
Since NetApp has its own filesystem, WAFL, disk layouts and fragmentation behave differently than on traditional filesystems. Before NetApp introduced de-duplication, fragmentation was less common. Real quickly, let me deviate a bit and describe some notions of WAFL:

WAFL stands for "Write Anywhere File Layout". The notion is that if inodes can be saved alongside the file data, then there doesn't need to be a file allocation table or inode table at one part of the volume and data at another.

Another goal was to minimize disk head seeks, so data written through a RAID set is striped across all the disks in the set at essentially the same sectors. If I have 4 data disks and a parity disk (keeping it simple here), the stripe of data would be 25% on one disk, 25% on the next, and so on, with parity on the last. Assuming each disk has 100 sectors (for ease of calculation), I could instead put data on disk one at sectors 0-24, on the next disk at 25-49, on the next at 10-34, and on the next at 95-100 and 0-3, then 88, then 87, then 5-10, etc. Well, that would be rather slow, as the heads jump around the disk platters searching for those sectors. NetApp tries to keep all the data in one spot on each disk.

So, data is now assembled in nice stripes, and if the writes are large, a file takes up nice consecutive blocks in the stripe. The writes are in 4k blocks, but with a much larger file, they line up nicely and sequentially.

From their Technical Report on the subject, they state that the goal is to line up data for nice sequential reads. So, when a read request takes place, a group of blocks can be pre-fetched in anticipation of serving up a larger sequence.

Well, with NetApp FlexClone (virtual clones) and NetApp ASIS (de-duplication), the data gets a bit more jumbled. (Same with writes of non-contiguous blocks). Here’s why:

If I have data that doesn't live together, holes get created. Speaking of non-contiguous blocks, I've worked in environments with Oracle databases over NFS on NetApp. When Oracle writes a group of blocks, those may not be logically grouped (the way a text file might be). So, after some of the data is aged off, holes are created, as block 1 might be kept but not blocks 2 & 3.

With FlexClone, a snapshot is made writeable, and then future writes to the volume are tracked separately and may get aged off separately. Clearer still is the notion of de-duplication. NetApp de-dupe is post-process, so it accepts the entire write of a file and then later compares the contents of that file with other files on the system. It will then remove any duplicated blocks.

So, if I write a file where blocks 2, 3, & 4 match another file and blocks 1 & 5 do not, then when the de-dupe process sweeps the file, it is going to remove blocks 2, 3, & 4. That stripe now has holes, and that will impact the pre-fetch operation on the next read.
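
To make the post-process idea concrete, here is a minimal sketch of block-level de-duplication in Python (not NetApp's actual implementation, just the concept): split each file into 4k blocks, hash every block, and keep only one physical copy of any block whose hash has been seen before.

    import hashlib

    BLOCK_SIZE = 4096  # 4k WAFL block size

    def dedupe(files):
        """Post-process sweep over {name: bytes}; returns the block store and per-file references."""
        store = {}    # hash -> the single physical copy of that block
        layout = {}   # file name -> list of block hashes (references)
        for name, data in files.items():
            refs = []
            for i in range(0, len(data), BLOCK_SIZE):
                block = data[i:i + BLOCK_SIZE]
                digest = hashlib.sha256(block).hexdigest()
                store.setdefault(digest, block)   # keep only the first physical copy
                refs.append(digest)
            layout[name] = refs
        return store, layout

    def blk(x):
        return bytes([x]) * BLOCK_SIZE

    # Two 5-block files sharing blocks 2, 3, & 4: only 7 physical blocks remain after the sweep.
    file_a = b"".join(blk(x) for x in [1, 2, 3, 4, 5])
    file_b = b"".join(blk(x) for x in [6, 2, 3, 4, 7])
    store, layout = dedupe({"file_a": file_a, "file_b": file_b})
    print(len(store))   # 7 unique blocks instead of 10

The physical blocks that file_b shares with file_a are the "holes" in file_b's stripe; a sequential read of file_b now has to jump to wherever those shared blocks physically live.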

In the past, I only ran the reallocate command when I added disks to an aggregate, but now, while searching through performance-enhancing methods on NetApp, I've realized the impact of de-dupe on fragmentation.

Jim – 09/16/13
@itbycrayon

View Jim Surlow's profile on LinkedIn (I don’t accept general LinkedIn invites – but if you say you read my blog, it will change my mind)

VMware Linked Clones & NetApp Partial writes fun


Double Black Diamond
NetApp ONTAP writes data in 4k blocks. As long as the writes to NetApp are in 4k increments, all is good.

Let me step back. Given all the fun that I’ve experienced recently, I am going to alter my topic schedule to bring this topic forward while it is hot.

One more step back: the environment consists of VMware ESX hosts getting their storage via NFS from NetApp storage. When data is not written to a storage system in increments matching its block size, misalignment occurs. For NetApp, that block size is 4k. If I write 32k, that breaks nicely into 8 4k blocks. If I write 10k, it doesn't, as that ends up being 2 and a half blocks.


The misalignment problem has been well documented. VMware has a doc. NetApp has a doc. Other vendors (e.g. HP/3PAR and EMC) reference alignment problems in their docs. The problem is well known and easily googled. With misalignment, more read & write operations are required because the guest's blocks do not line up with the underlying blocks going out to storage.
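
A quick way to see the cost is to count how many 4k backend blocks a guest write touches (a back-of-the-envelope sketch, not any vendor's tool): an aligned 4k write touches exactly one backend block, while the same write shifted by 512 bytes straddles two, roughly doubling the backend I/O.

    BLOCK = 4096  # 4k backend block size in bytes

    def backend_blocks_touched(offset: int, size: int) -> int:
        """Number of 4k backend blocks covered by a write at a given byte offset."""
        first = offset // BLOCK
        last = (offset + size - 1) // BLOCK
        return last - first + 1

    print(backend_blocks_touched(0, 4096))     # aligned 4k write      -> 1 block
    print(backend_blocks_touched(512, 4096))   # misaligned 4k write   -> 2 blocks
    print(backend_blocks_touched(4096, 512))   # aligned partial write -> 1 block (but still read-modify-write)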

And yay! VMware addresses it in the VMFS-5 file system by making the blocks 1MB in size. That divides up nicely. And Microsoft, with Windows 2008, changed the default partition starting offset, which helped alignment.

So, all our problems are gone, right??

NO.

VMware introduced linked clones, which have a grain size of 512 bytes (see Cormac Hogan's blog).

Once this issue is discovered, you end up reading more of Cormac's blog, and then maybe some of Duncan Epping's, and maybe some of Jack McLeod's, not to mention Knowledge Base articles from both VMware & NetApp. The recommended solution is to use VAAI and let NetApp handle clones on the backend. And these 512-byte writes are technically "partial" writes, not "misaligned" ones.

If everything is aligned, then a partial write requires 1 disk read operation (of 4k), an instruction to wedge the 512 bytes appropriately into the 4k block, and 1 write back out. If misalignment exists, it requires twice the I/O operations.

However, if you look at nfsstat -d, you'll notice that there are a whole bunch of packets in the 0-511 byte range. Wait! I have partial writes; those show up in the 512-1k bucket. What are all these?

At this point, I don't entirely know (gee Jim, great piece, no answers), but according to VMware KB 1007909, VMware NFS does 84-byte (84!?) writes for NFS locking. Given the count in my 0-511 byte bucket, NFS locking can't account for all of those. But what does this do to NetApp's 4k blocks?

Jim – 08/30/13
@itbycrayon

View Jim Surlow's profile on LinkedIn (I don’t accept general LinkedIn invites – but if you say you read my blog, it will change my mind)

Migrating to the Cloud – Technical Concerns of migrating to an IaaS Cloud

Blue Square
The thought of migrating to the Cloud can be flippant or daunting, depending on where you sit on the optimist/pessimist scale. In reality, it is a matter of proportion to your environment.

In this week's post, I'm talking specifically about an Infrastructure-as-a-Service cloud: rather than having a physical presence, your goal is to move to the cloud so you don't have to care for all that hardware stuff.

My recommendation on where to begin is to ask the question: How would I migrate somewhere else?

It starts with: what services do I move 1st? When I worked at UC Irvine's School of Humanities in the early 90s, we had to move into a new building, and the finance staff needed to move 1st since they didn't want to get caught closing the books at the very time we had to evacuate the old modular building. So, a server had to go over there to provide the NetWare routing that we were doing between a classroom network and the office network (it was summer, so I didn't have to worry about student congestion on the network, though the empty classroom I put the server in fell victim to the painters unplugging it). After the office staff had moved, I could bring the office NetWare server over one evening. The important part of this story is that I needed networking to handle the people 1st.

Another move that I performed was similar. I had to move servers from downtown Denver to a new data center in south Denver. The users of those machines couldn't deal with the network latency, as our route went from downtown Denver to Massachusetts and back to south Denver. Those users had to move, then get new AD (Microsoft Active Directory) credentials and new security tokens. So, the important part here is that the users needed an authentication infrastructure 1st.

While moving some servers from Denver to Aurora, into a new facility for us, we again were concerned about latency, so we needed authentication and name services stood up in Aurora as well, so that authentication wouldn't have to cross the WAN.

My point from these anecdotes is that it is not just as simple as moving one OS instance. There are dependencies. Typically, those dependencies are infrastructure dependencies, and they typically exist so that latency can be avoided. [I haven't defined network latency, but for those who need an example, think of the TV interviews that occur when the anchor is in New York and the reporter is in the Middle East. The anchor asks a question, and the reporter has to wait for all the audio to reach him, while the viewer sees a pause in the conversation. That is network latency: the amount of time it takes to travel the "wire".]

Back to dependencies: I may need DNS (domain name service) at the new site, so that every time I look up itbycrayon.com, the server doesn't have to talk back to my local network to get that information. I may need authentication services (e.g. AD). I may need a network route outbound. I may need a database server. Now, these start to add up.

In my experience, there is a 1st wave – infrastructure services.

Then there is a 2nd wave – actual systems used by users. Typically, these are some guinea pigs which can endure the kinks being smoothed out.

Eventually, there are a bunch of systems that are all interrelated. This wave ends up being quite an undertaking, as this bulk of systems takes time to move and users are going to want minimal downtime.

Then after the final user wave, the final clean-up occurs – decommissioning the old infrastructure servers.
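
Here is a minimal sketch of ordering a migration by its dependencies (the service names and the dependency map are made up for illustration): a topological sort over "X depends on Y" naturally produces the waves described above, with infrastructure services landing first and dependent systems after.

    from graphlib import TopologicalSorter  # Python 3.9+

    # Hypothetical dependency map: service -> services it needs at the new site.
    deps = {
        "dns": set(),
        "active_directory": {"dns"},
        "database": {"dns", "active_directory"},
        "app_server": {"database", "active_directory"},
        "web_front_end": {"app_server", "dns"},
    }

    ts = TopologicalSorter(deps)
    ts.prepare()
    wave = 1
    while ts.is_active():
        ready = sorted(ts.get_ready())     # everything whose dependencies have already migrated
        print(f"Wave {wave}: {ready}")
        ts.done(*ready)
        wave += 1
    # Wave 1: ['dns']
    # Wave 2: ['active_directory']
    # Wave 3: ['database']
    # Wave 4: ['app_server']
    # Wave 5: ['web_front_end']

In practice the final user wave is the big interrelated cluster, but the same ordering exercise tells you which infrastructure pieces have to be standing before it can move.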


What I’ve presented is more about how to do a migration than a migration into the cloud. For the cloud, there may be additional steps depending on your provider – maybe you have to convert VMware VMs using OVFtool and then import.

VM portability eases the task. The underlying hardware tends to be irrelevant – as opposed to moving physical servers where there may be different driver stacks, different devices, etc. Obviously, one has to be cognizant of compatibility. If one is running IBM AIX, then one must find a cloud provider that supports this.

My point is that it is still a migration of how to get from A to B, and the high-level requirements remain the same (How is my data going to move: over the wire or by truck? What can I live with? What systems depend on what other systems?). The big difference between an IaaS cloud migration and a physical migration is that servers aren't moving from site A to site B, so there isn't the "swing gear" conversation or the "physical move" conversation. This is a migration onto pre-staged gear. The destination is ready. Figure out the transport requirements of your destination cloud and get going!