Problem calculating workloads on Storage, in this case NetApp

Double Black Diamond

With a centralized storage array, there can be front-side limitations (from the array out to the host or client) and back-side limitations (the actual disks in the storage array).

The problem is that, from the storage array’s point of view, the workloads at any given moment in time are random, and the details of those workloads are invisible to the array.  So, how to alleviate load on the array has to be determined from the client side, not the storage side.

Take for example a VMware environment with NFS storage on a NetApp array:

Each ESX host has some number of VMs and each ESX host is mounting the same export from the NetApp array.

 

Let IA(t) = The Storage Array’s front-side IOPS load at time t.
Let hn(t) = The IOPS generated by ESX host n at time t, where n = 1 … N and N = the number of ESX hosts.

 

The array’s front-side IOPS load at time t equals the sum of the IOPS loads of each ESX host at time t.

IA(t) = Σ hn(t)

 

An ESX host’s IOPS load at time t equals the sum of the IOPS of each VM on the host at time t.

h(t) = Σ VMn(t)

 

A VM’s IOPS load at time t equals the sum of the Read IOPS & Write IOPS on that VM at time t.

VM(t) = R(t) + W(t)
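
To make the aggregation concrete, here is a minimal Python sketch; the host count, VM counts, and per-VM read/write IOPS below are hypothetical numbers sampled at a single instant t, not measurements from any array.

```python
# Each inner list is one ESX host; each tuple is one VM's (read, write) IOPS at time t.
hosts = [
    [(120, 40), (80, 20)],           # ESX host 1: two VMs
    [(200, 100), (50, 10), (30, 5)]  # ESX host 2: three VMs
]

def vm_iops(read_iops, write_iops):
    """VM(t) = R(t) + W(t)"""
    return read_iops + write_iops

def host_iops(vms):
    """h(t) = sum of the IOPS of each VM on the host at time t"""
    return sum(vm_iops(r, w) for r, w in vms)

def array_front_side_iops(hosts):
    """IA(t) = sum of the IOPS load of each ESX host at time t"""
    return sum(host_iops(vms) for vms in hosts)

print(array_front_side_iops(hosts))  # 655 front-side IOPS for this made-up sample
```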

 

The Read IOPS are composed of well-formed reads and not-well-formed reads.  “Well-formed reads” are reads which will not incur a penalty on the back side of the storage array.  “Not-well-formed reads” will generate anywhere between 1 and 4 additional IOs on the back side of the storage array (see r2 through r5 below).

Let r1 = Well formed IOs

Let r2 = IOs which cause 1 additional IO on the back side of the array.

Let r3 = IOs which cause 2 additional IOs on the back side of the array.

Let r4 = IOs which cause 3 additional IOs on the back side of the array.

Let r5 = IOs which cause 4 additional IOs on the back side of the array.

Then

R(t) = ar1(t) + br2(t) + cr3(t) + dr4(t) + er5(t)

Where a+b+c+d+e = 100% and a>0, b>0, c>0, d>0, e>0

and

W(t) = fw1(t) + gw2(t) + hw3(t) + iw4(t) + jw5(t)

Where f+g+h+i+j = 100% and f>0, g>0, h>0, i>0, j>0

Now for the back-side IOPS (I’m ignoring block size here, which would just add a factor of array block size divided by IO block size into the equation).  The difference is that the additional IOs have to be accounted for:

Rback(t) = ar1(t) + 2br2(t) + 3cr3(t) + 4dr4(t) + 5er5(t)

and

Wback(t) = fw1(t) + 2gw2(t) + 3hw3(t) + 4iw4(t) + 5jw5(t)
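
As a rough illustration of the read-side pair of equations (the write side is analogous), here is a sketch with an assumed mix of read classes; the 1000 front-side IOPS and the fractions a–e are hypothetical, chosen only to show how the back-side load inflates.

```python
# Hypothetical front-side read rate and mix of read classes r1..r5.
total_read_iops = 1000
mix = {            # class k -> fraction of reads in that class (a..e, summing to 100%)
    1: 0.70,       # r1: well formed, no extra back-side IO
    2: 0.15,       # r2: 1 additional back-side IO
    3: 0.10,       # r3: 2 additional back-side IOs
    4: 0.04,       # r4: 3 additional back-side IOs
    5: 0.01,       # r5: 4 additional back-side IOs
}

# Front side: R(t) is just the weighted sum of the classes.
front_side = sum(frac * total_read_iops for frac in mix.values())

# Back side: class k costs k IOs (1 original + k-1 additional), per Rback(t) above.
back_side = sum(k * frac * total_read_iops for k, frac in mix.items())

print(front_side)  # 1000.0
print(back_side)   # 1510.0 -- the same reads cost roughly 1.5x on the back side
```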

Since the array cannot predetermine the values for a through j, it cannot determine the effect of the additional IO.  Likewise, it cannot determine whether the host(s) are going to be sending sequential or random IO.  The mix will trend toward random given n machines writing concurrently, since the likelihood of n-1 systems being quiet while 1 sends sequential IO is low.

Visibility into these workload behaviors has to come from the host side.

 

Jim – 10/01/14

@itbycrayon

View Jim Surlow's profile on LinkedIn (I don’t accept general LinkedIn invites – but if you say you read my blog, it will change my mind)


How often do disks fail these days?

Black Diamond

Good news & bad news & more bad news:  The good news is that there have been a couple of exhaustive studies on hard drive failure rates.  The bad news is that the studies are old.  More bad news:  With the growing acceptance of solid state drives, we may not see another study.

Google’s Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso wrote a great paper, “Failure Trends in a Large Disk Drive Population”, submitted to the USENIX FAST conference of ’07.  Reading this paper in today’s context, the takeaways are:

  • The study was on consumer-grade (aka SATA) drives.
  • Operating Temperature is not a good predictor of failure (i.e. it is not conclusive that drives operating in higher temp environments fail more frequently).
  • Disk utilization (wear) is not a good predictor of failure.
  • Failure rates over time graph more like a check-mark or fish-hook curve than a bathtub curve.

    Fish hook v. Bathtub curves

Carnegie Mellon’s Bianca Schroeder & Garth A. Gibson also published a paper, “Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?”, at USENIX’s FAST 2007 conference.  The takeaways from this paper are that their study covered FC, SCSI, & SATA disks and that in the first 5 years, it is reasonable to expect a 3% failure rate, again following a check-mark pattern.  After 5 years, the failure rates are, of course, more significant.
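
As a back-of-the-envelope sketch only: if one reads the ~3% figure as an annualized rate, here is what it implies for a hypothetical 200-drive population (the population size and the independence assumption are mine, not the papers’).

```python
# Hypothetical: 200 drives, ~3% annualized failure/replacement rate.
annual_rate = 0.03
drives = 200

expected_failures = annual_rate * drives   # expected failed drives per year
p_none = (1 - annual_rate) ** drives       # assumes failures are independent, which the
p_at_least_one = 1 - p_none                # correlation result below suggests is optimistic

print(round(expected_failures, 1))   # 6.0 drives per year
print(round(p_at_least_one, 3))      # 0.998 -- at least one failure a year is near certain
```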

Here’s the problem:  In 2007, they were analyzing disks, some of which were over 5 years old.  So, is there any expectation that these failure rates are good proxies for our disks of the last 6 years?  (The CMU paper has a list of the disks that were sampled, if the reader cares to see.)  It would be natural to assume that disk drive vendors have improved the durability of their products.  But maybe the insertion of SAS (Serial Attached SCSI) drives into the technology mix introduces similar failure rates, given that the technology is new(er).

Another study of note was a University of Illinois study by Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky submitted to the ’08 FAST conference, “Are Disks the Dominant Contributor for Storage Failures?”  Today’s takeaway is that when there is a single disk failure, another failure is likely to follow sooner than the odds would dictate (i.e. the events can be correlated).

This seems to hold today as well: the research presented here was prompted by cascading failures that initially stemmed from a disk failure.  So, the question is:  When can we expect another disk failure?

The research is authoritative and vendor-neutral, which would be helpful if the data were current.  Until then, this data needs to be used as a proxy for predictions – or one can use anecdotal information.  Sigh.

Jim

@itbycrayon
View Jim Surlow’s profile on LinkedIn

IOPS, Spinning Disk, and Performance – What’s the catch?

Black Diamond
For a quick introduction – IOPS means Input/Output operations per second.  Every hard drive has a certain IO performance.  So, forgive the oversimplification: add additional disks and one gets additional IOPS, which means one gets better performance.

Now, generally speaking, I hate IOPS as a performance characteristic.  I hate them because IOPS can be read or write, sequential or random, and of different IO sizes.  Unless one is trying to tune for a specific application and is dedicating specific disk drives to that application, the measurement breaks down, because the description of the assumed utilization lacks accuracy.  For instance, assume random reads & writes, but then the backups kick off and that ends up being a huge sequential read for a long duration.

But, I digress.

Every hard drive has an IOPS rating, whether SAS, SATA, or FibreChannel, or 7200, 10000 or 15000 RPM (see Wikipedia for a sample).  When a RAID set is established, drives of the same geometry (speed & size) are put together to stripe the data across the drives.  For simplicity’s sake, let’s say one uses a RAID5 set with 6 drives: that is, the capacity of 1 drive is used for error (parity) checking and 5 for data.  And continuing the example, assume that these are 1TB (terabyte) drives with 100 IOPS per drive.  So, one has 5 TB of capacity and 500 IOPS.  [Let’s imagine these are read IOPS and not write, so I don’t have to get into parity calculations, etc.]  If I could add a drive to the RAID set, then I get another TB and another 100 IOPS.  Nice and linear.
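
Here is a minimal sketch of that arithmetic, following the simplification above that the parity drive contributes neither usable capacity nor read IOPS; the 1TB / 100 IOPS figures are the same hypothetical numbers.

```python
DRIVE_TB = 1        # capacity per drive (hypothetical)
DRIVE_IOPS = 100    # read IOPS per drive (hypothetical)
PARITY_DRIVES = 1   # single-parity RAID5

def usable_tb(total_drives):
    return (total_drives - PARITY_DRIVES) * DRIVE_TB

def read_iops(total_drives):
    return (total_drives - PARITY_DRIVES) * DRIVE_IOPS

print(usable_tb(6), read_iops(6))   # 5 TB, 500 IOPS
print(usable_tb(7), read_iops(7))   # 6 TB, 600 IOPS -- each added drive: +1 TB, +100 IOPS
```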

And my IOPS per TB are constant.  [Again, to simplify, I’m going to assume the new drive falls in the same RAID set, so I don’t have to consider more parity drive space.]  So, none of this should be earth-shaking.

The huge implication here is:  to increase performance, add more disks.  The more disks, the more IOPS; everyone’s happy.  However, that assumes that consumption (and more importantly IOPS demand) has not increased.  The graph on the right looks consistent with the graphs that we saw earlier.

The problem here is that if one adds disks, which adds capacity, and then that capacity is consumed at the same IO rate as the original disk space, the performance curve looks like the graph on the left.  If I’m consuming 100 IOPS per TB and I have 5 TB, that is 500 IOPS of demand.  So, I add a 1TB disk and now I have 600 IOPS with 5TB of used capacity on 6TB of disk.  So, I can spread that out and yippee, those 5TBs can get 120 IOPS per TB.  But if I also say, “hey, I got another TB of disk space,” and then consume it, then I’m back to where I started and am still constrained at 100 IOPS/TB.  So, what good is this?
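
The same numbers, sketched out step by step (all hypothetical, continuing the example above):

```python
used_tb, total_iops = 5, 500
print(total_iops / used_tb)    # 100.0 IOPS per used TB -- the starting point

# Add one more 1 TB / 100 IOPS drive but do NOT consume the new space yet:
total_iops += 100
print(total_iops / used_tb)    # 120.0 IOPS per used TB -- looks better

# Now consume the extra terabyte at the same demand rate:
used_tb += 1
print(total_iops / used_tb)    # 100.0 IOPS per used TB -- right back where we started
```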

The assumption is that one is adding to a heterogeneous array, i.e. multi-purpose (maybe multi-user or multi-system).  By being multi-purpose, the usage curve should hopefully become more normalized.  If the usage is more homogeneous – e.g. everyone who needs fast performance gets moved from the slow array to the fast array – well, that just means that the fast users are competing with other fast users.

Just like on the NASCAR track for time trials: if I have one race car start and then send another race car when the 1st is halfway around the track, I’m probably not going to have contention.  If one customer wants high performance in the evening and the other during the business day, I probably have no contention.

However, on race day after the start, all the cars are congested and some can’t go as fast as they want because someone is slow in front of them – gee, and we moved them off the freeway onto the race track for just this reason.  Well, on the storage array, this is like everyone running end-of-the-month reports at, well, the end of the month.

I need another analogy for the heterogeneous use.  Imagine a road that one guy uses daily, but his neighbor only uses monthly.  However, the neighbor still needs use of a road, so he pays for the consumption as well.  Overall, there may not be conflict over the road resource – as opposed to if both used it daily.

So, yes, overall – adding disks does add performance capacity.  And without knowing usage characteristics, the generality of adding disks still holds.  Why?  Because no one complains that the disks are going too fast; they only complain when they are too slow.  There is still the mindset that one buys disk for capacity and not for performance.  And then once performance is an issue, the complaints start.  So, adding disks to a random workload means that the bell curve should get smoother overall.  This won’t end all the headaches, but it should minimize them by minimizing the number of potential conflicts.

Let me know what you think
Jim
@itbycrayon

Replication Methodologies

Blue Square

Frequently, when one tries to determine a disaster recovery transfer methodology, one gets stuck on the different methods available, as they each have their strengths and weaknesses.  I’ll ignore backups for the time being – backup images can be tapes shipped by truck from the primary site to the recovery site, or backups replicated by backup appliances.  The focus will be on replication of “immediately” usable data sent over the network from the primary site to the recovery site.

Types of replication:

  • Application Replication – Using an application to move data.  This could be Active Directory, Microsoft SQL Server, Oracle DataGuard, or other application specific methods
  • OS Replication – This typically uses some application within the operating system to ship files … this could be something as simple as rsync
  • Hypervisor Replication – For those that run virtualized environments, replication occurs at the hypervisor layer.  This could be VMware vSphere Replication
  • Storage Replication – This is when an array replicates the data.


With Application Replication – one can be confident that the application is replicated in a usable form.  Without quiescing the applications, applications may not be entirely recoverable.  (Quiescing is the process of placing the application in a usable state – usually, this consists of flushing all data in RAM to disk.)  However, the question that inevitably comes up is:  “What about all the other systems I will need at the recovery site?”

With OS Replication – the application that moves the data tends to be fairly simple to understand: copy this directory structure (or folder) to the receiving system and place it in such-and-such folder.  However, with this methodology, the question that arises is:  “What about the registry and open files?”  If the process just sweeps the filesystem for changes, what happens to files that are open by some other application (i.e. not quiesced – this could be a problem with database servers)?
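
As an illustration of the OS-level approach (not a recommendation), here is a minimal sketch that drives rsync from Python; the source directory and recovery-site host are hypothetical, and note that it does nothing to quiesce files that are open and changing mid-sweep.

```python
import subprocess

SRC = "/data/app/"              # hypothetical source directory on the protected host
DEST = "dr-host:/data/app/"     # hypothetical target at the recovery site (over SSH)

# -a preserves permissions/ownership/timestamps, -z compresses over the wire,
# --delete removes files at the target that no longer exist at the source.
# Open, actively changing files (e.g. database files) may land in an
# inconsistent, non-quiesced state.
subprocess.run(["rsync", "-az", "--delete", SRC, DEST], check=True)
```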

With Hypervisor Replication (for virtualized environments) – the hypervisor sits underneath the guest operating system(s).  This has the advantage of catching the writes from the guest: to write to disk, the guest has to write to the hypervisor, and then the hypervisor writes to disk.  The data can then be duplicated at this level.  For this method, the inevitable question is:  “What about my physical machines that haven’t been virtualized?”  One can still have issues with non-quiesced apps.

With Storage Array Replication – the array sends data to its peer at the remote site.  The advantage of this method is that it can handle virtual or physical environments.  The disadvantage is that an array on the remote side must match the one at the protected site.  This can be problematic, as sophisticated arrays with this technology tend to be costly.

Regardless of the methodology, there has to be coordination between the two sides.  The coordination tends to be handled by having the facility at each end provided by the same vendor:  e.g. SQL Server to SQL Server, VMware vSphere Replication appliances to vSphere Replication appliances, or a storage vendor like NetApp to NetApp.  [I’m ignoring storage hypervisors and just bundling those into the storage array layer.]

Every method has its own set of rules.  Imagine I had a document that I wanted copied across town.  I could:

  1. walk it over myself – but I would have to follow the rules of the road and not cross against red lights
  2. e-mail it – but I would need to know the e-mail address and the recipient would need to have e-mail to receive it.
  3. fax it – but I would need a fax machine and the recipient would need a fax machine and I would need to dial the number of the recipient’s fax machine.
  4. send it with US postal – but I would need to fill out the envelope appropriately, with name, address, city, state.
  5. call the recipient and dictate it – but I would need the phone number and the recipient needs a phone.  I would also need to speak the same language as the person on the other end of the phone.

These simplistic examples that we use every day are used here to illustrate that there are certain norms that we are so familiar with that we forget they are norms.  But without these norms, the communication could not happen.  Likewise, users should not be surprised that each data replication method has its own requirements and that the vendors are picky, as they wish to define the method of communication.

Jim

Rev 1.0 06/25/13