IOPS, Spinning Disk, and Performance – What’s the catch?

Black Diamond
For a quick introduction – IOPS means Input Output (operations) per second.  Every hard drive has certain IO performance.  So, forgive the oversimplification, add additional disks, one gets additional IOPS which means one gets better performance.

Now, generally speaking, I hate IOPS as a performance characteristic.  I hate them, because, IOPS can be read or write and sequential or random and of different IO sizes.  Unless one is trying to tune for a specific application and is dedicating specific disk drives to the application, the measurement breaks down as the description of the assumed utilization lacks accuracy.  For instance, assume that it has random reads & writes, but then the backups kick off and that ends up being a huge sequential read for long durations.

But, I digress.

20130625-211646.jpg Every hard drive has an IOPS rating whether SAS, SATA, or FibreChannel or 7200, 10000 or 15000 RPM.  (see wikipedia for a sample).  When a RAID set is established, drives of the same geometry (speed & size) are put together to stripe the data across the drives. For simplicity sake, lets say one uses a RAID5 set with 6 drives:  that is the capacity of 1 drive is used for error (parity) checking and 5 for data.  And continuing the example, assume that these are 1TB (terabyte) drives with 100 IOPS per drive.  So, one has 5 TB of capacity and 500 IOPS.  [Let’s imagine these are read IOPS and not write, so I don’t have to get into parity calculations, etc. etc.].    If I could add a drive to the RAID set, then I get another TB and another 100 IOPS.  Nice and linear.

And, my IOPS per TB are constant.  [Again, to simplify, I’m going to assume that it falls in the same RAID set and so I don’t have to consider more parity drive space].  So, none of this should be earth shaking.

The huge implication here is:  To increase performance, add more disks.   The more disks, the more IOPS, everyone’s happy.  However, that assumes that consumption (and more importantly IOPS demand) has not increased.  The graph on the right looks consistent with the graphs that we saw earlier.

20130625-211653.jpgThe problem here is that if one adds disks, which adds capacity, and then that capacity is consumed at the same IO rate as the original disk space, the performance curve looks like the graph on the left.  If I’m consuming 100 IOPS per TB and I have 5 TB, that is 500 IOPS of demand.  So, I add a 1TB disk and now I have 600 IOPS w/5TB of used capacity on 6TB of disk.  So, I can spread that out and yippie, those 5TBs can get 120IOPS.  But, if I also say, “hey, I got another TB of disk space” and then consume it, then I’m back to where I started and am still constrained at 100IOPS/TB.  So, what good is this?

20130625-211659.jpgThe assumption is that one is adding to a heterogenous array i.e. multi-purpose (maybe multi-user or multi-system).  So, by being multi-purpose, the usage curve should hopefully become more normalized.  If the usage is more homogenous, e.g. everyone who needs fast performance, so we move them from the slow array to the fast array – well that just means that the fast users are competing with other fast users.

Just like on the NASCAR track for time trials, if I have one race car start and then send another race car when the 1st is half way around the track, I’m probably not going to have contention.  If one customer wants high performance in the evening and the other in the business day, I probably have no contention.

However, on race day after the start, all the cars are congested and some can’t go as fast they want because someone is slow in front of them – gee, and we moved them off the freeway onto the race track for just this reason.   Well, on the storage array, this is like everyone running end-of-the-month reports, well, at the end-of-the month.

I need another analogy for the heterogenous use.  Imagine a road that one guy uses daily, but his neighbor only uses it monthly.  However, the neighbor still needs use of a road, so he pays for the consumption as well.  Overall, there may not be conflict for the road resource – as opposed to, if both used it daily.

So, yes, overall – adding disks does add performance capacity.  And without knowing usage characteristics, the generality of adding disks still holds.  Why?  Because no one complains that the disks are going too fast, they only complain when it is too slow.  There is still the mindset that one buys disk for capacity and not for performance.  And then once performance is an issue, the complaints start.  So, adding disks, to a random workload means that the bell curve should get smoother over all.  This won’t end all the headaches, but should minimize them by minimizing the number of potential conflicts.

Let me know what you think