Simplivity Storage – hunting VMs when low on storage.

With HPE Simplivity, each VM has storage mirrored on another specific host in the cluster. As an example, in a 3 host VMware cluster, a VM on ESXi host #1 will have a mirrored copy of the storage on ESXi host #2, but no other data on ESXi host #3. Unlike other HyperConverged Infrastructure products in the marketplace, Simplivity mirrors the VM as opposed to striping the VM across other nodes in the cluster.

This has its advantages and disadvantages.

One result is that each node in the cluster ends up with different storage consumption, as the blend of VMs on each host varies. In the following example:

  • ESXi Host #1 could have VMs: a, d, e, f
  • ESXi Host #2 could have VMs: a, b, c, e
  • ESXi Host #3 could have VMs: b, c, d, f

As a result, at any given time, 2 nodes will be closer to full than the remainder. “Closer” could be significant or insignificant, but the point is relevant to the command pipeline that I’ll include below.
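As a minimal sketch of the idea (not a Simplivity tool), the following lists which VMs each pair of hosts has in common, using the example placement above. The function name shared_vms and the "host vm vm ..." input format are made up for illustration.

```shell
# List the VMs that each pair of hosts has in common.
# Input: one line per host, "hostname vm1 vm2 ..." (hypothetical format).
shared_vms() {
  awk '
    {
      host[NR] = $1
      for (i = 2; i <= NF; i++) {
        has[$1, $i] = 1                      # this host holds a copy of this VM
        if (!($i in seen)) { seen[$i] = 1; vm[++nvm] = $i }
      }
    }
    END {
      for (x = 1; x <= NR; x++)
        for (y = x + 1; y <= NR; y++) {
          line = host[x] "+" host[y] ":"
          for (k = 1; k <= nvm; k++)
            if (((host[x], vm[k]) in has) && ((host[y], vm[k]) in has))
              line = line " " vm[k]
          print line
        }
    }'
}

# The example placement from the list above.
printf 'host1 a d e f\nhost2 a b c e\nhost3 b c d f\n' | shared_vms
```

If hosts 1 and 2 are the fullest, the VMs on the "host1+host2" line are the ones whose backups are worth looking at first.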

The Simplivity storage environment is controlled by what are called Omnistack Virtual Controllers (“OVCs” for short). Each ESXi host has a dedicated OVC, but the OVCs in a VMware cluster work together.

(I don’t know the Simplivity interface from Hyper-V, so what follows is VMware specific.) From vSphere, it is cumbersome to determine where Simplivity stores copies of a VM. Clearly, if ESXi Host 1 is nearing full, one will see a VMware alarm and one could go to the host and see the VMs with a primary copy of storage on that host – but what about a VM with a secondary (redundant mirrored) copy? There would have to be some deciphering of the DRS rules.

Simplivity offers a command line tool through the OVC which will list all the VMs in the cluster and where the primary and secondary copies are stored and the VM size: dsv-balance-manual

The drawbacks of this tool are: first, one can only see the size of the VM, one cannot see the size of the VM AND all its associated backups. Secondly, it does not report any remote backups copied from another cluster to the hosts, if any.

When storage runs tight, removing backups is the most likely path forward. Having a total size for a VM which would include its associated backups would be very helpful – but with de-duplication and compression across VMs on the host, and with backups expiring at different intervals, this becomes very difficult to calculate and would only be valid in real time.

The first step to gaining space would be to search the backups on the cluster to determine if there are any backups which lack an expiration date, and remove as necessary.

The second step might be to identify which VMs are shared on the 2 most full hosts.

It is fairly easy to eyeball the output of dsv-balance-manual, but when one runs it often and there are many VMs, human error can kick in. I wrote the following CLI command pipeline to do this:

node=(`sudo /var/tmp/build/dsv/dsv-balance-show --ShowNodeIndex | sed 's/\(.B \)/ \1/;s/\.\([0-9][0-9]\) TB/\10 GB/' | awk '/^\| Node [0-9]/ {print $3,$15}' | sort -nr +1 | awk 'BEGIN {z=10^6}{b=a;a=$1;if ($1 < z) {z=$1}} END {print a-z+1,b-z+1}'`) ; cl=(`svt-federation-show | awk -F"\| " '/Alive/ {if ($3 ~ /^[a-zA-Z]/) {x=$3};{a=x} ; if ($4 ~ /^[a-zA-Z]/) {y=$4};{b=y} ; print a,b,$9}' | grep \`ifconfig eth0 | awk '/inet/ {print $2}'\``) ; sudo /var/tmp/build/dsv/dsv-balance-manual --datacenter ${cl[0]} --cluster ${cl[1]} > /dev/null ; awk -F, -v n=${node[0]} -v m=${node[1]}  '/\]/ {offset=2 ; if (($(n+offset) ~ /s|p/) && ($(m+offset) ~ /s|p/)) print $(NF-2),$(NF-1)}' /tmp/balance/replica_distribution_file_${cl[0]}.csv | sort -n 

Had I known it would be so long, I probably would have written a script, but then the script would have to be pushed to all OVCs on all the hosts, and with the Simplivity upgrade procedure the OVCs would be wiped out and the scripts would have to be pushed again.


This needs to be run on an OVC in the cluster that is short on space (i.e. won’t work across clusters).

Create the node array – 2 entries with the 2 nodes in the cluster with the least available space.


Run the Simplivity command as root to determine which nodes lack space. The output will include space remaining (as opposed to consumed).

sudo /var/tmp/build/dsv/dsv-balance-show --ShowNodeIndex

Add a space between digits of storage and label of storage, and convert TB to GB by removing the decimal point and adding a zero

| sed 's/\(.B \)/ \1/;s/\.\([0-9][0-9]\) TB/\10 GB/'
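To see what the two sed substitutions do, here is a small sketch run against made-up sample lines. The real dsv-balance-show layout may differ; the function name to_gb is only for illustration. Note that the TB conversion only fires when the value has exactly two digits after the decimal point.

```shell
# The same two sed substitutions as above, wrapped in a function.
to_gb() {
  sed 's/\(.B \)/ \1/;s/\.\([0-9][0-9]\) TB/\10 GB/'
}

# Hypothetical sample lines, not real dsv-balance-show output.
# "1.25TB" becomes "1250 GB"; "800GB" becomes "800 GB".
printf '| Node 1 | 1.25TB |\n| Node 2 | 800GB |\n' | to_gb
```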

Find the lines with only details about the nodes (throw out the headers) and only print the node number and the space remaining (the label above is now discarded).

| awk '/^\| Node [0-9]/ {print $3,$15}'

Sort numerically by the second column (space remaining) in descending order, so the nodes with the least free space come last.

| sort -nr +1

Print the last 2 entries and reset the node number so that it counts from 1 – the output from the earlier Simplivity command, depending on retired equipment, might not start from 1. (Unsure how this behaves if run on a cluster with fewer than 2 hosts – but one would not need to run this script if there was only 1 host).

| awk 'BEGIN {z=10^6}{b=a;a=$1;if ($1 < z) {z=$1}} END {print a-z+1,b-z+1}'
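A sketch of the sort and renumbering steps against made-up "node free-space" pairs. It uses sort -k2,2nr, the current spelling of the older "+1" key syntax, which newer versions of sort may reject; fullest_two is a hypothetical name.

```shell
# Sort by free space (column 2) descending, then print the renumbered
# node positions of the last two lines, i.e. the two fullest nodes.
fullest_two() {
  sort -k2,2nr |
    awk 'BEGIN {z=10^6} {b=a; a=$1; if ($1 < z) {z=$1}} END {print a-z+1, b-z+1}'
}

# Hypothetical input: nodes numbered 2-4 (node 1 retired), free space in GB.
# Nodes 2 and 4 have the least free space and are renumbered to 1 and 3.
printf '2 300\n3 900\n4 500\n' | fullest_two
```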

This completes the array. The contents of the array are 2 numbers reflecting the nodes which have the least space available.


Set the datacenter and cluster variables. This is a lot of code to include what is already known, but will reduce human error of misspellings. Set the “cl” array (cluster).


Run the Simplivity command to show all the nodes in the federation.

svt-federation-show


Find only the lines that include nodes. Given the output from above, if the datacenter field (#3) is empty, print what was in the line before and if the cluster field (#4) is empty, print what was in the line before. Finally, only print datacenter, cluster, and management IP.

| awk -F"\| " '/Alive/ {if ($3 ~ /^[a-zA-Z]/) {x=$3};{a=x} ; if ($4 ~ /^[a-zA-Z]/) {y=$4};{b=y} ; print a,b,$9}'

Search the output of the above for the output of what follows.

| grep \`

Determine the management IP of this OVC.

ifconfig eth0 | awk '/inet/ {print $2}'\`

Finalize the array with datacenter, cluster, and IP (the latter won’t be used).

`) ;

Run the Simplivity command to list the VMs, adding the datacenter and cluster information so that it can run unattended. Dump the output to /dev/null, as an output file will be left behind.

sudo /var/tmp/build/dsv/dsv-balance-manual --datacenter ${cl[0]} --cluster ${cl[1]} > /dev/null ;

Parse the output file: /tmp/balance/replica_distribution_file_<cluster name>.csv Use the 2 variables, n & m, to represent the 2 nodes to search for. In the CSV, 2 other data points precede the node columns (hence the offset of 2), and each node column holds a “p” or “s” for a primary or secondary copy. The number of nodes will determine the number of columns. The 2nd to last column is the VM name, shown by $(NF-1). The 3rd to last column is the VM size, shown by $(NF-2).

awk -F, -v n=${node[0]} -v m=${node[1]} '/\]/ {offset=2 ; if (($(n+offset) ~ /s|p/) && ($(m+offset) ~ /s|p/)) print $(NF-2),$(NF-1)}' /tmp/balance/replica_distribution_file_${cl[0]}.csv

Then sort the output numerically in ascending fashion, so the largest VMs are at the bottom.

| sort -n 
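The parsing step can be tried out without an OVC. The sketch below runs the same awk filter against a made-up CSV whose layout only follows the description above (2 leading fields, one column per node, then size, VM name, and one trailing field). The real file may differ, and shared_on_nodes is a hypothetical name.

```shell
# Print "size name" for VMs with a copy (p or s) on both given nodes.
# $1, $2: 1-based node positions; $3: path to the CSV file.
shared_on_nodes() {
  awk -F, -v n="$1" -v m="$2" '
    /\]/ {
      offset = 2
      if (($(n+offset) ~ /s|p/) && ($(m+offset) ~ /s|p/)) print $(NF-2), $(NF-1)
    }' "$3" | sort -n
}

# Made-up sample data for a 3-node cluster.
csv=$(mktemp)
cat > "$csv" <<'EOF'
id[1],ds1,p,s,,40,vm-a,x
id[2],ds1,,p,s,25,vm-b,x
id[3],ds1,s,p,,10,vm-e,x
EOF

# VMs with copies on both node 1 and node 2, smallest first.
shared_on_nodes 1 2 "$csv"
rm -f "$csv"
```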

Given all this, it could be scripted to look cleaner and could be made more tidy.

Notes from 2018 VMWorld

Day 1 kicked off with the announcement of AWS’ DBs running on vCenter.  This is to dovetail with VMware in AWS.  I’m trying to figure out how this makes sense, as it allows AWS to make its move to on-prem.  Makes me wonder if the thinking here was the same thinking that led the wizard Saruman to make a deal with the dark lord Sauron.

VMware clearly is trying to prepare for a day where there are fewer on-prem workloads.  There is more management with vRealize Cloud in the vRealize Suite.  There is more growth in replication to VMC in AWS.

Meanwhile, on the WAN front, there is the big announcement of the Velocloud purchase and pulling that into the NSX family.  CEO Pat Gelsinger had that as a major announcement in his keynote following the AWS portion.  I thought that NSX was going to be omitted from mention, but they did get their spot before the end of the keynote – so it is still moving along.

On the vendor front, they occasionally threw an obvious line to Dell/EMC, their majority shareholder.  This was especially true when it came to discussion around vSAN.  vSAN seems to have a positive direction, as it was positioned as a relevant technology, as opposed to the bumps it had in years past when they were touting features over adoption.  When talking about HyperConverged, it was vSAN, and it was Dell/EMC.  And there was the subtle exclusion of Cisco on the slides.  Dell/EMC, HPE, and Lenovo were on the compute slides, but not Cisco.

A huge portion of the Day 2 keynote was devoted to social action, which Gelsinger mentioned a bit in Day 1.  He specifically quoted and rejected Milton Friedman’s premise that a company should maximize shareholder value. I’m not sure if (1) he really believes this, (2) he’s following the lead of Michael Dell, the majority shareholder, (3) he’s trying to appease the Millennial crowd who want to see companies doing this – and these are the folks he wants to hire, or (4) some combination.  It is probably some combination that leads to this errant view.  But, there was a Nobel prize winner on Day 2 – though in the overflow room, the audio went out for minutes as they introduced her.  A lot of time was spent trying to come up to speed afterward.

There were two announcements also around vmotion/svmotion on ARM chips and NVIDIA GPUs (not together).

The vendor hall was filled with the vendors you’d expect.  Apparently, t-shirts are back.  At the conferences I attended a year ago, it seemed custom dress socks were the rage.  This was my first multi-day conference at Mandalay Bay.  The catered food is better at the Venetian/Sands convention complex than at Mandalay’s.  Personally, I spent a fair amount of time talking to different vendors that are involved in one of my projects to get advice on supportability of an odd use case that I’m dealing with.

The lab environment was done well – held on the bottom floor away from the top two levels which are much noisier (live music, etc.).

The 25,000 attendees made for congested hallways and escalators.  But, I saw more former Denver colleagues there than I do when I’m around town here.

Problem calculating workloads on Storage, in this case NetApp

Double Black Diamond

With a centralized storage array, there can be front-side limitations (outside of the array to the host or client) and back-side limitations (the actual disk in the storage array).

The problem is that, from the storage array’s point of view, the workloads at any given moment are random and their details are invisible to the array.  So, how to alleviate load on the array has to be determined from the client side, not the storage side.

Take for example a VMware environment with NFS storage on a NetApp array:

Each ESX host has some number of VMs and each ESX host is mounting the same export from the NetApp array.


Let IA(t) = The Storage Array’s front side IOPS load at time t.
Let hn(t) = The IOPS generated from ESX host n at time t, where n runs from 1 to N and N = number of ESX hosts.


The array’s front side IOPS load at time t equals the sum of the IOPS load of each ESX host at time t.

IA(t) = Σ hn(t), summed over n = 1 … N


An ESX host’s IOPS load at time t equals the sum of the IOPS of each VM on that host at time t.

h(t) = Σ VMm(t), summed over the VMs m on that host


A VM’s IOPS load at time t, equals the sum of the Read IOPS & Write IOPS on that VM at time t.

VM(t) = R(t) + W(t)
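As a minimal sketch of the three sums above for one moment in time: the input is made-up per-VM read and write IOPS, one line per VM as "host vm read write". The format and the name sum_iops are assumptions for illustration; nothing here queries a real array or vCenter.

```shell
# Roll per-VM read/write IOPS up to per-host totals and an array total.
sum_iops() {
  awk '
    { vm = $3 + $4; host[$1] += vm; total += vm }   # VM(t) = R(t) + W(t)
    END {
      for (h in host) print h, host[h]              # h(t) = sum of VM(t) on that host
      print "array", total                          # IA(t) = sum of h(t)
    }' | sort
}

# Hypothetical snapshot: two hosts, three VMs.
printf 'esx1 vm1 100 50\nesx1 vm2 200 100\nesx2 vm3 400 300\n' | sum_iops
```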


The Read IOPS are composed of well formed reads and not well formed reads.  “Well formed reads” are reads which will not incur a penalty on the back side of the storage array.  “Not well formed reads” will generate anywhere between 1 and 4 additional IOs on the back side of the storage array.

Let r1 = Well formed IOs

Let r2 = IOs which cause 1 additional IO on the back side of the array.

Let r3 = IOs which cause 2 additional IOs on the back side of the array.

Let r4 = IOs which cause 3 additional IOs on the back side of the array.

Let r5 = IOs which cause 4 additional IOs on the back side of the array.


R(t) = ar1(t) + br2(t) + cr3(t) + dr4(t) + er5(t)

Where a+b+c+d+e = 100% and a>0, b>0, c>0, d>0, e>0


Likewise for writes, with w1 through w5 defined the same way as r1 through r5:

W(t) = fw1(t) + gw2(t) + hw3(t) + iw4(t) + jw5(t)

Where f+g+h+i+j = 100% and f>0, g>0, h>0, i>0, j>0

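Now for the back side IOPS (and I’m ignoring block size here which would just add a factor into the equation of array block size divided by block size).  The difference is that each term is weighted by the total number of back side IOs it causes: 1 for a well formed IO, up to 5 for an IO which causes 4 additional IOs.  Writing Rback(t) and Wback(t) to keep these distinct from the front side R(t) and W(t):

Rback(t) = ar1(t) + 2br2(t) + 3cr3(t) + 4dr4(t) + 5er5(t)


Wback(t) = fw1(t) + 2gw2(t) + 3hw3(t) + 4iw4(t) + 5jw5(t)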
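To put numbers on the penalty, here is a small sketch under one reading of the equations above: the five percentages are the share of front side IOs that fall into each class (0 through 4 additional back side IOs). The function name backside_iops and the sample mix are made up.

```shell
# Back side IOPS for a given front side IOPS and IO mix.
# $1: front side IOPS; $2..$6: percent of IOs causing 0, 1, 2, 3, 4 additional IOs.
backside_iops() {
  awk -v f="$1" -v p1="$2" -v p2="$3" -v p3="$4" -v p4="$5" -v p5="$6" 'BEGIN {
    if (p1 + p2 + p3 + p4 + p5 != 100) exit 1      # the mix must total 100%
    printf "%.0f\n", f * (1*p1 + 2*p2 + 3*p3 + 4*p4 + 5*p5) / 100
  }'
}

# 1000 front side read IOPS with a 60/20/10/5/5 mix.
backside_iops 1000 60 20 10 5 5
```

With that mix, 1000 front side IOPS turn into 1750 back side IOPS, and the array cannot know the mix ahead of time.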

Since the array cannot predetermine the values for a through j, it cannot determine the effects of an additional amount of IO.  Likewise, it cannot determine if the host(s) are going to be sending sequential or random IO.  The load will trend toward random: with n machines writing concurrently, the likelihood of n-1 systems being quiet while 1 is sending sequential IO is low.

Visibility into the host side behaviors from the host side is required.


Jim – 10/01/14


View Jim Surlow's profile on LinkedIn (I don’t accept general LinkedIn invites – but if you say you read my blog, it will change my mind)

How often do disks fail these days?

Black Diamond
Good news & Bad news & more Bad news:   The good news is that there have been a couple of exhaustive studies on hard drive failure rates.  The bad news is the studies are old.   More bad news:  With the growing acceptance of solid state drives, we may not see another study.

Google’s Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso wrote a great paper, “Failure Trends in a Large Disk Drive Population”, submitted to the USENIX FAST conference of ’07.  Reading this paper in today’s context, the takeaways are:

  • The study was on consumer-grade (aka SATA) drives.
  • Operating Temperature is not a good predictor of failure (i.e. it is not conclusive that drives operating in higher temp environments fail more frequently).
  • Disk utilization (wear) is not a good predictor of failure.
  • Failure rates graphed over time look more like a check mark or fish hook curve than a bathtub curve.

    [Figure: Fish hook v. Bathtub curves]

Carnegie Mellon’s Bianca Schroeder & Garth A. Gibson also published a paper, “Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?”, at USENIX’s FAST 2007 conference.  The takeaways from this paper are that their study was on FC, SCSI, & SATA disks and that in the first 5 years, it is reasonable to expect a 3% failure rate, again following a check mark pattern.  After 5 years, the failure rates are, of course, more significant.
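For a sense of scale, here is a back-of-the-envelope sketch that reads the 3% as an annual rate (my assumption, not a figure worked out in the post) and ignores the check mark shape of the curve. The function name and the disk count are made up.

```shell
# Expected disk failures per year for a population of disks.
# $1: number of disks; $2: annual failure rate in percent (assumed).
expected_failures() {
  awk -v disks="$1" -v pct="$2" 'BEGIN { printf "%g\n", disks * pct / 100 }'
}

# 1000 disks at an assumed 3% per year: about 30 failures a year.
expected_failures 1000 3
```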

Here’s the problem:  In 2007, they are analyzing disks, where some are over 5 years old.  So, is there any expectation that these failure rates are good proxies for our disks in the last 6 years?  (The CMU paper has a list of disks that were sampled, if the reader cares to see). It would be natural to assume that disk drive vendors have improved the durability of their products.  But, maybe the insertion of SAS (Serial Attached SCSI) drives into the technology mix might introduce similar failure rates, being that the technology is new(er).

Another study of note was a University of Illinois study by Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky submitted during the ’08 FAST conference, “Are Disks the Dominant Contributor for Storage Failures?”  Today’s takeaway from it is that when there is a single disk failure, another failure tends to follow sooner than the odds would dictate (i.e. the events can be correlated).

This seems to hold today as well: the research presented here was prompted by cascading failures initially stemming from a disk failure.  So, the question is:  When can we expect another disk failure?

The research being authoritative and vendor-neutral would be helpful if the data were current.    Until then, this data needs to be used as a proxy for predictions – or one can use anecdotal information.  sigh.




IOPS, Spinning Disk, and Performance – What’s the catch?

Black Diamond
For a quick introduction – IOPS means Input Output (operations) per second.  Every hard drive has a certain IO performance.  So, forgive the oversimplification: add disks and one gets additional IOPS, which means one gets better performance.

Now, generally speaking, I hate IOPS as a performance characteristic.  I hate them because IOPS can be read or write, sequential or random, and of different IO sizes.  Unless one is trying to tune for a specific application and is dedicating specific disk drives to the application, the measurement breaks down as the description of the assumed utilization lacks accuracy.  For instance, assume a workload of random reads & writes, but then the backups kick off and that ends up being a huge sequential read for long durations.

But, I digress.

Every hard drive has an IOPS rating whether SAS, SATA, or FibreChannel or 7200, 10000 or 15000 RPM.  (see wikipedia for a sample).  When a RAID set is established, drives of the same geometry (speed & size) are put together to stripe the data across the drives. For simplicity’s sake, let’s say one uses a RAID5 set with 6 drives:  that is, the capacity of 1 drive is used for error (parity) checking and 5 for data.  And continuing the example, assume that these are 1TB (terabyte) drives with 100 IOPS per drive.  So, one has 5 TB of capacity and 500 IOPS.  [Let’s imagine these are read IOPS and not write, so I don’t have to get into parity calculations, etc. etc.].    If I could add a drive to the RAID set, then I get another TB and another 100 IOPS.  Nice and linear.

And, my IOPS per TB are constant.  [Again, to simplify, I’m going to assume that it falls in the same RAID set and so I don’t have to consider more parity drive space].  So, none of this should be earth shaking.
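The arithmetic of the example can be sketched as follows, under the same simplifications (read IOPS only, one drive’s worth of capacity and IOPS set aside for parity). The function name raid5_set is made up for illustration.

```shell
# Usable capacity, IOPS, and IOPS per TB for a simplified RAID5 set.
# $1: drives in the set; $2: TB per drive; $3: IOPS per drive.
raid5_set() {
  awk -v n="$1" -v tb="$2" -v iops="$3" 'BEGIN {
    data = n - 1                                   # one drive of capacity goes to parity
    printf "%g TB %g IOPS %g IOPS/TB\n", data * tb, data * iops, iops / tb
  }'
}

raid5_set 6 1 100   # the 6-drive example: 5 TB, 500 IOPS
raid5_set 7 1 100   # one more drive: 6 TB, 600 IOPS, same IOPS per TB
```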

The huge implication here is:  To increase performance, add more disks.   The more disks, the more IOPS, everyone’s happy.  However, that assumes that consumption (and more importantly IOPS demand) has not increased.  The graph on the right looks consistent with the graphs that we saw earlier.

The problem here is that if one adds disks, which adds capacity, and then that capacity is consumed at the same IO rate as the original disk space, the performance curve looks like the graph on the left.  If I’m consuming 100 IOPS per TB and I have 5 TB, that is 500 IOPS of demand.  So, I add a 1TB disk and now I have 600 IOPS w/5TB of used capacity on 6TB of disk.  So, I can spread that out and yippee, those 5TBs can get 120 IOPS per TB.  But, if I also say, “hey, I got another TB of disk space” and then consume it, then I’m back to where I started and am still constrained at 100 IOPS/TB.  So, what good is this?
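The same point as a quick calculation: the IOPS available per consumed TB only improves while the new capacity stays empty. A minimal sketch, with a made-up function name:

```shell
# IOPS available per TB of consumed capacity.
# $1: total IOPS of the disk set; $2: TB actually in use.
iops_per_used_tb() {
  awk -v iops="$1" -v used="$2" 'BEGIN { printf "%g\n", iops / used }'
}

iops_per_used_tb 500 5   # before the extra disk: 100 IOPS/TB
iops_per_used_tb 600 5   # extra disk added, still 5 TB used: 120 IOPS/TB
iops_per_used_tb 600 6   # extra TB consumed: back to 100 IOPS/TB
```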

The assumption is that one is adding to a heterogeneous array, i.e. multi-purpose (maybe multi-user or multi-system).  So, by being multi-purpose, the usage curve should hopefully become more normalized.  If the usage is more homogeneous, e.g. everyone needs fast performance, so we move them from the slow array to the fast array – well, that just means that the fast users are competing with other fast users.

Just like on the NASCAR track for time trials, if I have one race car start and then send another race car when the 1st is half way around the track, I’m probably not going to have contention.  If one customer wants high performance in the evening and the other in the business day, I probably have no contention.

However, on race day after the start, all the cars are congested and some can’t go as fast as they want because someone is slow in front of them – gee, and we moved them off the freeway onto the race track for just this reason.   Well, on the storage array, this is like everyone running end-of-the-month reports, well, at the end of the month.

I need another analogy for the heterogeneous use.  Imagine a road that one guy uses daily, but his neighbor only uses it monthly.  However, the neighbor still needs use of a road, so he pays for the consumption as well.  Overall, there may not be conflict for the road resource – as opposed to, if both used it daily.

So, yes, overall – adding disks does add performance capacity.  And without knowing usage characteristics, the generality of adding disks still holds.  Why?  Because no one complains that the disks are going too fast, they only complain when they are too slow.  There is still the mindset that one buys disk for capacity and not for performance.  And then once performance is an issue, the complaints start.  So, adding disks to a random workload means that the bell curve should get smoother overall.  This won’t end all the headaches, but should minimize them by minimizing the number of potential conflicts.

Let me know what you think

Replication Methodologies

Blue Square
Frequently, when one tries to determine a disaster recovery transfer methodology, one gets stuck on the different methods available as they each have their strengths and weaknesses.  I’ll ignore backups for the time being – backup images can be tapes shipped by truck from primary site to recovery site, or backups replicated by backup appliances.  The focus will be on replication of “immediately” usable data sent over the network from primary site to recovery site.

Types of replication:

  • Application Replication – Using an application to move data.  This could be Active Directory, Microsoft SQL Server, Oracle DataGuard, or other application specific methods
  • OS Replication – This typically uses some application within the Operating System to ship files … this could be something as simple as rsync
  • Hypervisor Replication – For those that run virtualized environments, replication occurs at the hypervisor layer.  This could be VMware vSphere Replication
  • Storage Replication – This is when an array replicates the data.


With Application Replication – one can be confident that the application is replicated in a usable form.  Without quiescing the applications, applications may not be entirely recoverable.  (quiescing is the process of placing the application in a state that is usable – usually, this consists of flushing all data in RAM to disk).  However, the question that inevitably comes up is:  “What about all the other systems I will need at the recovery site?”

With OS Replication – the application that moves the data tends to be fairly simple to understand:  copy this directory structure (or folder) to the receiving system and place it in such and such folder.  However, with this methodology, the question that arises is:  “What about the registry and open files?”  If the process just sweeps the filesystem for changes, what happens to files that are open by some other application (i.e. not quiesced, which could be a problem with database servers)?

With Hypervisor Replication (for virtualized environments) – The hypervisor sits underneath the guest operating system(s).  This has the advantage of catching the writes from the guest.  To write to disk, the guest has to write to the hypervisor and then the hypervisor writes to disk.  The data can then be duplicated at this level.  For this method, the inevitable question is:  “What about my physical machines that haven’t been virtualized?”  One can still have issues with non-quiesced apps.

With Storage Array Replication – The array sends to its peer at the remote site.  The advantage of this method is that it can handle virtual or physical environments.  The disadvantage is that an array on the remote side must match the protected site.  This can be problematic, as frequently, sophisticated arrays with this technology tend to be costly.

Regardless of the methodology, there has to be coordination between the two sides.  The coordination tends to be handled by having the facility at each end provided by the same vendor:  e.g. SQL server to SQL server, VMware vSphere Replication appliances to vSphere Replication appliances, Storage vendor like NetApp to NetApp.  [I’m ignoring Storage Hypervisors and just bundling those into the Storage Array layer.]

Every method has its own set of rules.  Imagine I had a document where I wanted a copy across town.  I could:

  1. walk it over myself – but I would have to follow the rules of the road and not cross against red lights
  2. e-mail it – but I would need to know the e-mail address and the recipient would need to have e-mail to receive it.
  3. fax it – but I would need a fax machine and the recipient would need a fax machine and I would need to dial the number of the recipient’s fax machine.
  4. send it with US postal – but I would need to fill out the envelope appropriately, with name, address, city, state.
  5. call the recipient and dictate it – but I would need the phone number and the recipient needs a phone.  I would also need to speak the same language as the person on the other end of the phone.

These simplistic examples that we use every day are used here to illustrate that there are certain norms that we are so familiar with that we ignore that they are norms.  But, without these norms, the communication could not be done.  Likewise, users should not be surprised that each data replication method has its own requirements and that the vendors are picky as they wish to define the method of communication.


Rev 1.0 06/25/13