Simplivity Storage – hunting VMs when low on storage.

With HPE Simplivity, each VM has storage mirrored on another specific host in the cluster. As an example, in a 3 host VMware cluster, a VM on ESXi host #1 will have a mirrored copy of the storage on ESXi host #2, but no other data on ESXi host #3. Unlike other HyperConverged Infrastructure products in the marketplace, Simplivity mirrors the VM as opposed to striping the VM across other nodes in the cluster.

This has its advantages and disadvantages.

One result is that each node in the cluster ends up with different storage consumption, as the blend of VMs on each host varies. In the following example:

  • ESXi Host #1 could have VMs: a, d, e, f
  • ESXi Host #2 could have VMs: a, b, c, e
  • ESXi Host #3 could have VMs: b, c, d, f

As a result, at any given time, 2 nodes will be closer to full than the remainder. “Closer” could be by a significant or an insignificant margin, but the point is relevant to the script that I’ll include below.

The Simplivity storage environment is controlled by what are called Omnistack Virtual Controllers (“OVCs” for short). Each ESXi host has a dedicated OVC, but the OVCs in a VMware cluster work together.

(I don’t know the Simplivity interface for Hyper-V, so what follows is VMware specific.) From vSphere, it is cumbersome to determine where Simplivity stores the copies of a VM. Clearly, if ESXi Host 1 is nearing full, one will see a VMware alarm and one could go to the host and see the VMs with a primary copy of storage on that host – but what about a VM with a secondary (redundant mirrored) copy? There would have to be some deciphering of the DRS rules.

Simplivity offers a command line tool through the OVC which will list all the VMs in the cluster and where the primary and secondary copies are stored and the VM size: dsv-balance-manual

The drawbacks of this tool are: first, one can only see the size of the VM itself, not the size of the VM AND all its associated backups. Second, it does not report any remote backups copied from another cluster to the hosts, if any exist.

When storage runs tight, removing backups is the most likely path forward. Having a total size for a VM which would include its associated backups would be very helpful – but with de-duplication and compression across the VMs on a host, and with backups expiring at different intervals, this becomes very difficult to calculate and would only be valid at the moment it was calculated.

The first step to gaining space would be to search the backups on the cluster to determine if there are any backups which lack an expiration date, and remove as necessary.

The second step might be to identify which VMs have copies on the 2 fullest hosts.

It is fairly easy to eyeball the output of dsv-balance-manual, but when one runs it often, and if there are many VMs, human error can creep in. I wrote the following CLI pipeline to do this:

node=(`sudo /var/tmp/build/dsv/dsv-balance-show --ShowNodeIndex | sed 's/\(.B \)/ \1/;s/\.\([0-9][0-9]\) TB/\10 GB/' | awk '/^\| Node [0-9]/ {print $3,$15}' | sort -nr +1 | awk 'BEGIN {z=10^6}{b=a;a=$1;if ($1 < z) {z=$1}} END {print a-z+1,b-z+1}'`) ; cl=(`svt-federation-show | awk -F"\| " '/Alive/ {if ($3 ~ /^[a-zA-Z]/) {x=$3};{a=x} ; if ($4 ~ /^[a-zA-Z]/) {y=$4};{b=y} ; print a,b,$9}' | grep \`ifconfig eth0 | awk '/inet/ {print $2}'\``) ; sudo /var/tmp/build/dsv/dsv-balance-manual --datacenter ${cl[0]} --cluster ${cl[1]} > /dev/null ; awk -F, -v n=${node[0]} -v m=${node[1]}  '/\]/ {offset=2 ; if (($(n+offset) ~ /s|p/) && ($(m+offset) ~ /s|p/)) print $(NF-2),$(NF-1)}' /tmp/balance/replica_distribution_file_${cl[0]}.csv | sort -n 

Had I known it would be so long, I probably would have written a script, but then the script would have to be pushed to the OVCs on all the hosts, and with the Simplivity upgrade procedure the OVCs get wiped out and the scripts would have to be re-deployed.

Documentation:

This needs to be run on an OVC in the cluster that is short on space (i.e. won’t work across clusters).

Create the node array – 2 entries with the 2 nodes in the cluster with the least available space.

node=(`

Run the Simplivity command as root to determine which nodes lack space; the output includes space remaining (as opposed to consumed).

sudo /var/tmp/build/dsv/dsv-balance-show --ShowNodeIndex

Add a space between the storage value and its unit label, and convert TB to GB by removing the decimal point and appending a zero.

| sed 's/\(.B \)/ \1/;s/\.\([0-9][0-9]\) TB/\10 GB/'

Find only the lines with details about the nodes (throw out the headers) and print just the node index and the space remaining (the unit label above is now discarded).

| awk '/^\| Node [0-9]/ {print $3,$15}'

Sort numerically by the space remaining (the second field) in descending order.

| sort -nr +1

Print the last 2 entries and reset the node numbering so that it counts from 1 – the output from the earlier Simplivity command, depending on retired equipment, might not start from 1. (Unsure how this behaves if run on a cluster with fewer than 2 hosts – but one would not need to run this script if there were only 1 host.)

| awk 'BEGIN {z=10^6}{b=a;a=$1;if ($1 < z) {z=$1}} END {print a-z+1,b-z+1}'

This completes the array. The contents of the array are 2 numbers reflecting the nodes which have the least space available.

`);
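To see what that last awk does in isolation, here it is against some made-up input (three nodes already sorted by free space in descending order – the node indices 3, 4, and 5 and the sizes are purely hypothetical):

printf '3 900\n4 500\n5 200\n' | awk 'BEGIN {z=10^6}{b=a;a=$1;if ($1 < z) {z=$1}} END {print a-z+1,b-z+1}'

That prints “3 2” – nodes 5 and 4 (the two with the least space remaining), renumbered so that the lowest index in the cluster becomes 1.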

Set the datacenter and cluster variables. This is a lot of code to collect what is already known, but it reduces the human error of misspellings. Set the “cl” array (cluster).

cl=(`

Run the Simplivity command to show all the nodes in the federation.

svt-federation-show

Find only the lines that include nodes. Given the output from above, if the datacenter field (#3) is empty, carry forward the value from the previous line, and if the cluster field (#4) is empty, carry forward the value from the previous line. Finally, print only the datacenter, cluster, and management IP.

| awk -F"\| " '/Alive/ {if ($3 ~ /^[a-zA-Z]/) {x=$3};{a=x} ; if ($4 ~ /^[a-zA-Z]/) {y=$4};{b=y} ; print a,b,$9}'

Filter the output of the above against the output of the command substitution that follows.

| grep \`

Determine the management IP of this OVC.

ifconfig eth0 | awk '/inet/ {print $2}'\`

Finalize the array with datacenter, cluster, and IP (the latter won’t be used).

`) ;
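Before firing off the heavier command below, a quick (and purely optional) sanity check that both arrays were populated doesn’t hurt – this is just plain bash against the variables already set above:

echo "fullest nodes (renumbered): ${node[0]} ${node[1]}"
echo "datacenter: ${cl[0]}  cluster: ${cl[1]}"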

Run the Simplivity command to list the VMs, supplying the datacenter and cluster information so that it can run unattended. Dump the console output to /dev/null, as the command leaves an output file behind.

sudo /var/tmp/build/dsv/dsv-balance-manual --datacenter ${cl[0]} --cluster ${cl[1]} > /dev/null ;

Parse the output file, /tmp/balance/replica_distribution_file_<cluster name>.csv. Use the 2 variables, n & m, to represent the 2 nodes to search for. The CSV has 2 other data points before the per-node columns begin, and each node column holds a “p” or “s” for a primary or secondary copy. The number of nodes determines the number of columns. The 2nd to last column is the VM name, shown by $(NF-1). The 3rd to last column is the VM size, $(NF-2).

awk -F, -v n=${node[0]} -v m=${node[1]} '/\]/ {offset=2 ; if (($(n+offset) ~ /s|p/) && ($(m+offset) ~ /s|p/)) print $(NF-2),$(NF-1)}' /tmp/balance/replica_distribution_file_${cl[0]}.csv

Then sort the output numerically in ascending fashion, so the largest VMs are at the bottom.

| sort -n 
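To illustrate the column arithmetic in that awk, here it is against a single made-up CSV line for a hypothetical 3-node cluster (this is not the real file format – just enough fields to show the offset logic, with n=1 and m=2):

echo '[1],x,p,s,,150,vm-app01,0' | awk -F, -v n=1 -v m=2 '/\]/ {offset=2 ; if (($(n+offset) ~ /s|p/) && ($(m+offset) ~ /s|p/)) print $(NF-2),$(NF-1)}'

That prints “150 vm-app01” – the hypothetical VM has a copy on both node 1 (primary) and node 2 (secondary), so its size and name are reported.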

Given all this, it could be wrapped into a script to look cleaner and tidier.
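For what it’s worth, here is a minimal sketch of what such a wrapper might look like – the same commands as the one-liner, just split into named steps. It is untested in this exact form, so treat it as a starting point that assumes the same OVC paths and command behavior as above:

#!/bin/bash
# find-shared-vms.sh – sketch: list VMs with copies on the two fullest nodes in this cluster

# the two nodes with the least space remaining, renumbered to count from 1
node=( $(sudo /var/tmp/build/dsv/dsv-balance-show --ShowNodeIndex |
  sed 's/\(.B \)/ \1/;s/\.\([0-9][0-9]\) TB/\10 GB/' |
  awk '/^\| Node [0-9]/ {print $3,$15}' |
  sort -nr +1 |
  awk 'BEGIN {z=10^6}{b=a;a=$1;if ($1 < z) {z=$1}} END {print a-z+1,b-z+1}') )

# the datacenter and cluster that this OVC belongs to
myip=$(ifconfig eth0 | awk '/inet/ {print $2}')
cl=( $(svt-federation-show |
  awk -F"\| " '/Alive/ {if ($3 ~ /^[a-zA-Z]/) {x=$3};{a=x} ; if ($4 ~ /^[a-zA-Z]/) {y=$4};{b=y} ; print a,b,$9}' |
  grep "$myip") )

# generate the replica distribution file, then report the shared VMs sorted by size
sudo /var/tmp/build/dsv/dsv-balance-manual --datacenter ${cl[0]} --cluster ${cl[1]} > /dev/null
awk -F, -v n=${node[0]} -v m=${node[1]} \
  '/\]/ {offset=2 ; if (($(n+offset) ~ /s|p/) && ($(m+offset) ~ /s|p/)) print $(NF-2),$(NF-1)}' \
  /tmp/balance/replica_distribution_file_${cl[0]}.csv | sort -n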

Problem calculating workloads on Storage, in this case NetApp

Double Black Diamond

With a centralized storage array, there can be front-side limitations (outside of the array to the host or client) and back-side limitations (the actual disk in the storage array).

The problem is that, from the storage array’s point of view, the workloads at any given moment in time are random, and from the array the details of the workloads are invisible. So, how to alleviate load on the array has to be determined from the client side, not the storage side.

Take for example a VMware environment with NFS storage on a NetApp array.

Each ESX host has some number of VMs and each ESX host is mounting the same export from the NetApp array.

 

Let IA(t) = the Storage Array’s front side IOPS load at time t.
Let hn(t) = the IOPS generated by ESX host n at time t, with n = 1 … (number of ESX hosts).

 

The array’s front side IOPS load at time t equals the sum of the IOPS loads of each ESX host at time t.

IA(t) = Σ hn(t)

 

An ESX host’s IOPS load at time t equals the sum of the IOPS of each VM on the host at time t.

h(t) = Σ VMn(t)

 

A VM’s IOPS load at time t equals the sum of the Read IOPS & Write IOPS on that VM at time t.

VM(t) = R(t) + W(t)

 

The Read IOPS are composed of well formed reads and not well formed reads.  “Well formed reads” are reads which will not incur a penalty on the back side of the storage array.  “Not well formed reads” will generate anywhere between 1 and 4 additional IOs on the back side of the storage array.

Let r1 = Well formed IOs

Let r2 = IOs which cause 1 additional IO on the back side of the array.

Let r3 = IOs which cause 2 additional IOs on the back side of the array.

Let r4 = IOs which cause 3 additional IOs on the back side of the array.

Let r5 = IOs which cause 4 additional IOs on the back side of the array.

Then

R(t) = ar1(t) + br2(t) + cr3(t) + dr4(t) + er5(t)

Where a+b+c+d+e = 100% and a>0, b>0, c>0, d>0, e>0

and

W(t) = fw1(t) + gw2(t) + hw3(t) + iw4(t) + jw5(t)

Where f+g+h+i+j = 100% and f>0, g>0, h>0, i>0, j>0

Now for the back side IOPS (and I’m ignoring block size here, which would just add a factor of array block size divided by IO block size into the equation).  The difference is accounting for the additional IOs.

R(t) = ar1(t) + 2br2(t) + 3cr3(t) + 4dr4(t) + 5er5(t)

and

W(t) = fw1(t) + 2gw2(t) + 3hw3(t) + 4iw4(t) + 5jw5(t)
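
As a purely hypothetical worked example (the percentages are made up just to show the amplification): suppose the front side sees 1000 read IOPS with a = 60%, b = 20%, c = 10%, d = 5%, e = 5%.

Front side: R(t) = 600 + 200 + 100 + 50 + 50 = 1000 IOPS

Back side: R(t) = 600 + 2(200) + 3(100) + 4(50) + 5(50) = 1750 IOPS

The same client workload costs the array 75% more read IOs on the back side – and the array has no way of knowing those percentages ahead of time.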

Since the array cannot predetermine the values for a through j, it cannot determine the effects of an additional amount of IO.  Likewise it cannot determine if the host(s) are going to be sending sequential or random IO.  It will trend toward random, given n machines concurrently writing, since the likelihood of n-1 systems being quiet while 1 is sending sequential IO is low.

Visibility into the host side behaviors from the host side is required.

 

Jim – 10/01/14

@itbycrayon

View Jim Surlow's profile on LinkedIn (I don’t accept general LinkedIn invites – but if you say you read my blog, it will change my mind)

VMware Linked Clones & NetApp Partial writes fun


Double Black Diamond
NetApp OnTap writes data in 4k blocks. As long as the writes to the NetApp are in 4k increments, all is good.

Let me step back. Given all the fun that I’ve experienced recently, I am going to alter my topic schedule to bring this topic forward while it is hot.

One more step back: the environment consists of VMware ESX hosts getting their storage via NFS from NetApp storage. When data is not written to a storage system in increments matching its block size, misalignment occurs. For NetApp, that block size is 4k. If I write 32k, that breaks nicely into 8 4k blocks. If I write 10k, it doesn’t, as it ends up being 2 and a half blocks.


The misalignment problem has been well documented. VMware has a doc. NetApp has a doc. Other vendors (e.g. HP/3PAR and EMC) reference alignment problems in their docs. The problem is well known – and easily googled. With misalignment, more read & write operations are required because the underlying block is not aligned with the block that is going out to storage.

And yay! VMware addresses it in the VMFS-5 file system by making the blocks 1MB in size. That divides up nicely. And Microsoft, with Windows 2008, changed the default partition starting offset, which helped alignment.

So, all our problems are gone, right??

NO.

VMware introduced linked clones, which have a grain size of 512 bytes (see Cormac Hogan’s blog).

Once this issue is discovered, you end up reading more of Cormac’s blog, and then maybe some of Duncan Epping’s, and maybe some of Jack McLeod’s, not to mention Knowledge Base articles from both VMware & NetApp. The recommended solution is to use VAAI and let the NetApp handle clones on the backend. And these 512-byte writes are technically “partial” writes and not “misaligned” ones.

If everything is aligned, then a partial write requires 1 disk read operation (of 4k), an instruction to wedge the 512 bytes appropriately into the 4k block, and 1 write back out. If misalignment exists, then it requires twice the IO operations, since the 512 bytes can straddle two 4k blocks.
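
A trivial way to convince yourself of that – nothing NetApp-specific here, just a hypothetical shell helper doing the block arithmetic:

blocks_touched() { local off=$1 size=$2 bs=4096; echo $(( (off+size-1)/bs - off/bs + 1 )); }
blocks_touched 0 512      # aligned 512-byte write touches 1 4k block: 1 read + 1 write
blocks_touched 3840 512   # misaligned write straddling a 4k boundary touches 2 blocks: 2 reads + 2 writes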

However, if you look at nfsstat -d, you’ll notice that there are a whole bunch of packets in the 0-511 byte range. Wait! I have partial writes, but those show up in the 512-1k bucket. What are all these?

At this point, I don’t entirely know (gee Jim, great piece – no answers), but according to VMware KB 1007909, VMware NFS is doing 84-byte (84!?) writes for NFS locking. Given the count in my 0-511 byte bucket, NFS locking can’t account for all of those – but what does this do to NetApp’s 4K byte blocks?

Jim – 08/30/13
@itbycrayon

View Jim Surlow's profile on LinkedIn (I don’t accept general LinkedIn invites – but if you say you read my blog, it will change my mind)