Example: Back in the early 2000s, Netflix allowed renters to have 3 DVDs at a time, but some customers churned those 3 DVDs more frequently than average and more frequently than Netflix expected. So they throttled those customers and put them at the back of the line (I dug up this reference). This also appears to have happened in their streaming business.
Another example: your web site gets linked on a site that generates a ton of traffic (I should be so lucky). This piece says that the Drudge Report sent 30,000-50,000 hits per hour, bringing down the US Senate's web site. At 36,000 hits per hour, that is an average of 10 per second.
Network bandwidth tends to be the constrained resource. Another example comes from AT&T: as a service provider, this piece says that 2% of their customers consume 20% of their network.
There are non-technical examples as well. The all-you-can-eat buffet is one: some customers will consume significantly more than the average. (Unfortunately, I can't find a YouTube link to a commercial that VISA ran during the Olympics in the 80s or 90s where a sumo wrestler walks into a buffet – if you can find it for me, please reply.)
Insurance companies deal with this as well. They try to spread out the risk so that if an event were to occur (e.g., a hurricane), their customers aren't all concentrated in a single area. Economists call this "adverse selection": how do you diversify the risk so that those who file claims aren't the only ones paying in?
How does this relate to computing? Quotas are one example. I used to run systems with home directory quotas. If I had 100GB but 1,000 users, I couldn't divide it up evenly. About 500 users didn't need even 1MB, while 5 needed 10GB apiece – and the users who did need more than 1MB often needed more than an even slice.
So the disk space had to be "oversubscribed". I could then end up in a situation where everyone stayed under quota, but I still ran out of disk space.
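To put rough numbers on it, here is a quick sketch of how the promises can exceed the physical disk. The quota values are invented for illustration; only the 100GB/1,000-user shape comes from the story above.

```python
# Oversubscription sketch. Quota values below are illustrative only.
CAPACITY_GB = 100  # physical disk from the example above

# Hypothetical quotas: 500 light users at 0.5 GB, 5 heavy users at
# 10 GB, and the remaining 495 at 1 GB each.
quotas_gb = [0.5] * 500 + [10] * 5 + [1] * 495

total_promised = sum(quotas_gb)
ratio = total_promised / CAPACITY_GB
print(f"Promised {total_promised:.0f} GB on {CAPACITY_GB} GB of disk "
      f"({ratio:.1f}x oversubscribed)")

# Everyone can stay under quota while the disk still fills up: if each
# user consumes just 1/ratio of their quota, usage hits 100% of capacity.
```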
Banks do this all the time: they keep far less cash on hand than they have in deposits. They compensate with deposit insurance through the FDIC, which should prevent a run on the bank.
In computing, this happens with network bandwidth, disk space, and compute power. At a deeper level, it comes down to I/O: as CPUs get faster, disks become the bottleneck, and not everyone can afford solid-state disks to keep up with the I/O demand.
Ideally, demand in a cloud computing environment would follow a normal distribution (a bell curve). But that is not what always occurs: demand tends to follow an exponential curve.
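A quick simulation makes that concrete. Assuming exponentially distributed per-customer demand (an assumption for illustration, not measured data), the heaviest few customers grab an outsized share – in the spirit of the AT&T figure above:

```python
import random

random.seed(42)

# 100,000 customers, each with exponentially distributed demand (mean 1 unit).
demand = sorted((random.expovariate(1.0) for _ in range(100_000)), reverse=True)

top_2_percent = demand[: len(demand) * 2 // 100]
share = sum(top_2_percent) / sum(demand)
print(f"Top 2% of customers consume {share:.0%} of total demand")

# An exponential tail yields roughly 10% here; an even heavier tail
# (e.g., Pareto) pushes that toward AT&T's 2%-consume-20% observation.
```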
As a result, if demand cannot be quenched by price increases, then throttling must be implemented to prevent full consumption of the resources. There are many algorithms to choose from on the network side, and likewise there are algorithms for compute.
Given a cloud architecture – a VM on a host, connected to a switch, connected to storage backed by a disk pool of some sort – there are many places to introduce throttles. The image below shows a VMware and NetApp vFiler environment (it could be an SVM, aka vServer, as well): a VM on an ESX host, connected to an Ethernet switch, connected to a filer, which splits between the disk aggregate and a vFiler; the vFiler pulls from a volume sitting on the aggregate, which in turn holds the file.
Throttling at the switch may not do much good, as it would throttle all VMs on an ESX host – or, if not filtering by IP, all ESX hosts. Throttling at the ESX server layer again affects multiple VMs (imagine a single customer on one or many VMs). Likewise, throttling at the storage layer – specifically the vFiler – may impact multiple VMs. The logical thing to do for the greatest granularity is to throttle at the VM or vmdk level: basically, throttle at the end-points. Since a VM can have multiple vmdks, it is probably best to throttle at the VM level. (NetApp Clustered ONTAP 8.2 would allow throttles at the file level.) And not to favor NetApp: other vendors introducing QoS (e.g., EMC, SolidFire) are doing it at the LUN layer, as they tend to be block vendors.
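As a sketch of what an end-point throttle could look like – the parameter values are invented, and real hypervisors and filers implement this in the data path rather than in Python – a token bucket per VM caps IOPS while still allowing short bursts:

```python
import time

class TokenBucket:
    """Per-VM token-bucket throttle (sketch only)."""

    def __init__(self, rate, burst):
        self.rate = rate              # tokens (IOs) refilled per second
        self.burst = burst            # maximum tokens that can accumulate
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost=1):
        """Return True if an IO of the given cost may proceed now."""
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should queue/delay the IO, not drop it

# One bucket per VM gives per-tenant granularity (limits are hypothetical):
vm_throttles = {
    "vm-a": TokenBucket(rate=500, burst=1_000),    # cap at ~500 IOPS
    "vm-b": TokenBucket(rate=2_000, burst=4_000),  # heavier tenant
}
```

Throttling per vmdk would simply mean one bucket per vmdk instead of one per VM.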
For manual throttling, some isolate the workloads to specific equipment – this could be compute, network, or disk. When I worked at the University of California, Irvine, and we saw the dorms coming online with Ethernet to the rooms, I joked that we should drive their traffic through our slowest routers, as we feared they would bury the core network.
The question then becomes: what type of throttle algorithm is best? Since starving the main consumers down to zero throughput is not acceptable, following a network model may be preferred. Something like a weighted fair queueing algorithm may be the most reasonable, though a simpler proposition would be to revert to the quota model for disk space – just set higher thresholds for most users, which will not eliminate every problem, but will catch the majority. For extra credit (and maybe a headache), read this option, which was a network solution that also maximizes throughput.
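To show the idea behind weighted fair queueing, here is a minimal sketch (not any vendor's implementation): each tenant gets a weight, requests are served in order of virtual finish time, and heavy consumers are slowed proportionally rather than starved to zero.

```python
import heapq

class WeightedFairQueue:
    """Minimal weighted-fair-queueing sketch using virtual finish times."""

    def __init__(self, weights):
        self.weights = weights                      # tenant -> weight
        self.vtime = 0.0                            # virtual clock
        self.last_finish = {t: 0.0 for t in weights}
        self.heap = []                              # (finish, seq, tenant, req)
        self.seq = 0                                # tie-breaker for the heap

    def enqueue(self, tenant, request, cost=1.0):
        start = max(self.vtime, self.last_finish[tenant])
        finish = start + cost / self.weights[tenant]
        self.last_finish[tenant] = finish
        heapq.heappush(self.heap, (finish, self.seq, tenant, request))
        self.seq += 1

    def dequeue(self):
        finish, _, tenant, request = heapq.heappop(self.heap)
        self.vtime = finish                         # advance the virtual clock
        return tenant, request

# A heavy tenant (weight 1) and a light tenant (weight 3):
wfq = WeightedFairQueue({"heavy": 1.0, "light": 3.0})
for i in range(6):
    wfq.enqueue("heavy", f"h{i}")
for i in range(2):
    wfq.enqueue("light", f"l{i}")
while wfq.heap:
    print(wfq.dequeue())
```

Running it, the light tenant's two requests are served first, then the heavy tenant's backlog drains – the heavy consumer is slowed, never starved.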
Jim – 11/03/13
(I don’t accept general LinkedIn invites – but if you say you read my blog, I’ll accept)