I had a light bulb moment as I was sitting in fellow ePlus employee Don Mann's session at VMworld.
A little background is needed first. In vSphere 4.0 we didn't really have a need to control the traffic on 10Gb connections. Even if all the traffic types were combined onto a single connection with no traffic management, you rarely ran into contention on the link. vMotion was the most likely to act up due to the bursty nature of its traffic pattern (it hits the connection really hard for a few seconds until the vMotion completes, then settles down), but this was limited because vMotion in vSphere 4.0 was capped at two concurrent vMotions at about 2.6 Gbps each, for a total of 5(ish) Gbps maximum. If you assume a little over 9 Gbps of usable capacity on a 10Gb link (the rest lost to protocol overhead), you still have room for other traffic, and vMotion never bursts high enough to saturate the link.
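The 4.0 numbers above work out like this (a back-of-the-envelope sketch; the 2.6 Gbps per-vMotion figure and the 9 Gbps usable estimate are the assumptions stated above, not measured values):

```python
# vSphere 4.0 worst case: two concurrent vMotions at ~2.6 Gbps each
# on a 10Gb link with ~9 Gbps usable after protocol overhead.
usable_gbps = 9.0
concurrent_vmotions = 2
per_vmotion_gbps = 2.6

vmotion_peak = concurrent_vmotions * per_vmotion_gbps  # ~5.2 Gbps burst
headroom = usable_gbps - vmotion_peak                  # ~3.8 Gbps left over

print(f"vMotion peak: {vmotion_peak:.1f} Gbps, headroom: {headroom:.1f} Gbps")
```

Even at full burst, roughly 3.8 Gbps remains for every other traffic type, which is why traffic control felt optional in 4.0.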
Then, along came vSphere 4.1....
vSphere 4.1 introduced significant vMotion performance enhancements over 4.0: the number of concurrent vMotions in a 10Gb environment increases to eight, and the speed has been increased to 8 Gbps.
When I heard this, a light bulb went off in my head, and I've been poking at this idea with a stick for a while now. I've asked around in the community over the last few days and there seems to be confusion over the numbers. Does that mean eight vMotions at 8 Gbps each, for a total of 64 Gbps maximum, or does it mean eight concurrent vMotions consuming a total of 8 Gbps maximum? I don't have a definitive answer to this question, but tests I have seen conducted point to EACH vMotion consuming up to 8 Gbps. If this is true, anything above ONE vMotion at a time without some form of traffic control may not be a good thing!
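The two readings of the 4.1 numbers can be put side by side (the 9 Gbps usable figure is the same assumption as before):

```python
# Two interpretations of "eight concurrent vMotions at 8 Gbps" in vSphere 4.1.
link_usable_gbps = 9.0  # assumed usable capacity of a 10Gb link
concurrent = 8

# Reading 1: 8 Gbps is a PER-vMotion ceiling
worst_case_demand = concurrent * 8.0   # 64 Gbps of offered load

# Reading 2: 8 Gbps is the AGGREGATE ceiling across all eight
best_case_demand = 8.0

print(f"per-vMotion reading: {worst_case_demand:.0f} Gbps vs {link_usable_gbps:.0f} Gbps usable")
print(f"aggregate reading:   {best_case_demand:.0f} Gbps vs {link_usable_gbps:.0f} Gbps usable")
```

Under the first reading the link is oversubscribed roughly seven times over; even under the second, vMotion alone can eat nearly the whole usable link.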
Does it matter if I'm utilizing 64 Gbps for vMotion or 8 Gbps for vMotion?
The more I think about it, it really doesn't. Let's assume the best case for a second and say that eight vMotions will consume a total of 8 Gbps (I don't think it works this way, but I'm being an optimist). If vMotion can consume a maximum of 8 Gbps of a 10Gb pipe, you will need to design around this fact. Some form of traffic shaping and/or Quality of Service to manage the traffic will be necessary in 4.1, where it was often considered optional previously.
I did a little digging and the issue is confirmed in VMware's NetIOC Best Practices document. To summarize, your results may vary (and not in a good way) if you aren't putting some form of control on your vMotion traffic in conjunction with 10Gb links.
Oh, before I get a bunch of comments telling me this: I'm picking on vMotion here, but you could just as easily do a global replace in this article, swapping in (your favorite chatty and/or spiky traffic type) for vMotion. The concepts for solving network congestion are the same.
How do we solve this issue?
There are two main ways to solve bandwidth contention. One is to place a cap on the amount of traffic vMotion can use; this is often referred to as rate limiting the links. The second is to give priority based on a weighted system that kicks in when contention takes place. This is called Quality of Service, or QoS. With QoS, everyone gets some bandwidth, but no one is allowed to take over completely, and priority is given to critical traffic. I wrote an article on the concepts in the past here, and Brad Hedlund wrote a great article on the concepts with cool Flash animations here. Don't get hung up on the fact that we both wrote about HP and Cisco; the concept of rate limits vs. QoS still stands.
In my opinion, a QoS or shares-based priority model is much more effective for controlling this traffic. It allows for better utilization of the bandwidth and provides a more flexible alternative to rate limiting.
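To see why, here's a toy model of the two approaches on a 10Gb link. This is NOT how vSphere's scheduler is implemented; the traffic names, demands, caps, and share values are all illustrative assumptions, and the shares function is just a weighted fair-split sketch.

```python
LINK_GBPS = 10.0

def rate_limited(demands, caps):
    """Hard cap: each traffic type gets min(demand, cap),
    even when the rest of the link is sitting idle."""
    return {t: min(d, caps[t]) for t, d in demands.items()}

def shares_based(demands, shares, capacity=LINK_GBPS):
    """Weighted split under contention: divide capacity in proportion
    to shares, but never give a type more than it asks for; bandwidth
    unused by light consumers is redistributed to hungry ones."""
    alloc = {t: 0.0 for t in demands}
    active = set(demands)
    while active and capacity > 1e-9:
        total = sum(shares[t] for t in active)
        grant = {t: capacity * shares[t] / total for t in active}
        done = {t for t in active if demands[t] - alloc[t] <= grant[t]}
        if not done:
            # Everyone still hungry: hand out the proportional grants.
            for t in active:
                alloc[t] += grant[t]
            capacity = 0.0
        else:
            # Satisfy the light consumers and recycle their leftovers.
            for t in done:
                need = demands[t] - alloc[t]
                alloc[t] += need
                capacity -= need
            active -= done
    return alloc

demands = {"vm": 4.0, "vmotion": 8.0, "mgmt": 0.5}  # Gbps wanted right now
shares  = {"vm": 50,  "vmotion": 30,  "mgmt": 20}   # relative priority
caps    = {"vm": 5.0, "vmotion": 3.0, "mgmt": 1.0}  # hard rate limits

print(rate_limited(demands, caps))    # vmotion pinned to 3.0 Gbps
print(shares_based(demands, shares))  # vmotion soaks up the leftover 5.5 Gbps
```

The rate-limit model leaves vMotion stuck at its cap even though the other types only want 4.5 Gbps between them; the shares model lets vMotion absorb everything the others aren't using, while still protecting them if they ramp up.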
How do Rate Limits and QoS fit into vSphere?
Here is a simple graphic to illustrate the virtual switch options in vSphere today:
This concludes the first article in this series. I will explore the rate limiting options (vSS and vDS with 4.0) in the next article and conclude with the QoS based options (vDS with 4.1 and Cisco 1000v).
Lastly, a big Thank You!! to the following people for their help on the article and for allowing me to bounce questions off them: Don Mann, Ron Fuller, Joe Onisick, Sean McGee, Brad Hedlund & Stevie Chambers
Do you have any information to add? What are your thoughts? Please leave a comment!