Ensuring servers provide consistent performance is a primary goal for all infrastructure management services. A large portion of our servers run in a virtualized environment, and the additional complexity involved there can present some challenging performance issues. The common solution of throwing faster CPUs and more RAM at performance problems may resolve most of your issues, but in many cases it takes deeper analysis to uncover the not-so-obvious bottlenecks.
Here are 10 pitfalls I’ve encountered that may be negatively impacting your VMware environment:
- VMware tools. Yes, this is a very obvious item. VMware tools not only provide an optimized NIC driver but, more importantly, include a memory ballooning driver. The balloon driver encourages your guest to swap out inactive memory pages, which can be very useful, particularly on over-committed hosts. The pitfall I frequently run into is that our Linux machines are patched and rebooted on a regular basis. Some of these updates include a new kernel, and when that is the case, the VMware tools kernel modules need to be rebuilt. This sounds like a good candidate for a custom Nagios plugin! The plugin could run lsmod and make sure the VMware modules are present.
- Storage tradeoffs. In a perfect world, we want large, fast, inexpensive disks in our storage array. Large disks in the 2, 3, and 4TB range are typically limited to the SATA variety. Conversely, building a pure SSD-based storage array could easily run you into the $60k range for only 10TB of space. SAS is a great middle ground. 10K RPM drives are now available in 900GB 2.5″ form factors, so density is a plus there as well. In my experience, slow storage is one of the most common bottlenecks. A 10-spindle SATA array with a quality RAID controller may still provide disappointing results when coupled with an intense workload such as virtualized databases. SSD caches can be implemented in a couple of ways to help boost performance. From VMware’s perspective, vSphere 5 now lets you migrate VM swap files onto SSD disks. From the array’s perspective, RAID controllers may feature SSD caching as well. One example is Adaptec’s maxCache feature.
- Cores vs Clock Speed. Back to my comment about throwing more CPUs at a performance problem: there are cases where more is not necessarily better. It’s important to match your CPU type to the workload. For instance, if your VM workload consists of a few single-threaded applications, you will want the fastest CPUs available, not more CPU cores. However, if your VM workload consists of something like a virtual desktop infrastructure (VDI), you’ll likely care more about the total number of CPU cores available to the host.
- Host density. While it may be great to tout the “consolidation ratio” you’ve achieved to upper management, the reality is that you need to be prepared to have a host failure at some point or another. When that moment arrives, assuming your remaining hosts even have the spare capacity, how quickly can the failed machines recover? When pricing out a new environment, perhaps it makes sense to look at reducing the specs of several hosts slightly so that an additional one can fit in the budget.
- Network bandwidth. Most servers nowadays include two gigabit Ethernet interfaces, sometimes four. Two will be enough to get you by, but it is not ideal. Consider a situation where you have management and vMotion traffic on NIC A, and VM Network traffic on NIC B. You could potentially lose management access to your host if vMotion traffic saturates the link. For new installations, consider migrating to 10Gb Ethernet, which should provide more than adequate bandwidth for all traffic combined.
- Lack of capacity planning. For some reason or another, when customers hear the term virtual, they assume that there’s no incremental cost involved in adding VMs. In actuality, we know that nothing is free. Because VMs are also trivial to provision, we’re frequently tempted to give in to requests without giving them much thought. Instead of completely pushing back on the customer, perhaps make it a policy that each new virtual machine that comes online should have a capital budget associated with it. When host density reaches a certain point, the budget should have enough to cover a new host along with the supporting storage, licensing, and other infrastructure costs.
- Inventory. This goes hand in hand with capacity planning. Know what VMs live in your environment, who owns them, and what applications are tied to them. Quarterly or even yearly check-ins with your customers may reveal that a significant number of VMs are associated with retired applications or cancelled projects.
- Resource Pools. Configuring resource pools can be tedious and time-consuming, but they may make your life much easier in the long run. If you find it difficult to carve out resource pools based on departments or functional groups, a quick win may be to simply create a “prod” and a “dev” pool. Non-critical development or test machines can be pooled together with a smaller number of resource shares. Additionally, you could leverage host affinity so that critical machines run on your newer, faster hosts.
- Lack of visibility. Visibility is an important part of ensuring consistent performance. Oftentimes, a customer will mention to me that a VM “feels” slower. In order to make an accurate comparison, we need historical metrics. While the built-in vSphere performance tools are great, I find myself looking immediately at Veeam ONE instead. Veeam provides a nice consolidated view of all your vSphere instances, with dashboard graphs that make problems easy to pinpoint.
- Expectations. The hardware you’ve been blessed with can only perform so well. Keeping expectations in line may be all there is to the solution. For new projects, perform not only functional testing of your application, but also a performance qualification. If possible, do the same for P2V conversions. You may uncover a potential performance issue even before going live.
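The Nagios plugin suggested under the VMware tools item could look roughly like the sketch below. It parses lsmod output and follows the standard Nagios exit-code convention (0 OK, 2 CRITICAL, 3 UNKNOWN). The module names are assumptions on my part: vmw_balloon or vmmemctl for ballooning and vmxnet3 or vmxnet for the NIC are common, but the names vary with the tools version, a guest using an emulated NIC will not load vmxnet at all, and a driver compiled into the kernel will not appear in lsmod. Adjust the groups to match what a healthy guest in your environment reports.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# ASSUMPTION: module names differ between VMware tools / open-vm-tools
# versions and kernels. Any one name in a group satisfies that group.
REQUIRED_GROUPS = {
    "balloon driver": {"vmw_balloon", "vmmemctl"},
    "paravirtual NIC driver": {"vmxnet3", "vmxnet"},
}


def loaded_modules(lsmod_output):
    """Return the set of module names found in `lsmod` output."""
    lines = lsmod_output.strip().splitlines()[1:]  # skip the header row
    return {line.split()[0] for line in lines if line.strip()}


def evaluate(modules):
    """Map the set of loaded modules to a (status, message) pair."""
    missing = sorted(
        label for label, names in REQUIRED_GROUPS.items()
        if not names & modules
    )
    if missing:
        return CRITICAL, "CRITICAL - missing VMware modules: " + ", ".join(missing)
    return OK, "OK - VMware modules loaded"


def main():
    """Run lsmod, print a one-line status, and return the exit code."""
    try:
        result = subprocess.run(
            ["lsmod"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"UNKNOWN - could not run lsmod: {exc}")
        return UNKNOWN
    status, message = evaluate(loaded_modules(result.stdout))
    print(message)
    return status
```

To use it as a plugin, finish the script with `raise SystemExit(main())` so Nagios sees the exit code. Keeping the parsing and evaluation separate from the lsmod call makes the logic easy to test against captured output from a known-good guest.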
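The host density question, whether the remaining hosts have the spare capacity to absorb a failure, can be checked with some quick arithmetic. The sketch below is a deliberately coarse, aggregate check on memory only. The function name, the headroom parameter, and all of the numbers are hypothetical, and it ignores CPU, per-VM placement, and whatever admission control your cluster enforces.

```python
def survives_host_failure(hosts, headroom=0.0):
    """Check whether the cluster can absorb the loss of any single host.

    hosts    -- list of (capacity_gb, used_gb) tuples, one per host
    headroom -- fraction of surviving capacity to keep free (0.1 = 10%)

    Aggregate check only: it compares total memory in use against the
    capacity left after removing each host in turn.
    """
    total_used = sum(used for _, used in hosts)
    for failed in range(len(hosts)):
        remaining = sum(
            cap for i, (cap, _) in enumerate(hosts) if i != failed
        )
        if total_used > remaining * (1 - headroom):
            return False
    return True


# Hypothetical clusters carrying the same 540 GB of VM memory in total.
three_large = [(256, 180)] * 3   # three 256 GB hosts
four_small = [(192, 135)] * 4    # four 192 GB hosts

print(survives_host_failure(three_large))  # False: 540 GB > 512 GB left
print(survives_host_failure(four_small))   # True: 540 GB <= 576 GB left
```

With these made-up figures, four slightly smaller hosts tolerate a failure where three larger ones do not, which is the argument for trading a little per-host spec for one more host in the budget.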
Michael Solinap
Sr. Systems Integrator