vNUMA CPU Alignment – doing it right

As I’ve said before, the majority of our VMware environment is running on Cisco UCS blade servers, and the majority of these are running dual hex-core CPUs. With the broad spectrum of Operating Systems and applications running across our many hundreds of VMs, there are inevitably many, many VMs with multiple vCPUs.

NUMA, in a nutshell, gives each physical CPU its own local memory bus, allowing faster memory access for workloads running on that specific CPU; memory attached to another CPU is slower to reach. This is particularly important for latency-sensitive workloads. vNUMA is VMware's implementation, exposing the underlying NUMA topology to VMs running on ESXi to reduce memory latency.
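
If you want to see the NUMA layout your hosts actually present, the vSphere API exposes it. Below is a minimal pyVmomi sketch, not from my original setup; the vCenter address and credentials are placeholders:

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import ssl

# Placeholder connection details -- substitute your own vCenter and credentials.
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.HostSystem], True)
for host in view.view:
    cpu = host.hardware.cpuInfo    # physical packages, cores and threads
    numa = host.hardware.numaInfo  # NUMA node count as seen by ESXi
    cores_per_node = cpu.numCpuCores // max(numa.numNodes, 1)
    print(f"{host.name}: {cpu.numCpuPackages} sockets, {cpu.numCpuCores} cores, "
          f"{numa.numNodes} NUMA nodes (~{cores_per_node} cores per node)")

view.DestroyView()
Disconnect(si)

On dual hex-core blades like ours, this should report 2 NUMA nodes with 6 cores each.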

When looking into CPU performance issues with a guest VM, we found no host-level contention for CPU resources. The VM performance graphs in the vSphere client, however, showed CPU Ready times spiking frequently; this was on a VM configured with 2 sockets x 8 cores (16 vCPU).

This shows the CPU Ready/Usage stats before and after re-aligning the vCPU configuration from multiple cores per socket to multiple single-core sockets; a marked difference.
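
As a side note on reading those graphs: the real-time performance charts report CPU Ready as a summation in milliseconds per 20-second sample, so it is worth converting to a percentage to judge how bad the spikes really are. A rough Python helper:

# Convert a CPU Ready 'summation' value (ms) from the vSphere performance
# charts into a percentage. Real-time charts sample every 20 seconds.
def cpu_ready_percent(ready_ms, interval_s=20, num_vcpus=1):
    # Dividing by num_vcpus gives the average ready time per vCPU.
    return (ready_ms / num_vcpus) / (interval_s * 1000.0) * 100.0

# e.g. 12,800 ms of ready time in one 20 s sample on a 16 vCPU VM:
print(cpu_ready_percent(12800, num_vcpus=16))  # 4.0 -> ~4% ready per vCPU

A figure of around 5% ready per vCPU is often quoted as the point at which to start investigating.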


This problem is fairly well documented, and is detailed in VMware KB 1026063: when utilising the NUMA features of VMware, it is important to configure the vCPU layout of your VMs to align with the physical characteristics of your hosts where possible. Where identical physical CPU configurations cannot be guaranteed across all your hosts (and a VM may therefore be vMotioned between hosts with different NUMA topologies), a better solution is to define the vCPUs as n sockets with a single core each.

In this case our physical CPU architecture is 2 sockets x 6 cores, while the VM presented virtual sockets of 8 cores each. An 8-core virtual socket cannot fit within a single 6-core physical socket, so the hypervisor and guest OS cannot cleanly utilise either one socket or both, and the VM misses out on the speed benefits which NUMA can bring. This can be exhibited as high, or consistently high, CPU Ready times for the VM guest.
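
To make the arithmetic explicit: a virtual socket stays within one NUMA node only if its core count is no larger than the cores per physical node. A trivial sketch (the function name and values are illustrative only):

# Does each virtual socket fit inside one physical NUMA node?
def vsocket_fits_numa_node(cores_per_vsocket, cores_per_pnode):
    return cores_per_vsocket <= cores_per_pnode

# The case above: 8-core virtual sockets on 6-core physical nodes.
print(vsocket_fits_numa_node(8, 6))   # False -> virtual socket spans two nodes
print(vsocket_fits_numa_node(6, 6))   # True  -> aligns with the 2 x 6 host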

There is a great VMware blog article about the performance implications of vNUMA design selection, which echoes the VMware Best Practices guide in stating that the best way to approach this is to set your vCPU configuration 'flat and wide'. This means that if your VM requires 8 vCPUs, configure 8 sockets with 1 core each, rather than 2 quad-core or 4 dual-core sockets.
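
For what it's worth, applying 'flat and wide' through the API just means setting the vCPU count and leaving cores-per-socket at 1. A minimal pyVmomi sketch; the VM lookup and connection details are placeholders, and CPU changes require the VM to be powered off unless CPU hot-add is enabled:

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim
import ssl

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)

# Placeholder lookup -- FindByDnsName needs VMware Tools reporting a DNS name,
# and returns None if no match is found.
vm = si.RetrieveContent().searchIndex.FindByDnsName(None, "myvm.example.com", True)

# 8 vCPUs presented 'flat and wide': 8 sockets x 1 core each.
spec = vim.vm.ConfigSpec(numCPUs=8, numCoresPerSocket=1)
WaitForTask(vm.ReconfigVM_Task(spec=spec))

Disconnect(si)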

This allows the vNUMA technology to balance the load as it best sees fit, and should prevent CPU Ready spike issues. There are of course edge cases where specific software licensing may force you to use as few sockets as possible; in that case, alignment to the physical host CPU should always be attempted. In the case above, the VM could have been configured as 1 socket x 6 cores, or 2 sockets x 6 cores. Be aware that stepping outside of the 'flat and wide' model will prevent vNUMA from doing its job; ESXi will bow to your judgement of vCPU configuration, and this means you had better have got it right!
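
If licensing does cap your socket count, the choice above is easy to automate. A purely illustrative helper, which rounds the required vCPUs up into the fewest cores per socket that still fit within one NUMA node:

# Suggest a sockets x cores layout for 'required_vcpus' that keeps each
# virtual socket within one physical NUMA node, under a licensing cap on
# socket count. Rounds cores up, so the total may slightly exceed the ask.
def suggest_layout(required_vcpus, cores_per_pnode, max_sockets):
    for sockets in range(1, max_sockets + 1):
        cores = -(-required_vcpus // sockets)  # ceiling division
        if cores <= cores_per_pnode:
            return sockets, cores
    return None  # cannot align: reduce vCPUs or revisit the licence cap

print(suggest_layout(8, 6, 2))   # (2, 4) -> 2 sockets x 4 cores
print(suggest_layout(12, 6, 2))  # (2, 6) -> 2 sockets x 6 cores
print(suggest_layout(16, 6, 2))  # None   -> 8 cores/socket would span nodes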
