VM Swap File location considerations

When a VM is powered on, a .vswp (virtual machine swap) file is created (note that there is also a VMX swap file, created in the same location as the VM; this is separate from this discussion, but is described here). Its size is the memory allocation for the VM less any configured reservation. If there is not sufficient space in the configured swap file location to create this file, the VM will not power on. The use of this file for memory pages is a last resort, and will be considerably slower than using normal memory, even if that memory is compressed or shared. It is always possible to end up in a situation where memory contention occurs and swap file use begins, so to prepare for this a system design should consider the location of swap files for VMs. Below I discuss some of the considerations which should be made when placing a VM swap file:

  • The default location for the swap file is the same datastore as the VM, which presents the following problems:
    • Performance – it is unlikely that the datastore the VM sits in is on top-tier storage, or dedicated to a single VM. This means that the difference in speed between memory IO and swapping IO once contention occurs will be great, and the additional IO this swapping produces could well impact other workloads on the datastore. If multiple VMs share this datastore, and all are running on the host with memory contention issues, this will be further compounded and could see the datastore's performance plummet
    • Capacity – inevitably, administrators will keep chucking workloads into a datastore, and unless Storage DRS with datastore clusters is being used, or the administrators are proactive in balancing storage workloads, there will come a time when a VM will not power on due to insufficient space to create the .vswp file. This is particularly likely after a change to the VM configuration, such as adding more disk or memory; a quick report of estimated swap file sizes, like the sketch below, can help spot this before it bites
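
A minimal PowerCLI sketch of that kind of report follows. It is an approximation only (configured memory minus memory reservation, for powered-on VMs), and the vCenter name is a placeholder.

# Rough estimate of .vswp size per powered-on VM: configured memory minus memory reservation.
Connect-VIServer -Server "vcenter.example.local"   # placeholder vCenter name

Get-VM | Where-Object { $_.PowerState -eq "PoweredOn" } | ForEach-Object {
    $resCfg = Get-VMResourceConfiguration -VM $_
    [PSCustomObject]@{
        VM              = $_.Name
        MemoryMB        = $_.MemoryMB
        ReservationMB   = $resCfg.MemReservationMB
        EstimatedVswpMB = $_.MemoryMB - $resCfg.MemReservationMB
    }
} | Sort-Object EstimatedVswpMB -Descending | Format-Table -AutoSize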

VM swap file location can be changed at either the cluster or the host level. When choosing to move this from the default, the following should be considered (a PowerCLI sketch of changing these settings follows the list):

  • Place swap files on the fastest storage possible – if you can place this on flash storage then fantastic; this will not be as quick as paging to/from memory, but it will be orders of magnitude better than placing it on spinning disk
  • Place swap files as close to the host as possible – the latency incurred by traversing your SAN/IP network to reach shared storage will impair guest performance when swapping occurs. Be aware that although the default location can be changed to host-local storage (which will probably give the best performance if the host has internal flash storage), this will massively impair vMotion performance, as the entire .vswp file has to be copied from the source host's disk to the destination host's disk during the vMotion activity
  • Do not place the .vswp on replicated storage – as with location selection for guest OS swap files, there is no point in placing this file on replicated storage; if the VM is reset or powered off the file is deleted. If your VMs are on storage which is replicated as part of its standard capability, the .vswp files should definitely be located elsewhere
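
For reference, the same change can be made with PowerCLI. The sketch below assumes host-local flash datastores exist and follow a hypothetical naming convention; the cluster and datastore names are placeholders.

# Sketch: set the cluster to use host-specified swap file locations, then point each
# host at a (hypothetically named) local flash datastore.
$cluster = Get-Cluster -Name "Prod-Cluster-01"               # placeholder cluster name

Set-Cluster -Cluster $cluster -VMSwapfilePolicy "InHostDatastore" -Confirm:$false

Get-VMHost -Location $cluster | ForEach-Object {
    $localDs = Get-Datastore -Name "local-flash-$($_.Name)"  # placeholder naming convention
    Set-VMHost -VMHost $_ -VMSwapfileDatastore $localDs -Confirm:$false
}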

 

In terms of configuring the location, as stated above, this is set at the VM, host, or cluster level. If this is inconsistent across hosts in a cluster then again this may impact vMotion times as a VM migrates from a host with one configured location to another with a different location. As with most settings which can be made at the cluster level, consistency should be maintained across the cluster unless this is not possible. Bear in mind, though, that having the vswp location consistent across the cluster, and defined as a single datastore, could lead to high IOPS on that datastore should cluster-wide memory contention occur, especially with large clusters.

As stated at the beginning of this article, swap files are sized based on VM memory allocation less reservation size. By right-sizing VMs and utilising reservations, swap file sizes, and their usage, can be kept to a minimum, and these planning considerations should take precedence over all others. Hopefully memory contention will never be so bad that swap is required, but when the day does come it is good to be prepared, by making informed and reasoned decisions early on.
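
As a minimal example of the reservation side of this, a memory reservation can be applied with PowerCLI; the VM name and size below are placeholders, and the smaller .vswp takes effect at the next power-on.

# Sketch: reserve 4GB of a right-sized VM's memory so its swap file shrinks accordingly.
Get-VM -Name "app-server-01" |                       # placeholder VM name
    Get-VMResourceConfiguration |
    Set-VMResourceConfiguration -MemReservationMB 4096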


VMFS Extents – to extend or not to extend, that is the question

Extents allow disks presented to a vSphere system to be added to a VMFS datastore to extend the file system; this aggregates multiple disks together and can be useful in a number of scenarios. Recently I saw problems where extents were being used spanning two storage systems; one of the storage systems had a controller failure which caused SCSI reservation issues on one of the LUNs making up the extent, and this caused the entire datastore to go offline.

In this article I want to discuss some of the benefits and potential pitfalls in using VMFS extents in vSphere environments. Ultimately this is an available, supported, and sometimes useful feature of vSphere but there are some limitations or weaknesses that using this can bring.

Advantages:

  • Using extents allows you to create datastores larger than a single LUN, up to the maximum datastore size supported by vSphere; this is particularly relevant for pre-VMFS-5 datastores, where a single extent is limited to 2TB. It can be useful to create large datastores for the following reasons:
    • There may be a requirement to natively present a VMDK which is larger than the maximum LUN size available on your storage system. For example, if 2TB is the largest LUN you can present, but you need a 4TB disk for the application your VM is hosting, the aggregation of disks will allow the creation of a VMFS datastore large enough to deliver this without the need to span volumes in the guest OS, or to fall back to using something like RDMs, which may impinge on other vSphere functionality
    • Datastore management will be simplified with fewer VMFS datastores required. The fewer datastores there are, the less an administrator has to keep their eye on. In addition, decisions about VM placement become considerably simpler when there are fewer choices
  • Adding space to a datastore with capacity issues; in a previous role we were constrained by storage space more than any other resource. This meant that on both the storage system (a NetApp FAS2050 with a single shelf of storage) and at the VMFS level, the design left little to no room to extend a VMDK should it be required. If we did need to add space to a VMDK, we had to extend the volume and LUN by the required amount on the filer, and add a small extent to the datastore in vSphere

Disadvantages:

  • Introduces a single point of failure; whether you are aggregating disks from one or multiple storage systems, by adding extents to a volume the head extent (the first LUN added to the datastore) becomes a single point of failure for the whole datastore, and if any of the LUNs become unavailable, any VM with blocks whatsoever on the lost LUN will no longer be available
  • Management from the storage side can become more difficult; given that there may be multiple LUNs, potentially from multiple storage systems, aggregated to form a single datastore, it is harder to identify from the storage side which LUNs relate to which datastores in vSphere. To combat this it is important to document the addition of extents well, and to label LUNs accordingly on the storage system (a PowerCLI sketch for mapping extents back to LUNs follows this list)
  • If extents are combined which span different storage devices then there may well be a loss in performance
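
To help with that documentation, the extent information is visible from the vSphere side, so a small PowerCLI report can map each VMFS datastore back to the LUNs (by canonical name) that make it up. A sketch, using the ExtensionData properties exposed by Get-Datastore:

# Sketch: list every extent (LUN) backing each VMFS datastore.
Get-Datastore | Where-Object { $_.Type -eq "VMFS" } | ForEach-Object {
    $ds = $_
    $ds.ExtensionData.Info.Vmfs.Extent | ForEach-Object {
        [PSCustomObject]@{
            Datastore = $ds.Name
            ExtentLun = $_.DiskName        # e.g. an naa. canonical name
            Partition = $_.Partition
        }
    }
} | Sort-Object Datastore | Format-Table -AutoSize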

The above is all just based on my experience, but there are legitimate use cases both for and against extents. My personal preference would be to present a new, larger LUN where possible, format it with VMFS, and use Storage vMotion to migrate VMs to the new datastore (a sketch of this follows). Given that VMFS-5 introduced GPT as the partitioning method for LUNs, we can now create single-extent datastores up to 64TB in size, so the requirement for extents should be diminished. There are often legitimate reasons, especially in older environments, why this is not practical or possible, however, and in these cases using extents is perfectly valid.
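A sketch of that approach with PowerCLI is below; the host name, datastore names, and the LUN canonical name are all placeholders for your own environment.

# Sketch: format a newly presented LUN as a single-extent VMFS-5 datastore, then
# Storage vMotion everything off the old (extent-based) datastore.
$vmhost = Get-VMHost -Name "esxi01.example.local"            # placeholder host

$newDs = New-Datastore -VMHost $vmhost -Name "Prod-DS-Big-01" `
    -Path "naa.600508b1001c0000000000000000abcd" -Vmfs -FileSystemVersion 5   # placeholder LUN

Get-VM -Datastore (Get-Datastore -Name "Prod-DS-Old-01") |
    Move-VM -Datastore $newDs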

PowerCLI – getting on the road

PowerShell is a tool I have been interested in for a number of years; unfortunately, in my previous role I was limited to using VBScript for automation because we ran a purely Windows Server 2003 environment, and one without even PowerShell v1.

When I started in my current role, I was excited to start using PowerShell to replace VBScript as my primary scripting language. I have found a number of resources which are great for learning PowerShell, not least of which is the PowerShell in a Month of Lunches book by Don Jones. I started reading, but ended up learning through just getting my hands dirty.

So I rolled up my sleeves and started using PowerShell to do small things. Once I felt a little more comfortable with it, I decided to download VMware PowerCLI and give that a go. Never did I realise that I needed a tool so badly until I had it; I was blown away by what PowerCLI allowed me to do. Since then I have built up a good library of scripts which allow me to report on, and automate, some of the daily VMware grind I experience in my work. Examples of scripts I have created which provide functionality not available natively in vSphere are:

  • A script to go through a vCenter instance, identifying the Path Selection Policy for all disks, and flagging and offering to correct those which are not set correctly
  • Scripts to list all VMs with RDMs attached, which helped us to plan for patching our hosts (a minimal sketch of this follows the list)
  • A daily check script which exports all vCenter alarms to an HTML page to allow alarms across multiple vCenters to be checked quickly
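
As an illustration of how little code some of these take, a minimal sketch of the RDM report might look like the following (the output path is a placeholder, and my actual script differs):

# Sketch: find every VM with a raw device mapping attached and export the details.
Get-VM |
    Get-HardDisk -DiskType RawPhysical, RawVirtual |
    Select-Object @{N = "VM"; E = { $_.Parent.Name } },
                  Name,
                  DiskType,
                  ScsiCanonicalName,
                  CapacityGB |
    Export-Csv -Path "C:\Reports\rdm-report.csv" -NoTypeInformation   # placeholder path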

I decided to list some of the resources which can really help to get going with PowerCLI; hopefully this will assist someone looking to learn to use this toolset:

  • PowerShell in a Month of Lunches book by Don Jones – this book is great for getting the basics of PowerShell down
  • Notepad++ – there are other tools available, and some recommend Quest's PowerGUI, but this is what I use for PowerCLI scripting day to day. It is only available on Windows, but if you are using PowerCLI then you are using Windows, so this shouldn't be too much of a problem
  • VMware’s PowerCLI Documentation site – this site is invaluable for the explanations of specific cmdlets and the available parameters
  • VMware vSphere PowerCLI Reference by Luc Dekens and Alan Renouf – this was the first book released on PowerCLI and has some great examples of what you can do with the tool

I was watching this interesting interview of Alan Renouf by Mike Laverick on his Chinwag Reloaded podcast. Alan now works on the PowerCLI team at VMware and has been a key player in the PowerCLI community since its inception. I thought I would include some of this information here to show the developments to PowerCLI which are just around the corner.

Here they discuss a new feature relating to PowerCLI (and available here), which started as a VMware Fling and has been developed by the internal team responsible for the vSphere Web Client. The headline feature is called 'PowerActions', and it will allow your library of scripts to be stored within vSphere and made accessible to other administrators through the Web Client as you see fit. This was of interest to me as I have been looking at Git-style repositories to allow for version control and peer review, which our company can use to store, share, and manage our ever-growing library of PowerCLI scripts.

Further to this, and exciting for people running an OS other than Windows, or working in an environment where installation of PowerCLI is not possible, a shell will be available through the Web Client. This will be a great help when troubleshooting and wanting to run some of the more useful PowerCLI commands without having to fire up and connect your usual shell.

There is even more new functionality coming as well: the ability to create new menu items in the Web Client which will run your scripts and return the output in a message box. This will mean that for common tasks, like reporting or bespoke automation not present in vSphere, you can present these simply to other administrators or users of vSphere, and they can use the additional functionality without needing any knowledge of PowerCLI.

Where other vendors have provided limited API integration through their own PowerShell libraries, it is great to see VMware throwing a lot of time and money into ensuring their API delivers what customers want, and that a growing and helpful community has developed around this which allows a keen administrator to quickly learn and develop their skills with PowerCLI.

If you are a VMware administrator then PowerCLI is definitely something you should get involved with, and there has never been a better time to do this. I will look to publish links to some of my scripts at some point, as well as discuss some useful PowerCLI cmdlets.

Storage I/O Control – what to expect

Storage I/O Control, or SIOC, was introduced back in vSphere 4.1. It provides a way for vSphere to combat what is known as 'noisy neighbour' syndrome: the situation where multiple VMs reside on a single datastore, and one or more of these VMs take more than their fair share of bandwidth to the datastore. This could happen because a VM decides to misbehave, because of poor choices in VM placement, or because workloads have changed.

The guiding principle behind SIOC is fairness, allowing all VMs a chance to read and write without being swamped by one or more 'greedy' VMs. This is something which, in the past, would have been controlled by disk shares, and indeed that method can still be used to prioritise certain workloads on a datastore over others. The advantage of SIOC is that, other than the couple of configurable settings described below, no manual tinkering is really required.

Options available for Storage I/O Control

There are only two settings to pick for SIOC:

1) SIOC Enabled/Disabled – either turn SIOC on, or off, at the datastore level. More on considerations for this further down

2) Congestion Threshold – this is the trigger point at which SIOC will kick in and start doing its thing, throttling I/O to the datastore. This can be configured with one of two types of value:

a) Manual – this is set in milliseconds and defaults to 30ms, but the right value varies depending on your storage. VMware have tables on how to calculate this in their SIOC best practice guide, but the default should be fine for most situations. If in doubt, your storage provider should be able to give guidance on the correct value to choose.

b) Percentage of peak throughput – this is only available through the vSphere Web Client and was added in vSphere 5.1. It takes the guesswork out of setting the threshold, replacing it with an automated method whereby vSphere analyses the datastore's I/O capabilities and uses this to determine the peak throughput.
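
Both settings can also be applied in bulk with PowerCLI. A sketch, with a placeholder datastore naming pattern, and assuming the default 30ms manual threshold is appropriate for the storage in question:

# Sketch: enable SIOC with a 30ms congestion threshold on a group of datastores.
Get-Datastore -Name "VNX-Prod-*" |                   # placeholder name pattern
    Set-Datastore -StorageIOControlEnabled $true -CongestionThresholdMillisecond 30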

My experience of using SIOC is described in the following paragraphs: improvements were seen, and no negative performance impact was experienced (as expected), although some unexpected results did crop up.

Repeated latency warnings similar to the following were seen from multiple hosts, for multiple datastores across different storage systems:

Device naa.5000c5000b36354b performance has deteriorated. I/O latency increased from average value of 1832 microseconds to 19403 microseconds

These warnings report latency in microseconds, so in the above example latency is going from 1.8ms to 19ms, still a workable latency, but the rise is flagged due to the size of the increase (in this case a factor of ten). The results seen in the logs were often much worse than this though; sometimes latency was rising to as much as 20 seconds, and this was happening mostly in the middle of the night.

After checking the storage configuration, it was identified that Storage I/O Control was turned off across the board. It is disabled by default for all datastores and as such had been left that way. Turning SIOC on seemed like a sensible way forward, so the decision was taken to enable it on some of the worst-affected datastores.

After turning SIOC on for a handful of datastores, a good reduction in the number of I/O latency warnings being reported in the ESXi logs was seen. Unfortunately, a new message began to appear in the host event logs:

Non-VI workload detected on the datastore

This was repeatedly seen against the LUNs for which SIOC had been enabled. VMware have a knowledge base article which describes the issue. In this case, the problem stemmed from the fact that the storage backend providing the LUNs had a single disk pool (or MDisk group, as this was presented by an IBM SVC) which was shared with unmanaged RDMs and other storage presented outside the VMware environment.

The impact of this is that, whilst VMware plays nicely, throttling I/O when the congestion threshold is reached, other workloads, such as non-SIOC datastores, RDMs, or other clients of the storage pool, will not be so fair in their use of the available bandwidth. This is because the spindles presented are shared; one solution would be to present dedicated disk groups to VMware workloads, ensuring that all datastores carved out of these disks have SIOC turned on.

We use EMC VNX and IBM SVC as our storage of choice, and the recommendation from both vendors is to turn SIOC on for all datastores and leave it on. I can only imagine that the reason this is still not a default is that it is not suitable for every storage type. As with all these things, checking your storage vendor's documentation is probably the best option, but SIOC should provide benefit in most use cases, although, as described above, you may see some unexpected results. It is worth noting that this feature is Enterprise Plus only, so anyone running a less feature-packed version of vSphere will not be able to take advantage of it.

vNUMA CPU Alignment – doing it right

As I’ve said before, the majority of our VMware environment is running on Cisco UCS blade servers, and the majority of these are running dual hex-core CPUs. With the broad spectrum of Operating Systems and applications running across our many hundreds of VMs, there are inevitably many, many VMs with multiple vCPUs.


NUMA, in a nutshell, utilises the host CPU’s local memory bus to allow faster memory access time for workloads on that specific CPU. This is particularly important for latency sensitive workloads. vNUMA is VMware’s implementation of utilising NUMA to reduce memory latency for VMs running on ESXi.

When looking at CPU performance issues with a guest VM, no host-level contention for CPU resources existed. Through the VM performance graphs in the vSphere client, CPU Ready times could be seen to be spiking often; this was on a VM with multiple vCPUs, configured as 2 sockets x 8 cores (16 vCPU).

This shows the CPU Ready/Usage stats before and after re-aligning vCPU configuration from single core-multiple sockets, to multiple core-single socket; a marked difference
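
The same CPU Ready data can be pulled out with PowerCLI rather than eyeballing the graphs. A sketch, converting the raw ready summation (milliseconds per sample) into a percentage, assuming the 20-second realtime stats interval and a placeholder VM name:

# Sketch: CPU Ready percentage per realtime sample for one VM.
$vm = Get-VM -Name "busy-vm-01"                      # placeholder VM name
$intervalSeconds = 20                                # realtime stats interval

Get-Stat -Entity $vm -Stat "cpu.ready.summation" -Realtime -MaxSamples 180 |
    Where-Object { $_.Instance -eq "" } |            # "" = aggregate across all vCPUs
    Select-Object Timestamp,
                  @{N = "ReadyPercent"; E = { [math]::Round(($_.Value / ($intervalSeconds * 1000)) * 100, 2) } }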

 

This problem is fairly well documented, and is detailed in VMware KB 1026063; when utilising the NUMA features of VMware, it is important to configure the vCPU layouts for your VMs to align with the physical characteristics of your hosts where possible. This helps to ensure that the VM guest can be vMotioned to other hosts in the cluster with different physical CPU configurations. A better solution, where identical CPU configurations cannot be guaranteed across all your hosts, is to define all vCPUs as x sockets with a single core each.

In this case our physical CPU architecture is 2 sockets x 6 cores, and the VM was configured with 1 socket x 8 cores. This prevents the hypervisor and guest OS from cleanly utilising either one or both sockets in our physical host, so the VM misses out on the speed benefits which NUMA can bring. This can be exhibited as consistently high CPU Ready states for the VM guest.

There is a great VMware blog article about the performance implications of vNUMA design selection here, which echoes the VMware best practices guide in stating that the best way to approach this is to set your vCPU configuration 'flat and wide'. This means that if your VM requires 8 cores, then configure your vCPU layout as 8 sockets with 1 core each, rather than 2 quad-core or 4 dual-core sockets.

This allows the vNUMA technology to balance the load as it best sees fit and should prevent CPU Ready spikes. There are of course edge cases where specific software licensing may force you to use as few sockets as possible, in which case alignment to the physical host CPU should always be attempted; in the case above, the VM could have been configured as 1 socket x 6 cores, or 2 sockets x 6 cores. Be aware that stepping outside the 'flat and wide' model prevents vNUMA from doing its job and defers to your judgement of vCPU configuration; this means you had better have got it right! A sketch for reporting socket/core layouts across an estate follows.
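
A quick PowerCLI report like the one below can find existing VMs that stray from the flat-and-wide model, using the NumCoresPerSocket value exposed through ExtensionData; this is a sketch, so adjust the output to taste.

# Sketch: report each multi-vCPU VM's socket/core layout; FlatAndWide = $true means
# one core per socket, as recommended.
Get-VM | Where-Object { $_.NumCpu -gt 1 } | ForEach-Object {
    $coresPerSocket = $_.ExtensionData.Config.Hardware.NumCoresPerSocket
    [PSCustomObject]@{
        VM             = $_.Name
        vCPUs          = $_.NumCpu
        CoresPerSocket = $coresPerSocket
        Sockets        = $_.NumCpu / $coresPerSocket
        FlatAndWide    = ($coresPerSocket -eq 1)
    }
} | Sort-Object FlatAndWide, VM | Format-Table -AutoSize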

What are the problems in my environment?

I have worked as a SysAdmin for around 7 years, working across a wide range of technologies. My current role is a 3rd Line Server Infrastructure Engineer, specifically supporting VMware and Cisco UCS platforms providing IaaS for over 50 clients totalling around 1500 VMs.

The virtualised environment I support is almost exclusively running on vSphere 5.1; we have 5 different production vCenter servers, none of them linked, running on a variety of hardware. The majority of the compute runs on Cisco UCS blades, which are a challenge to manage in themselves, but more on that another time. Most of our storage is EMC VNX, with some IBM SVC-fronted kit thrown in for good measure, all running over Fibre Channel, and our network stack runs on Cisco Nexus switches, using the simultaneously great and terrible Cisco Nexus 1000v for our vDS.

I guess this is all pretty standard stuff, so what problems do I see on a daily basis? Well I plan to talk in this blog about the problems I see, how we can get around them, and what steps we can take to more effectively manage the issues thrown up by this infrastructure.