Transparent Page Sharing – for better or worse

Transparent Page Sharing (TPS) is one of the cornerstones in memory management in the ESXi hypervisor, this is one of the many technologies VMware have developed, which allows higher VM consolidation ratios on hosts through intelligent analysis of VM memory utilisation, and deduplication based on this analysis.

There are two types of TPS, intra-VM and inter-VM. The first scans the memory in use by a single VM and deduplicates common block patterns, this will lower the total host memory consumption of a VM, without significantly impacting VM performance. The second type; inter-VM TPS, does the same process, looking for common blocks across memory usage for all VMs on a given host.

Historically this has been a successful reclamation technique, and has led to decent savings in host memory consumption, however, most modern Operating Systems seen in virtualised environments (most Linux distributions, Windows Server 2008 onwards) now use memory encryption by default so the chance of the TPS daemon finding common blocks becomes less and less likely.

If CPU integrated hardware-assisted memory virtualisation features (AMD Rapid Virtualisation Indexing (RVI) or Intel Extended Page Tables (EPT)) are utilised for an ESXi host, then the hypervisor will use 2MB block size for its TPS calculations, rather than the normal 4KB block size. Attempting to deduplicate in 2MB chunks is far more resource intensive, and far less successful, than running the same process in 4KB chunks, thus ESXi will not attempt to deduplicate and share large memory pages by default.

The upshot of this is that 2MB pages are scanned, and 4KB blocks within these large pages  hashed in preparation for inducing memory sharing should the host come under memory contention, in an effort to prevent sharing. Pre-hashing these 4KB chunks will mean that TPS is able to quickly react, deduplicating and sharing the pages should the need arise.

All good so far, this technique should help us to save a bit of memory, although since memory virtualisation features in modern CPUs are widespread, and larger amounts of host memory more common, the potential TPS savings should hopefully never be needed or seen.

At the back end of last year, VMware announced that they would be disabling TPS by default in ESXi following an academic paper which showed a potential security vulnerability in TPS which, if exploited, could result in sensitive data being made available from one VM to another utilising memory sharing. It should be noted that the technique used to exploit this is in a highly controlled laboratory style condition, and requires physical access to the host in question, and that the researcher highlighting it never actually managed to glean any data in this method.

Despite the theoretical nature of the published vulnerability, VMware took the precautionary approach, and so in ESXi 5.0, 5.1, 5.5 and now 6.0, with the latest updates, TPS is now disabled by default. What does this mean for the enterprise though, and what choices do we have?

1.Turn TPS on regardless – if you have a dedicated internal-only infrastructure, then it may be that you do not care about the risks exposed by the research. If you and your team of system administrators are the only ones with access to your ESXi servers, and the VMs within, as is common in many internal IT departments, then there are likely far easier ways to get access to sensitive data than by utilising this theoretical technique anyway

2.Turn off TPS – if you are in a shared Service Provider style infrastructure, or in an environment requiring the highest security, then this should be a no-brainer. The integrity and security of the client data you have on your systems should be foremost in your organisations mind, and in the interests of precaution, and good practice, you should disable TPS on existing systems and leave it off

This were the options presented by VMware until about a week ago, when an article was published which described a third option being introduced in the next ESXi updates:

3. Use TPS with guest VM salting – this allows selective inter-VM TPS enablement among sets of selected VMs, allowing you to limit the potential areas of vulnerability, using a combination of edits to .vmx files, and advanced host settings. This may be a good middle ground if you are reliant on the benefits provided by TPS, but security policy demands that it not be in the previous default mode

So these are our options, and regardless of which one you choose, you need to know what the difference in your environment will be if you turn off TPS, this is going to be different for everyone. Your current savings being delivered by TPS can be calculated, and this should give you some idea of what the change in host memory utilisation will be following the disabling of TPS.

The quickest way to see this for an individual host is via the Performance tab in vSphere Client, if you look at real time memory usage, and select the ‘Active’, ‘Shared’ and ‘Shared Common’ counters then you will be able to see how much memory is consumed in total by your host, and how much of this is being saved through TPS:

TPS_vSphere_Client

Here we can see:

TPS %age saving = (Shared – Shared common) / (Consumed – Used by VMkernel) * 100% = (5743984 – 112448) / (160286356 – 3111764) * 100% = 5631536 / 157174592 * 100% = 3.58%

So TPS is saving around 5.6GB or 3.6% of total memory on the host being consumed by VMs. This is a marker of the efficiency of TPS.

The same figures can be taken from esxtop if you SSH to an ESXi host, run esxtop, and press ‘m’ to get to memory view.

TPS_esxtop

Here we are looking at the PSHARE value, we can see the saving is 5607MB (ties up with above from the vSphere Client), and the memory consumed by VMs can be seen under PMEM/other, in this case 153104MB. Again we can calculate the percentage saving TPS is giving us by dividing the saving by the active memory and multiplying by 100%:

TPS %age saving = PSHARE saving / PMEM other * 100% = 5607 / 153104 * 100% = 3.66%

So this is how we can calculate the saving for each host, but what if you have dozens, or hundreds of hosts in your environment, wouldn’t it be great to get these stats for all your hosts? Well, the easiest way to get this kind of information is usually through PowerCLI so I put the following script together:


# Ask for connection details, then connect using these
$vcenter = Read-Host "Enter vCenter Name or IP"


# Set up our constants for logging
$datetime = get-date -uformat "%C%y%m%d-%H%M"
$OutputFile = ".\" + $datetime + "_" + $vcenter + "_TPS_Report.csv"

# Connect to vCenter
$Connection = Connect-VIServer $vcenter

$myArray = @()

forEach ($Cluster in Get-Cluster) {
foreach($esxhost in ($Cluster | Get-VMHost | Where { ($_.ConnectionState -eq "Connected") -or ($_.ConnectionState -eq "Maintenance")} | Sort Name)) {
$vmdetails = "" | select hostname,clustername,memsizegb,memshavg,memshcom,tpssaving,percenttotalmemsaved,tpsefficiencypercent
$vmdetails.hostname = $esxhost.name
$vmdetails.clustername = $cluster.name
$hostmem = Get-VMHost $esxhost | Select -exp memorytotalgb
$vmdetails.memsizegb = "{0:N0}" -f $hostmem
$vmdetails.memshavg = [math]::Round((Get-VMhost $esxhost | Get-Stat -Stat mem.shared.average -MaxSamples 1 -Realtime | Select -exp value),2)
$vmdetails.memshcom = [math]::Round((Get-VMhost $esxhost | Get-Stat -Stat mem.sharedcommon.average -MaxSamples 1 -Realtime | Select -exp value),2)
$vmdetails.tpssaving = $vmdetails.memshavg-$vmdetails.memshcom
$vmdetails.percenttotalmemsaved = [math]::Round(([int]$vmdetails.tpssaving/([int]$vmdetails.memsizegb*1024*1024))*100,2)
$consumedmemvm = [math]::Round(((Get-VMhost $esxhost | Get-Stat -Stat mem.consumed.average -MaxSamples 1 -Realtime | Select -exp value)-(Get-VMhost $esxhost | Get-Stat -Stat mem.sysUsage.average -MaxSamples 1 -Realtime | Select -exp value)),2)
$vmdetails.tpsefficiencypercent = [math]::Round(([int]$vmdetails.tpssaving/$consumedmemvm)*100,2)
$myArray += $vmdetails
}
}
Disconnect-VIServer * -Confirm:$false

$myArray | Sort Name | Export-Csv -Path $Outputfile

This script will dump out a CSV with every host in your vCenter, and tell you the percentage of total host memory saved by TPS, and the efficiency of TPS in your environment. This should help to provide some idea of what the impacts of TPS being turned off will be.

Ultimately, your organisation’s security policies should define what to do after the next ESXi updates, and how you should act in the meantime, TPS is definitely a useful feature, and does allow for higher consolidation ratios, but security vulnerabilities should not be ignored. Hopefully this post will give you an idea of how TPS is currently impacting your infrastructure.

NetApp Cluster Mode Data ONTAP (CDOT) 8.3 Reversion to 8.2 7-Mode

A project came in at work to build out a couple of new NetApp FAS2552 arrays; this was to replace old FAS2020s for a customer who was using FCP in their Production datacenter, and iSCSI in their DR datacenter, with a semi-synchronous Snapmirror relationship between the two.

The new arrays arrived on site, and we set them up separate from the production network, to configure them. We quickly identified that the 2552s were running OnTap 8.3RC1, which is how they were sent to us out of the factory. Nobody had any experience with Cluster Mode Data ONTAP, but this didn’t seem too much of a challenge, as it did not seem hugely different.

After looking what to do next, it appeared that transitioning SAN volumes from 7-mode to Cluster Mode Data ONTAP is not possible, so the decision was taken to downgrade the OS from 8.3RC1, to 8.2 7-mode to make the transition of the customer’s data, and the downtime during switchover from old arrays to new, be as easy and quick as possible.

We got there in the end, but due to the tomes of documentation we had to trawl through, and tie together, I decided to document the process, to assist any would be future CDOT luddites in carrying out this task.

NOTE: This has not been tested on anything other than a FAS2552 with two controllers, and if you are in any way uncertain I would suggest contacting NetApp support for assistance. As this was a brand new array, and there was no risk of data loss, we proceeded regardless. You will need a NetApp support account to access some of the documentation and downloads referenced below. This is the way we completed the downgrade, not saying it is the best way, and although I have many years experience of working with NetApp arrays, this is just a guide.

  • Downloading and updating the boot image:

We decided on 8.2.3 for our boot image, this was the last edition of Data ONTAP with 7-mode included. If you go to http://mysupport.netapp.com/NOW/cgi-bin/software/ and select your array type you will see the available versions for your array. There are pages of disclaimers to agree to, and documents of pre-requisites and release notes for each version, these are worth reading to ensure there are no known issues with your array type. Eventually you will get the download, it will be a .tgz file.

You will now need a system with IP connectivity to both controllers, and use something like FileZilla Server to host the file via FTP. This will allow you to get the file up to the controller. I am not going to include steps to setup your FTP server, but there are plenty of resources online to do this. You could also host this via HTTP using something like IIS if that is more convenient.

Now to pull the image onto the array, this will need doing on both controllers (nodes), this document was followed, specifically the following command (based on content on page 143):

 system node image get -node localhost -package <location> -replace-package true - background true

I changed the command to replace ‘-node *’ with ‘-node localhost’ so we could download the image to each node in turn, this was just to ensure we could tackle any issues with the download. I also removed the ‘-background true’ switch, which would run the download in the background, this was to give us maximum visibility.

Now our cluster had never been properly configured, there are a bunch of checks to do at this point to ensure your node is ready for the reversion, these are all detailed in the above document and should be followed to make sure nothing is amiss. We ran through these checks prior to installing the newly downloaded image. This includes things

Once happy, the image can be installed by running:

system node image update -node localhost -package file:///mroot/etc/software/<image_name>

The image name will be the name of the .tgz file you downloaded to the controller earlier (including the extension).

Once the image is installed, you can check the state of the installation with:

 system image show

This should show something like:

Screen Shot 2015-02-12 at 20.41.26

This shows the images for one controller only, but shows us the image we are reverting to is loaded into the system, and we can move on.

There are some more steps in the document to follow, ensuring the cluster is shutdown, and failover is disabled before we can revert, follow these from the same document as above.

Next we would normally run ‘revert_to 8.2’ to revert the firmware. However, we had issues at this point because of the ADP (Advanced Drive Partitioning), which seems to mark the disks as in a shared container. It goes into the background here, in Dan Barber’s excellent article. Long story short, we decided to reboot and format the array again to get round this.

  • Re-zeroing the disks and building new vol0:

We rebooted the first controller, and saw that when it came back up it was running in 8.2.3 (yay) Cluster Mode (boo). We tried zeroing the disks and building a new vol0, by interrupting the boot sequence with Ctrl+C to get to the special boot menu, and then running option 4, this was no good for us though, because once built, the controller booted into 8.2.3 Cluster Mode, a new tactic would be required.

We found this blog post on Krish Palamadathil’s blog, which detailed how to get around this. The downloaded image contains both Cluster Mode and 7-Mode images, but boots into Cluster Mode by default when doing this reversion. Cutting to the chase, the only thing we needed to do was to get to the Boot Loader (Ctrl+C during reboot to abort the boot process), and then run the following commands:

 LOADER> set-defaults 
 LOADER> boot_ontap

We then saw the controller come up in 8.2.3 7-Mode, interrupted the boot sequence, and ran an option 4 to zero the disks again and build a new vol0

Happy to say that the array is now at the correct version and in a state where it can now be configured. As usual, the NetApp documentation was great, even if we had to source steps from numerous different places. As this is still a very new version of Data ONTAP I would expect this documentation to get better over time, in the meantime hopefully this guide can be of use to people.

Ballooning – the lowdown

VMware Tools does many things for us as administrators, automating much of the resource management and monitoring we need so we can have confidence in managing the sprawling clusters of a private cloud.

Below are some of the benefits VMware Tools affords us:

  • Driver optimisation
  • Power settings management
  • Time synchronisation
  • Advanced memory management
  • VM Heartbeating

VMware Tools is available for every supported guest Operating System, and with the exception of certain appliances (which often come with 3rd Party tools installed anyway), installation of VMTools should be common practice.

In this article I am going to talk about one specific memory management technique: ballooning. Ballooning is a method for reclaiming host memory in times of contention, this allows for more workloads to run on a host without resorting to using swapping.

The VMTools service/daemon runs as any other process does in an Operating System, and can request resources the same as any other processes. When memory contention is high on a host, and VM allocated memory would likely need to be swapped to disk (the .vswp file, see here for more information on that), the hypervisor will send a request to the VMTools process running on its guest VMs to try and reclaim memory.

Often, software running in an OS will not release memory when it is done with it, this can lead to memory being tied up in the OS, and can therefore not be released to the hypervisor to return to the pool of available host memory.

When the VMTools process receives the signal, it will request up to 65% of the host memory be relinquished to it, at which time it will release the memory back to the hypervisor. Because of the way we allocate memory to VMs (vRAM), the VM often has more memory available to it than it requires and as such, most VMs will have memory which can be returned.

Ballooning is a normal part of the automated memory management processes which ESXi provides, high amounts of ballooning do not usually indicate a problem, although large amounts of ballooning activity could tell you that memory is overcommitted.

The only time ballooning can cause a problem is when the ballooning driver reclaims memory pages required by the guest OS, in this case it can lead to swapping which could lead to performance degradation.

tl;dr – ballooning is a normal part of vSphere’s memory management, assisting in pushing up consolidation ratios. Don’t worry about it, but it is good to be aware of what it does, and how it works.

VM Swap File location considerations

When a VM is powered on, a .vswp (Virtual Machine swap) file is created (note there is also a vmx swap file which gets created in the same location as the VM, this is seperate from this discussion, but is described here), its size is the memory allocation for the VM less any reservation configured. If there is not sufficient space in the configured swap file location to create this file then the VM will not power on. The use of this file for memory pages is a last resort, and will be considerably slower than using normal memory, even if this is compressed or shared. It will always be possible that you get in a situation where memory contention is occurring, and that the use of the swap file begins, to prepare for this a system design should consider the location of swap files for VMs. Below I discuss some of the considerations which should be made when placing a VM swap file:

  • Default location for swap file is to store it in the same datastore as the VM, this presents the following problems:
    • Performance – it is unlikely that the datastore the VM sits in is on top tier storage, or limited to a single VM. This means that the difference in speed between memory IO and the swapping IO once contention occurs will be great, and that the additional IO this swapping produces could well impact other workloads on the datastore. If there are multiple VMs sharing this datastore, and all running on the host with memory contention issues, then this will be further compounded and could see the datastore performance plummet
    • Capacity – inevitably, administrators will keep chucking workloads into a datastore, and unless Storage DRS with datastore clusters is being used, or the administrators are pro-active in balancing storage workloads, there will come a time when a VM will not power up due to insufficient space to create the .vswp file. This is particularly likely after a change to the VM configuration such as adding more disk or memory

VM swap file location can be changed either at the cluster, or host level. When choosing to move this from the default, the following should be considered:

  • Place swap files on the fastest storage possible – if you can place this on flash storage then fantastic, this will not be as quick as paging to/from memory, but it will be many magnitudes better than placing it on spinning disk
  • Place swap files as close to the host as possible – the latency incurred by traversing your SAN/IP network to get to shared storage will all impair guest performance when swapping occurs. Be aware that although the default location can be changed to host local storage (which will probably give the best performance of the host has internal flash storage), this will impair vMotion performance massively, as the entire .vswp file would need to be copied from the source host to the destination host’s disk during the vMotion activity
  • Do not place the .vswp on replicated storage – as with location selection for guest OS swap files, there is no point on placing the file on replicated storage; if the VM is reset or powered off then this file is deleted. If your VMs are on storage which is replicated as part of its standard capability then the .vswp files should definitely be located elsewhere

 

In terms of configuring the location, as stated above, this is set at either a VM, host or cluster level, if this is inconsistent across hosts in a cluster then again this may impact vMotion times as the VM migrates from a host with one configured location to another with a different location. As with most settings which can be made at the cluster level, consistency should be maintained across the cluster unless this is not possible. Bear in mind though, that having vswp consistent across the cluster, and defined to be a single datastore, could lead to high IOPS on this datastore should cluster wide memory contention occur., especially with large clusters.

As stated at the beginning of this article, swap files are sized based on VM memory allocation less reservations size. By right sizing VMs, and utilising reservations, swap file sizes, and usage, can be kept to a minimum, and these planning considerations should take precedence over all others. Hopefully memory contention will never be so bad that swap will be required, but when the day does come it is good to be prepared, by making informed, and reasoned decisions early on.

VMFS Extents – to extend or not to extend, that is the question

Extents allow disk presented to a vSphere system to be added to VMFS datastore to extend the file system, this aggregates multiple disks together and can be useful in a number of scenarios. Recently I saw problems where extents were being used spanning two storage systems; one of the storage systems had a controller failure which caused SCSI reservation issues on one of the LUNs making up the extent and this caused the entire datastore to go offline.

In this article I want to discuss some of the benefits and potential pitfalls in using VMFS extents in vSphere environments. Ultimately this is an available, supported, and sometimes useful feature of vSphere but there are some limitations or weaknesses that using this can bring.

Advantages:

  • Using extents allows you to create datastores up to the maximum supported by vSphere for pre-VMFS-5 datastores. It can be useful to create large datastores for the following reasons:
    • There may be a requirement to natively present a VMDK which is larger than the maximum LUN size available on your storage system. For example, if 2TB is the largest LUN you can present, but you need a 4TB disk for the application your VM is hosting, the aggregation of disks will allow the creation of a VMFS datastore large enough to deliver this without the need to span volumes in the guest OS, or the need to fallback to using something like RDMs which may impinge on other vSphere functionality
    • Datastore management will be simplified with fewer VMFS datastores required. The fewer datastores available, the less an administrator has to keep their eyes on. In addition to this, decisions made in placing VMs is made considerably simpler if there are fewer choices
  • Adding space to a datastore with capacity issues; in a previous role we were constrained by storage space more than any other resource, this meant that on both the storage system (NetApp FAS2050 with a single shelf of storage), and at the VMFS level, the design left little to no room to extend a VMDK should it be required. If we did need to add space to VMDKs, we had to extend the volume and LUN by the required amount on the filer, and add a small extent to the datastore in vSphere

Disadvantages:

  • Introduces a single point of failure; whether you are aggregating disks from one or multiple storage systems, by adding extents to a volume the head extent in the aggregated datastore (the first LUN added to the datastore) becomes a single point of failure, if any of the LUNs should become unavailable then VMs which have any blocks whatsoever on the lost LUN will no longer be available
  • Management from the storage side can become more difficult, given that there may be multiple LUNs, from multiple storage systems now aggregated to form a single datastore, from a storage side it is harder to identify which LUNs relate to which datastores in vSphere, to combat this it is important to document the addition of extents well, and label LUNs accordingly on the storage system
  • If extents are combined which span different storage devices then there may well be a loss in performance

The above is all just based on my experiences, but it seems there are legitimate use cases for choosing to use, or not use extents. My personal preference would be to present a new larger LUN where possible, formatting this in VMFS, and using Storage vMotion to migrate VMs to the new datastore. Given that since VMFS-5 introduced GPT as the partitioning method for LUNs, we can now create single extent datastores up to 64TB in size, the requirement for using extents should be diminished. There are often legitimate reasons, especially in older environments, why this is not practical or possible however, and in these cases using extents is perfectly valid.

PowerCLI – getting on the road

Powershell is a tool I have been interested in for a number of years, unfortunately in my previous role I was limited to using VBscript for automation due to running on a purely Windows Server 2003 environment, and one sans Powershell v1.

When I started in my current role, I was excited to start using Powershell to replace VBscript as my primary scripting language.I have found a number of resources which are great for learning Powershell, not least of which is the Powershell In A Month Of Lunches book by Don Jones. I started reading but ended up learning through just getting my hands dirty.

So, I rolled up my sleeves, and started using Powershell to do small things. Once I felt a little more comfortable with it, I decided to download VMware PowerCLI and give that a go. Never did I realise that I needed a tool so badly until I had it, I was blown away by what PowerCLI allowed me to do.Since then I have built up a good library of scripts which allow me to report on, and automate some of the daily VMware grind I experience in my work. Examples of scripts I have created, which provide functionality not available natively in vSphere are:

  • A script to go through a vCenter instance, identifying the Path Selection Policy for all disks, flagging and offering to correct where these are not correct
  • Scripts to list all VMs with RDMs on them, this helped us to plan for patching our hosts
  • A daily check script which exports all vCenter alarms to an HTML page to allow alarms across multiple vCenters to be checked quickly

I decided to list some of the resources which can really help to get going with PowerCLI, hopefully this will assist someone looking to learn to use this toolset:

  • Powershell In A Month Of Lunches book by Don Jones – this book is great for getting the basics of Powershell down
  • Notepad++ – there are other tools available, some recommend Quest’s PowerGUI, but this is what I use for PowerCLI scripting day to day. It is only available on Windows but if you are using PowerCLI then you are using Windows so this shouldn’t be too much of a problem
  • VMware’s PowerCLI Documentation site – this site is invaluable for the explanations of specific cmdlets and the available parameters
  • VMware vSphere PowerCLI Reference by Luk Dekens and Alan Renouf – this was the first book released on PowerCLI and has some great examples of what you can do with the tool

I was watching this interesting interview of Alan Renouf by Mike Laverick on his Chinwag Reloaded podcast, Alan now works on the PowerCLI team at VMware and has been a key player in the community of PowerCLI since its inception. I thought I would include some of this information here to show the developments to PowerCLI which are around the corner.

Here they discuss a new feature relating to PowerCLI (and available here), which started as a VMware fling and has been developed by the internal team responsible for the vSphere Web Client. The primary new feature is called ‘PowerActions’ and it will allow your library of scripts to be stored within vSphere, and make them accessible to other administrators using the Web Client as you desire. This was of interest to me as I have been looking at git style repositories to allow for version control and peer review which our company can use to store, share, and manage our ever growing library of PowerCLI scripts.

Further to this, and exciting for people running an OS other than Windows, or in an environment where installation of PowerCLI is not possible, is that a shell will be available through the Web Client. This will be a great help when troubleshooting and wanting to run some of the more useful PowerCLI commands without having to fire up and connect your usual shell.

There is even more new stuff coming as well; the ability to create new menu items in the Web Client which will run your scripts and return the output in a message box, this will mean that for common tasks like reporting or bespoke automation not present in vSphere, you can present these simply to other administrators or users of vSphere and they can utilise the additional functionality these provide without having to have any knowledge of PowerCLI.

Where other vendors have provided limited API integration through their own PowerShell libraries, it is great to see VMware throwing a lot of time and money into ensuring their API delivers what customers want, and that a growing and helpful community has developed around this which allows a keen administrator to quickly learn and develop their skills with PowerCLI.

If you are a VMware administrator then PowerCLI is definitely something you should get involved with, and there has never been a better time to do this. I will look to publish links to some of my scripts at some point, as well as discuss some useful PowerCLI cmdlets.

Storage I/O Control – what to expect

Storage I/O Control, or SIOC, was introduced into vSphere back in vSphere 4.1, it provides a way for vSphere to combat what is known as the ‘noisy neighbour’ syndrome. This describes the situation where multiple VMs reside on a single datastore, and one or more of these VMs take more than their fair share of bandwidth to the datastore. This could be happening because a VM decides to misbehave, because of poor choices in VM placement, or because workloads have changed.

The reigning principle behind SIOC is one of fairness, allowing all VMs a chance to read and write without being swamped by one or more ‘greedy’ VMs. This is something which, in the past, would have been controlled by disk shares, and indeed this method can still be used to prioritise certain workloads on a datastore over others. The advantage with SIOC is that, other than the couple of configurable settings, described below, no manual tinkering is really required.

Options available for Storage I/O Control
Options available for Storage I/O Control

There are only two settings to pick for SIOC:

1) SIOC Enabled/Disabled – either turn SIOC on, or off, at the datastore level. More on considerations for this further down

2) Congestion Threshold – this is the trigger point at which SIOC will kick in and start doing its thing, throttling I/O to the datastore. This can be configured with one of two types of value:

a) Manual – this is set in milliseconds and this defaults at 30ms, but is variable depending on your storage. VMware have tables on how to calculate this in their SIOC best practice guide, but the default should be fine for most situations. If in doubt then your storage provider should be able to give guidance on the correct value to choose.

b) Percentage of peak throughput – this is only available through the vSphere Web Client, and was added in vSphere 5.1, this takes the guess work out of setting the threshold, replacing it with an automated method for vSphere to analyse the datastore I/O capabilities and use this to determine the peak throughput.

My experience of using SIOC is described in the following paragraphs, improvements were seen, and no negative performance experienced (as expected), although some unexpected results were received.

Repeated latency warnings similar to the following from multiple hosts were seen, for multiple datastores across different storage systems:

Device naa.5000c5000b36354b performance has deteriorated. I/O latency increased from average value of 1832 microseconds to 19403 microseconds

These warnings report the latency time in microseconds, so in the above example, the latency is going from 1.8ms to 19ms, still a workable latency, but the rise is flagged due to the large increase (in this case by a factor of ten). The results seen in the logs were much worse than this though, sometimes latency was rising to as much as 20 seconds, this was happening mostly in the middle of the night

After checking out the storage configuration, it was identified that Storage I/O Control was turned off across the board. This is set to disabled by default for all datastores and as such, had been left as was. Turning SIOC on seemed like a sensible way forward so the decision was taken to proceed in turning it on for some of the worst affected datastores.

After turning on SIOC on a handful of datastores, a good reduction in the number of I/O latency doublings being reported in the ESXi logs was seen. Unfortunately a new message began to flag in the host events logs:

Non-VI workload detected on the datastore

This was repeatedly seen against the LUNs for which SIOC had been enabled, VMware have a knowledge base article for this which describes the issue. In this case, the problem stemmed from the fact that the storage backend providing the LUNs had a single disk pool (or mDisk Group, as this was presented by an IBM SVC) which was shared with unmanaged RDMs, and other storage presented outside the VMware environment.

The impact of this is that, whilst VMware plays nicely, throttling I/O access when threshold congestion is reached, other workloads such as non-SIOC datastores, RDMs, or other clients of the storage group, will not be so fair in their usage of the available bandwidth. This is due to the spindles presented being shared, one solution to this would be to present dedicated disk groups to VMware workloads, ensuring that all datastore carved out of these disks have SIOC turned on.

We use EMC VNX, and IBM SVC as our storage of choice, recommendations from both these vendors is to turn SIOC on for all datastores, and to leave it on. I can only imagine that the reason this is still not a default is because it is not suitable for every storage type. As with all these things, checking storage vendor documentation is probably the best option, but SIOC should provide benefit in most use cases, although as described above, you may see some unexpected results. It is worth noting that this feature is Enterprise Plus only, so anyone running a less feature packed version of vSphere will not be able to take advantage of this feature.