Storage I/O Control – what to expect

Storage I/O Control, or SIOC, was introduced back in vSphere 4.1. It provides a way for vSphere to combat what is known as the ‘noisy neighbour’ syndrome: the situation where multiple VMs reside on a single datastore, and one or more of these VMs take more than their fair share of bandwidth to the datastore. This can happen because a VM misbehaves, because of poor choices in VM placement, or because workloads have changed.

The guiding principle behind SIOC is fairness: every VM gets a chance to read and write without being swamped by one or more ‘greedy’ VMs. In the past this would have been controlled with disk shares, and that method can still be used to prioritise certain workloads on a datastore over others, as the sketch below shows. The advantage of SIOC is that, beyond the two configurable settings described below, no manual tinkering is really required.
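For reference, here is a minimal sketch of that older disk-shares approach, using pyVmomi (the vSphere Python SDK). The vCenter address, credentials, and VM name are placeholders, and the field names follow the vSphere API's StorageIOAllocationInfo and SharesInfo objects; treat it as illustrative rather than definitive.

# Hedged sketch: raise the per-disk shares of one VM, the pre-SIOC way
# of prioritising a workload on a datastore. Host, credentials, and the
# VM name 'important-vm' are placeholders.
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab only; verify certificates in production
si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                  pwd='secret', sslContext=ctx)

view = si.content.viewManager.CreateContainerView(
    si.content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == 'important-vm')
view.DestroyView()

# Edit the first virtual disk: 'high' is double the default 'normal' share count.
disk = next(d for d in vm.config.hardware.device
            if isinstance(d, vim.vm.device.VirtualDisk))
disk.storageIOAllocation.shares = vim.SharesInfo(level='high')

change = vim.vm.device.VirtualDeviceSpec(operation='edit', device=disk)
vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[change]))

The drawback, of course, is exactly what SIOC removes: shares have to be set and maintained by hand, per VM.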

Options available for Storage I/O Control

There are only two settings to pick for SIOC:

1) SIOC Enabled/Disabled – turn SIOC on or off at the datastore level. More on the considerations around this further down.

2) Congestion Threshold – this is the trigger point at which SIOC will kick in and start doing its thing, throttling I/O to the datastore. This can be configured with one of two types of value:

a) Manual – this is set in milliseconds and defaults to 30ms, but the right value varies depending on your storage. VMware have tables on how to calculate this in their SIOC best practice guide, but the default should be fine for most situations. If in doubt, your storage provider should be able to give guidance on the correct value to choose.

b) Percentage of peak throughput – this is only available through the vSphere Web Client, and was added in vSphere 5.1. It takes the guesswork out of setting the threshold, replacing it with an automated method: vSphere analyses the datastore's I/O capabilities and uses this to determine the peak throughput.
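Both settings can also be driven programmatically. The sketch below reuses the 'si' connection from the earlier disk-shares example; the datastore name is a placeholder, and the type and method names assume pyVmomi's mapping of the vSphere API's StorageResourceManager.ConfigureDatastoreIORM_Task, so verify them against your SDK version.

# Hedged sketch: enable SIOC on one datastore, using the automatic
# (percentage-of-peak-throughput) threshold mode added in vSphere 5.1.
# 'datastore01' is a placeholder name.
view = si.content.viewManager.CreateContainerView(
    si.content.rootFolder, [vim.Datastore], True)
ds = next(d for d in view.view if d.name == 'datastore01')
view.DestroyView()

spec = vim.StorageResourceManager.IORMConfigSpec()
spec.enabled = True
spec.congestionThresholdMode = 'automatic'   # or 'manual', together with:
# spec.congestionThreshold = 30              # milliseconds; the manual default

si.content.storageResourceManager.ConfigureDatastoreIORM_Task(
    datastore=ds, spec=spec)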

My experience of using SIOC is described in the following paragraphs. Improvements were seen, and no negative performance impact was experienced (as expected), although there were some unexpected results along the way.

Repeated latency warnings similar to the following were seen from multiple hosts, for multiple datastores, across different storage systems:

Device naa.5000c5000b36354b performance has deteriorated. I/O latency increased from average value of 1832 microseconds to 19403 microseconds

These warnings report the latency in microseconds, so in the example above the latency rises from 1.8ms to 19ms. That is still a workable latency, but the rise is flagged because of the large increase (in this case roughly a factor of ten). The figures seen in the logs were often much worse than this: latency sometimes rose to as much as 20 seconds, mostly in the middle of the night.
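If you want to sift these warnings out of the logs in bulk, the arithmetic is easy to script. The sketch below is plain Python; the message format is assumed from the sample line quoted above, so adjust the pattern if your builds log it differently.

# Parse the vmkernel latency warning and express the jump in milliseconds.
import re

PATTERN = re.compile(
    r'Device (?P<device>\S+) performance has deteriorated\. '
    r'I/O latency increased from average value of (?P<old>\d+) microseconds '
    r'to (?P<new>\d+) microseconds')

line = ('Device naa.5000c5000b36354b performance has deteriorated. '
        'I/O latency increased from average value of 1832 microseconds '
        'to 19403 microseconds')

m = PATTERN.search(line)
if m:
    old_ms = int(m.group('old')) / 1000.0
    new_ms = int(m.group('new')) / 1000.0
    print(f"{m.group('device')}: {old_ms:.1f}ms -> {new_ms:.1f}ms "
          f"({new_ms / old_ms:.1f}x increase)")
# prints: naa.5000c5000b36354b: 1.8ms -> 19.4ms (10.6x increase)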

After checking the storage configuration, it was identified that Storage I/O Control was turned off across the board. It is disabled by default for all datastores and as such had been left alone. Turning SIOC on seemed like a sensible way forward, so the decision was taken to enable it for some of the worst-affected datastores.
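A quick read-only audit like the one below can show where you stand before enabling anything. It again reuses the 'si' connection from the earlier sketches, and the property names assume the vSphere API's StorageIORMInfo object as exposed by pyVmomi.

# Hedged, read-only sketch: list the SIOC state of every datastore
# vCenter knows about.
view = si.content.viewManager.CreateContainerView(
    si.content.rootFolder, [vim.Datastore], True)
for ds in view.view:
    iorm = ds.iormConfiguration
    if iorm is None:        # datastore type with no IORM support
        continue
    state = 'enabled' if iorm.enabled else 'disabled'
    print(f'{ds.name}: SIOC {state}, threshold {iorm.congestionThreshold}ms')
view.DestroyView()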

After turning SIOC on for a handful of datastores, a good reduction in the number of I/O latency warnings reported in the ESXi logs was seen. Unfortunately, a new message began to appear in the host event logs:

Non-VI workload detected on the datastore

This was seen repeatedly against the LUNs for which SIOC had been enabled. VMware have a knowledge base article which describes the issue. In this case, the problem stemmed from the fact that the storage backend providing the LUNs had a single disk pool (or mDisk Group, as this was presented by an IBM SVC) which was shared with unmanaged RDMs and with other storage presented outside the VMware environment.

The impact of this is that, whilst VMware plays nicely, throttling I/O access when the congestion threshold is reached, other workloads such as non-SIOC datastores, RDMs, or other clients of the storage group will not be so fair in their usage of the available bandwidth, because the spindles presented are shared. One solution to this would be to present dedicated disk groups to VMware workloads, ensuring that all datastores carved out of these disks have SIOC turned on.

We use EMC VNX and IBM SVC as our storage of choice, and the recommendation from both vendors is to turn SIOC on for all datastores and leave it on. I can only imagine that the reason this is still not the default is that it is not suitable for every storage type. As with all these things, checking your storage vendor's documentation is the best option, but SIOC should provide benefit in most use cases, although, as described above, you may see some unexpected results. It is worth noting that this feature is Enterprise Plus only, so anyone running a less feature-packed version of vSphere will not be able to take advantage of it.
