Incorrectly Reported Separated Network Partitions in VSAN Cluster

I’ve been playing around with VSAN, automating the build of a 3 node Management cluster using ESXi 6.0 Update 1. I came across an issue where I moved one of my hosts to another cluster and then back into the VSAN cluster; when it came back it showed as a separate network partition, and had a separate VSAN datastore.

The VSAN Disk Management page under my cluster in the Web Client showed that the Network Partition Group was different for this host to my other two hosts, despite the network being absolutely fine.

Turned out that the host had not rejoined the VSAN cluster, but had created its own 1-node cluster. I resolved this by running the following commands:

On the partitioned host:

esxcli vsan cluster get

Cluster Information

   Enabled: true

   Current Local Time: 2016-09-21T10:23:35Z

   Local Node UUID: 57e0040c-83a9-add9-ec1f-0cc47ab46218

   Local Node Type: NORMAL

   Local Node State: MASTER

   Local Node Health State: HEALTHY

   Sub-Cluster Master UUID: 57e0040c-83a9-add9-ec1f-0cc47ab46218

   Sub-Cluster Backup UUID:

   Sub-Cluster UUID: 3451e257-cedd-8772-4b31-0cc47ab460e8

   Sub-Cluster Membership Entry Revision: 0

   Sub-Cluster Member Count: 1

   Sub-Cluster Member UUIDs: 57e0040c-83a9-add9-ec1f-0cc47ab46218

   Sub-Cluster Membership UUID: 9c5fe257-e053-7716-ca0a-0cc47ab46218

This shows the host in a single-node cluster.

On a surviving host:

esxcli vsan cluster get

Cluster Information

   Enabled: true

   Current Local Time: 2016-09-21T11:14:55Z

   Local Node UUID: 57e006b6-71ab-c8f6-7d1d-0cc47ab460e8

   Local Node Type: NORMAL

   Local Node State: MASTER

   Local Node Health State: HEALTHY

   Sub-Cluster Master UUID: 57e006b6-71ab-c8f6-7d1d-0cc47ab460e8

   Sub-Cluster Backup UUID: 57e0f22f-3071-fe1a-fd8e-0cc47ab460ec

   Sub-Cluster UUID: 57e0040c-83a9-add9-ec1f-0cc47ab46218

   Sub-Cluster Membership Entry Revision: 0

   Sub-Cluster Member Count: 2

   Sub-Cluster Member UUIDs: 57e0f22f-3071-fe1a-fd8e-0cc47ab460ec, 57e006b6-71ab-c8f6-7d1d-0cc47ab460e8

   Sub-Cluster Membership UUID: 3451e257-cedd-8772-4b31-0cc47ab460e8

This showed me there were only 2 nodes in the cluster. We will use the Sub-Cluster UUID from here in a moment.

On the partitioned host:

esxcli vsan cluster leave

esxcli vsan cluster join -u 57e0040c-83a9-add9-ec1f-0cc47ab46218

esxcli vsan cluster get

Cluster Information

   Enabled: true

   Current Local Time: 2016-09-21T10:24:26Z

   Local Node UUID: 57e0040c-83a9-add9-ec1f-0cc47ab46218

   Local Node Type: NORMAL

   Local Node State: AGENT

   Local Node Health State: HEALTHY

   Sub-Cluster Master UUID: 57e006b6-71ab-c8f6-7d1d-0cc47ab460e8

   Sub-Cluster Backup UUID: 57e0f22f-3071-fe1a-fd8e-0cc47ab460ec

   Sub-Cluster UUID: 57e0040c-83a9-add9-ec1f-0cc47ab46218

   Sub-Cluster Membership Entry Revision: 1

   Sub-Cluster Member Count: 3

   Sub-Cluster Member UUIDs: 57e0f22f-3071-fe1a-fd8e-0cc47ab460ec, 57e006b6-71ab-c8f6-7d1d-0cc47ab460e8, 57e0040c-83a9-add9-ec1f-0cc47ab46218

   Sub-Cluster Membership UUID: 3451e257-cedd-8772-4b31-0cc47ab460e8

Now we see all three nodes back in the cluster. The data will take some time to rebuild on this node, but once done, the VSAN health check should show as Healthy, and there should be a single VSAN datastore spanning all hosts.
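The check I did by eye above can also be scripted. The following is a minimal Python sketch, assuming you have already captured the text output of esxcli vsan cluster get from a host (for example over SSH). The function names are my own for illustration and are not part of any VMware tooling.

```python
def parse_vsan_cluster_get(output):
    """Turn the 'Key: value' lines of 'esxcli vsan cluster get' into a dict."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so timestamps in the value survive.
        key, sep, value = line.strip().partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def member_uuids(info):
    """Return the list of node UUIDs the host believes are in its cluster."""
    raw = info.get("Sub-Cluster Member UUIDs", "")
    return [uuid.strip() for uuid in raw.split(",") if uuid.strip()]


def is_partitioned(info, expected_hosts):
    """True if the host sees fewer members than the cluster should have."""
    return len(member_uuids(info)) < expected_hosts


# Trimmed sample taken from the partitioned host's output above.
sample = """Cluster Information
   Enabled: true
   Local Node State: MASTER
   Sub-Cluster Backup UUID:
   Sub-Cluster Member Count: 1
   Sub-Cluster Member UUIDs: 57e0040c-83a9-add9-ec1f-0cc47ab46218
"""

info = parse_vsan_cluster_get(sample)
print(member_uuids(info))
print(is_partitioned(info, expected_hosts=3))
```

Running this against each host in turn would flag any host whose member list is shorter than the number of hosts you expect in the cluster.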

In the zone…basic zoning on Cisco Nexus switches

In this post I go over some basic FCoE zoning concepts for Cisco Nexus switches. Although FCoE has not really captured the imagination of the industry, it is used in a large number of Cisco infrastructure deployments, particularly around Nexus and UCS technologies. My experience is mostly based on FlexPods where we have this kind of design (this shows FCoE connectivity only, to keep things simple):

[Diagram: FlexPod design, FCoE connectivity only]

Zoning in a FlexPod is simple enough: we may have a largish number of hosts, but we are only zoning on two switches, and only have 4 or 8 FCoE targets, depending on our configuration. In fact the zoning configuration can be fairly simply automated using PowerShell by tapping into the NetApp, Cisco UCS, and NX-OS APIs. The purpose of this post, though, is to describe the configuration steps required to complete the zoning.

The purpose of zoning is to restrict access to the target ports on our storage, and so to the LUNs (Logical Unit Numbers, essentially Fibre Channel block devices) behind them, to one or more access devices, or hosts. This is useful in the case of boot disks, where we only ever want a single host accessing that device, and in the case of shared data devices, like cluster shared disks in Microsoft Clustering, or VMFS datastores in the VMware world, where we only want a subset of hosts to be able to access the device.

I found configuring zoning on a Cisco switch took a bit of getting my head around, so hopefully the explanation below will help to make this simpler for someone else.

From the Cisco UCS (or any server infrastructure you are running), you will need to gather a list of the WWPNs for the initiators wanting to connect to the storage. These will be in the format of 50:02:77:a4:10:0c:4e:21, this being an 8 byte number written as 16 hexadecimal digits. Likewise, you will need to gather the WWPNs from your storage (in the case of FlexPod, your NetApp storage system).
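WWPNs get copied around in several styles (with or without colons, upper or lower case), so it can help to normalise them before they go anywhere near a switch. Here is a small Python sketch of that idea; the function name is hypothetical, and it only checks the shape of the value (16 hexadecimal digits), not whether the WWPN actually exists on your fabric.

```python
import re


def normalise_wwpn(raw):
    """Return a WWPN as lower-case, colon-separated byte pairs.

    Accepts input with colons, dashes, or no separators at all.
    Raises ValueError if the value is not exactly 16 hex digits.
    """
    digits = re.sub(r"[:\-\s]", "", raw).lower()
    if not re.fullmatch(r"[0-9a-f]{16}", digits):
        raise ValueError(f"not a valid WWPN: {raw!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 16, 2))


print(normalise_wwpn("50:02:77:A4:10:0C:4E:21"))
print(normalise_wwpn("500277a4100c4e21"))
```

Both calls print the same colon-separated, lower-case form, which is the format used in the tables and commands in this post.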

Once we have these, we are ready to do our zoning on the switch. When doing the zoning configuration there are three main elements of configuration we need to understand:

  1. Aliases – these match the WWN to a friendly name, and sit in the device alias database on the switch. You can get away without using this, and just use the native WWNs later, but this will make things far more difficult should something go wrong. So basically these just match WWNs to devices.
  2. Zones – these logically group initiators and targets together, meaning that only the device aliases listed in the zone are able to talk to one another. This provides security and ease of management. A device can exist in more than one zone.
  3. Zonesets – this groups together zones, allowing the administrator to bring all the zones online or offline together. Only one zoneset can be active at a time.

On top of this, there is one more thing to understand when creating zoning on our Nexus switch, and that is the concept of a VSAN. A VSAN, or Virtual Storage Area Network, is the Fibre Channel equivalent of a VLAN. It is a logical collection of ports which together form a single discrete fabric.

So let’s create a fictional scenario, and create the zoning configuration for this. We have a FlexPod with two Nexus 5k switches, with separate fabrics, as shown in the diagram above, meaning that our 0a ports on the servers only go to Nexus on fabric A, and 0b ports only go to Nexus on fabric B. Both of our e0c ports on our NetApp storage go to NexusA, and both our e0d ports go to NexusB:

NAFAS01 – e0c, e0d

NAFAS02 – e0c, e0d

We also have 3 Cisco UCS service profiles, each with two vHBAs, wanting to access storage on these targets. These are created as follows:

UCSServ01 – 0a, 0b

UCSServ02 – 0a, 0b

UCSServ03 – 0a, 0b

So on NexusA, we need the following aliases in our database:

Device     Port  WWPN                     Alias Name
NAFAS01    e0c   35:20:01:0c:11:22:33:44  NAFAS01_e0c
NAFAS02    e0c   35:20:02:0c:11:22:33:44  NAFAS02_e0c
UCSServ01  0a    50:02:77:a4:10:0c:0a:01  UCSServ01_0a
UCSServ02  0a    50:02:77:a4:10:0c:0a:02  UCSServ02_0a
UCSServ03  0a    50:02:77:a4:10:0c:0a:03  UCSServ03_0a

And on NexusB, we need the following:

Device     Port  WWPN                     Alias Name
NAFAS01    e0d   35:20:01:0d:11:22:33:44  NAFAS01_e0d
NAFAS02    e0d   35:20:02:0d:11:22:33:44  NAFAS02_e0d
UCSServ01  0b    50:02:77:a4:10:0c:0b:01  UCSServ01_0b
UCSServ02  0b    50:02:77:a4:10:0c:0b:02  UCSServ02_0b
UCSServ03  0b    50:02:77:a4:10:0c:0b:03  UCSServ03_0b

And the zones we need on each switch are, firstly for NexusA:

Zone Name    Members
UCSServ01_a  NAFAS01_e0c, NAFAS02_e0c, UCSServ01_0a
UCSServ02_a  NAFAS01_e0c, NAFAS02_e0c, UCSServ02_0a
UCSServ03_a  NAFAS01_e0c, NAFAS02_e0c, UCSServ03_0a

And for NexusB:

Zone Name    Members
UCSServ01_b  NAFAS01_e0d, NAFAS02_e0d, UCSServ01_0b
UCSServ02_b  NAFAS01_e0d, NAFAS02_e0d, UCSServ02_0b
UCSServ03_b  NAFAS01_e0d, NAFAS02_e0d, UCSServ03_0b

This gives us a zone for each server to boot from, allowing that vHBA on the server to boot from either of the NetApp interfaces that it will be able to see on its fabric. The boot order itself will be controlled from within UCS. By creating zoning for the server to boot on either fabric, we create resilience. All of this is just to demonstrate how we construct the zoning configuration, so things will no doubt be different in a different environment.

So now we know what we should have in our populated alias database and our zone configuration, we just need to create our zonesets. We will have one zoneset per fabric, so one for NexusA:

Zoneset Name  Members
UCSZonesetA   UCSServ01_a, UCSServ02_a, UCSServ03_a

And the zoneset for NexusB:

Zoneset Name  Members
UCSZonesetB   UCSServ01_b, UCSServ02_b, UCSServ03_b

Now we are ready to put this into some NX-OS CLI, and enter this on our switches. The general commands for creating new aliases are:
device-alias database
device-alias name <alias_name> pwwn <device_wwpn>
exit
device-alias commit

So for our NexusA, we do the following:
device-alias database
device-alias name NAFAS01_e0c pwwn 35:20:01:0c:11:22:33:44
device-alias name NAFAS02_e0c pwwn 35:20:02:0c:11:22:33:44
device-alias name UCSServ01_0a pwwn 50:02:77:a4:10:0c:0a:01
device-alias name UCSServ02_0a pwwn 50:02:77:a4:10:0c:0a:02
device-alias name UCSServ03_0a pwwn 50:02:77:a4:10:0c:0a:03
exit
device-alias commit

And for NexusB, we do:
device-alias database
device-alias name NAFAS01_e0d pwwn 35:20:01:0d:11:22:33:44
device-alias name NAFAS02_e0d pwwn 35:20:02:0d:11:22:33:44
device-alias name UCSServ01_0b pwwn 50:02:77:a4:10:0c:0b:01
device-alias name UCSServ02_0b pwwn 50:02:77:a4:10:0c:0b:02
device-alias name UCSServ03_0b pwwn 50:02:77:a4:10:0c:0b:03
exit
device-alias commit
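Typing these by hand is fine for five aliases, but it is easy to generate the same lines from a table. This is a rough Python sketch of that, assuming the aliases are held in a simple name-to-WWPN dictionary; the function name is hypothetical and the script only prints the commands, it does not talk to the switch.

```python
def device_alias_commands(aliases):
    """Build the NX-OS lines that add each alias and commit the database."""
    lines = ["device-alias database"]
    for name, wwpn in aliases.items():
        lines.append(f"device-alias name {name} pwwn {wwpn}")
    lines += ["exit", "device-alias commit"]
    return lines


# The NexusA aliases from the table earlier in this post.
nexus_a_aliases = {
    "NAFAS01_e0c": "35:20:01:0c:11:22:33:44",
    "NAFAS02_e0c": "35:20:02:0c:11:22:33:44",
    "UCSServ01_0a": "50:02:77:a4:10:0c:0a:01",
    "UCSServ02_0a": "50:02:77:a4:10:0c:0a:02",
    "UCSServ03_0a": "50:02:77:a4:10:0c:0a:03",
}

alias_lines = device_alias_commands(nexus_a_aliases)
print("\n".join(alias_lines))
```

The printed output should match the NexusA block above line for line, which makes it easy to compare against what is already on the switch.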

So that’s our alias database taken care of. Now we can create our zones. The command set for creating a zone is:
zone name <zone_name> vsan <vsan_id>
member device-alias <device_1_alias>
member device-alias <device_2_alias>
member device-alias <device_3_alias>
exit

I will use VSAN IDs 101 for fabric A, and 102 for fabric B. So here we will create our zones for NexusA:
zone name UCSServ01_a vsan 101
member device-alias NAFAS01_e0c
member device-alias NAFAS02_e0c
member device-alias UCSServ01_0a
exit
zone name UCSServ02_a vsan 101
member device-alias NAFAS01_e0c
member device-alias NAFAS02_e0c
member device-alias UCSServ02_0a
exit
zone name UCSServ03_a vsan 101
member device-alias NAFAS01_e0c
member device-alias NAFAS02_e0c
member device-alias UCSServ03_0a
exit

And for NexusB:
zone name UCSServ01_b vsan 102
member device-alias NAFAS01_e0d
member device-alias NAFAS02_e0d
member device-alias UCSServ01_0b
exit
zone name UCSServ02_b vsan 102
member device-alias NAFAS01_e0d
member device-alias NAFAS02_e0d
member device-alias UCSServ02_0b
exit
zone name UCSServ03_b vsan 102
member device-alias NAFAS01_e0d
member device-alias NAFAS02_e0d
member device-alias UCSServ03_0b
exit
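The pattern in these zones is the same every time: one zone per server vHBA, containing that initiator plus every storage target on the fabric. Here is a Python sketch of that pattern, assuming a mapping of zone name to initiator alias; the function names are mine and purely illustrative.

```python
def zone_commands(zone_name, vsan_id, member_aliases):
    """Return the NX-OS lines that define a single zone."""
    lines = [f"zone name {zone_name} vsan {vsan_id}"]
    lines += [f"member device-alias {alias}" for alias in member_aliases]
    lines.append("exit")
    return lines


def single_initiator_zones(targets, initiators, vsan_id):
    """One zone per initiator, each holding all targets on that fabric.

    'initiators' maps the zone name to the initiator's device alias.
    """
    lines = []
    for zone_name, initiator in initiators.items():
        lines += zone_commands(zone_name, vsan_id, targets + [initiator])
    return lines


# Fabric A values from this post: VSAN 101, e0c targets, 0a vHBAs.
fabric_a_zones = single_initiator_zones(
    targets=["NAFAS01_e0c", "NAFAS02_e0c"],
    initiators={
        "UCSServ01_a": "UCSServ01_0a",
        "UCSServ02_a": "UCSServ02_0a",
        "UCSServ03_a": "UCSServ03_0a",
    },
    vsan_id=101,
)
print("\n".join(fabric_a_zones))
```

Swapping in the e0d targets, the 0b initiators, and VSAN 102 gives the NexusB zones in the same way.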

So this is all of our zones created. Now we just need to create and activate our zoneset and we have our completed zoning configuration. The commands to create and activate a zoneset are:
zoneset name <zoneset_name> vsan <vsan_id>
member <zone_1_name>
member <zone_2_name>
exit
zoneset activate name <zoneset_name> vsan <vsan_id>
exit

So now we have our NexusA configuration:
zoneset name UCSZonesetA vsan 101
member UCSServ01_a
member UCSServ02_a
member UCSServ03_a
exit
zoneset activate name UCSZonesetA vsan 101
exit

And our NexusB configuration:
zoneset name UCSZonesetB vsan 102
member UCSServ01_b
member UCSServ02_b
member UCSServ03_b
exit
zoneset activate name UCSZonesetB vsan 102
exit
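For completeness, the zoneset step can be generated in the same style. This Python sketch assumes you already have the list of zone names for the fabric; the function name is hypothetical, and the output simply mirrors the command layout shown above, including the final exit.

```python
def zoneset_commands(zoneset_name, vsan_id, zone_names):
    """Return the NX-OS lines that build a zoneset and then activate it."""
    lines = [f"zoneset name {zoneset_name} vsan {vsan_id}"]
    lines += [f"member {zone}" for zone in zone_names]
    lines.append("exit")
    # Activation is a separate command, run after leaving the zoneset context.
    lines.append(f"zoneset activate name {zoneset_name} vsan {vsan_id}")
    lines.append("exit")
    return lines


zoneset_b = zoneset_commands(
    "UCSZonesetB", 102, ["UCSServ01_b", "UCSServ02_b", "UCSServ03_b"]
)
print("\n".join(zoneset_b))
```

Remember that only one zoneset can be active per VSAN at a time, so activating a generated zoneset replaces whatever was active before.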

So that’s how we compose our zoning configuration, and apply it to our Nexus switch. Hopefully this will be a useful reference on how to do this.

NetApp SnapCenter 1.0 – a new hope…

NetApp recently released version 1.0 of a new software offering going by the name of SnapCenter. It’s a long-held tradition that 80% of NetApp’s releases contain the word ‘snap’, continuing to point out their age-old storage innovation of snapshot technology providing efficient, speedy backups of your precious data.


So what does SnapCenter bring to the table that we did not have before? Well first we need some context…

SnapDrive is Windows/UNIX software which taps into a NetApp storage system, allowing the provisioning, backup, restoration, and administration of storage resources without having to directly log onto the storage system. This enables application owners to take control of their own backup/restore operations and therefore feel more able to manage their data. For applications or server roles which are not subject to issues with inconsistency in backups, the backup/restore features in SnapDrive are fine. For applications which do have this concern, NetApp have provided another solution.

With me so far? Good. So SnapDrive is supplemented by the SnapManager suite of products. These have been built up over a long period of time by NetApp, and integrate directly with applications like:

  • SQL Server
  • Oracle
  • VMware
  • Hyper-V
  • SharePoint
  • Exchange
  • SAP

These applications have vastly different purposes, but have equally unique requirements in terms of backing up their data in an application consistent way. Usually creating a backup/restore strategy which produces application consistent backups requires detailed understanding of the application, and is not integrated with the features presented by the underlying storage.

The SnapManager suite of products fills this gap, delivering a simplified, storage-integrated, application consistent method of easily backing up and restoring data, and providing the features that application owners desire. Further to this, it gives the application owners a simple GUI to take ownership of their own backup and recovery, whilst ensuring nothing in the underlying storage will break.

But this panacea to the challenge of backup and recovery, and its place within the application stack, is not without fault. Many criticisms have been levelled at the SnapManager suite over the years. The main two criticisms which I believe SnapCenter addresses are:

  1. Inconsistent user interfaces – the SnapManager suite was built up over time by NetApp, and many of the products were developed by different internal teams. This meant that the resultant software has very different looks and feels as you transition from one product to another. This complicates administration of the product for infrastructure administrators because they end up with multiple GUIs to learn, instead of a single GUI.
  2. Scalability issues – to be fair to NetApp, this is not just an issue with their solution; a previous workplace of mine were heavy users of IBM’s Tivoli Storage Manager, and that had a similar issue. As your environment grows, you may end up with tens of SQL servers, which means tens of instances of SnapManager for SQL to install, update, manage, and monitor. This could mean thousands upon thousands of reports and alerts to sift through each day, and without a solution to manage this, issues will go undiscovered for days, weeks or even months. Once you add in your Exchange environments, vCenter servers, SharePoint farms, Oracle servers etc., you may be looking at tens of thousands of backups running a day, and potentially hundreds of pieces of installed software to manage and try to keep an eye on.

So how does SnapCenter address this problem? Well, with the release of Clustered Data ONTAP (CDOT) 8.3 at the start of 2015, and the end of NetApp’s legacy 7-Mode operating system, there seems to have been a drive to revitalise their software and hardware lines, simplifying the available options, and pushing software interfaces to be web based, rather than thick GUIs.

So the value proposition with SnapCenter is a centrally managed point of reference to control your backups programmatically, with a modern web based interface, and scalability to provide a workable solution regardless of the size of estate being backed up. So let’s look at these features, and how NetApp have delivered this:

1. Scalability

SnapCenter scales by using the Windows NLB (Network Load Balancing) and ARR (Application Request Routing, basically a reverse web proxy) features to allow for the creation of a farm of SnapCenter servers, up to the maximum size allowed by Windows NLB of 32 nodes.

SnapCenter utilises a SQL database as its back end. This can be either a local SQL Server Express instance (for small deployments), or a full SQL Server instance for scalable deployments.

2. Programmability

NetApp have also been pretty decent at including programmability in their more recent software offerings, and SnapCenter is no exception, providing a PowerShell cmdlet pack and, of course, the now ubiquitous REST API. SnapCenter is also policy-driven, which means once you have created your backup policy sets you can apply them to new datasets you want to back up going forward. This helps to keep manageability of backups under control as your infrastructure grows.

3. Interface

A web interface is a beautiful thing. Accessing software from any browser on any OS makes life a lot easier for administrators, and not logging onto servers means less chance of breaking said servers. NetApp have chosen HTML5 for this interface, which does away with the pain of having to deal with Java or Flash which plagues other web interfaces (UCS, VMware, I’m looking at you!). NetApp have raised the bar with the SnapCenter interface, producing a smart and stylish WUI not dissimilar to Microsoft’s Azure interface.


Once you have installed the SnapCenter software on your Windows server, you will need to use the software to deploy the Windows and SQL Server plug-ins to your SQL servers. These plug-ins replace SnapDrive and SnapManager respectively, but this deployment process promises to be quick and painless, and a reboot should not be necessary. SnapCenter utilises the same licenses as SnapManager, so if this is already licensed on your storage system then you are good to go. There is a migration feature present to help you move from SnapManager to SnapCenter, although this does not support migration of databases on VMDKs at this time.

The initial release of SnapCenter only interoperates with SQL Server, and VMware through the Virtual Storage Console (VSC), so it probably won’t replace many customers’ full SnapManager install bases just yet, but the delivery team are promising rollouts of more plug-ins over the coming months.

There are limitations even in the SQL backup/recovery capabilities, although these will likely not affect many customers. These are detailed in the product Release Notes, but the biggest of these from what I can see is that SnapCenter does not presently support SQL databases on SMB volumes.

Hopefully NetApp will provide regular and functionality enhancing updates to this product so that it delivers on its promises. It would also be good to see some functionality enhancements over what is currently delivered by the SnapManager products. Top of the list from my perspective is allowing Exchange databases to reside on VMDK storage, as the current restriction on this being purely LUN based makes things difficult, especially where customers are not deploying iSCSI. This means the dreaded RDMs must be used in VMware, which as a VMware admin causes no end of headaches. It would also be nice to see this offered at some point as a virtual appliance, perhaps with an embedded PostgreSQL type database similar to what VMware offer for the vCenter Server Appliance, but that will be way down the line I would imagine, as providing an appliance that scales well is a difficult thing.

NetApp have promised to continue to deliver SnapManager products for the time being. This is needed because of the lack of 7-Mode support in SnapCenter. Having worked extensively with both CDOT and 7-Mode though, I think there are many compelling reasons to move to CDOT if possible, and this seems like a fair compromise. SnapCenter can be installed quickly and tested out without committing to moving all your databases over to it, so give it a try. It’s the future after all!

NetApp Cluster Mode Data ONTAP (CDOT) 8.3 Reversion to 8.2 7-Mode

A project came in at work to build out a couple of new NetApp FAS2552 arrays; this was to replace old FAS2020s for a customer who was using FCP in their Production datacenter, and iSCSI in their DR datacenter, with a semi-synchronous SnapMirror relationship between the two.

The new arrays arrived on site, and we set them up separate from the production network, to configure them. We quickly identified that the 2552s were running Data ONTAP 8.3RC1, which is how they were sent to us out of the factory. Nobody had any experience with Cluster Mode Data ONTAP, but this didn’t seem too much of a challenge, as it did not seem hugely different.

After looking into what to do next, it appeared that transitioning SAN volumes from 7-mode to Cluster Mode Data ONTAP is not possible, so the decision was taken to downgrade the OS from 8.3RC1 to 8.2 7-mode, to make the transition of the customer’s data, and the downtime during switchover from old arrays to new, as easy and quick as possible.

We got there in the end, but due to the tomes of documentation we had to trawl through, and tie together, I decided to document the process, to assist any would-be future CDOT luddites in carrying out this task.

NOTE: This has not been tested on anything other than a FAS2552 with two controllers, and if you are in any way uncertain I would suggest contacting NetApp support for assistance. As this was a brand new array, and there was no risk of data loss, we proceeded regardless. You will need a NetApp support account to access some of the documentation and downloads referenced below. This is the way we completed the downgrade; I am not saying it is the best way, and although I have many years’ experience of working with NetApp arrays, this is just a guide.

  • Downloading and updating the boot image:

We decided on 8.2.3 for our boot image; this was the last edition of Data ONTAP with 7-mode included. If you go to http://mysupport.netapp.com/NOW/cgi-bin/software/ and select your array type you will see the available versions for your array. There are pages of disclaimers to agree to, and documents of pre-requisites and release notes for each version. These are worth reading to ensure there are no known issues with your array type. Eventually you will get the download, which will be a .tgz file.

You will now need a system with IP connectivity to both controllers, and something like FileZilla Server to host the file via FTP. This will allow you to get the file up to the controller. I am not going to include steps to set up your FTP server, but there are plenty of resources online to do this. You could also host this via HTTP using something like IIS if that is more convenient.

Now to pull the image onto the array. This will need doing on both controllers (nodes). This document was followed, specifically the following command (based on content on page 143):

 system node image get -node localhost -package <location> -replace-package true

I changed the command to replace ‘-node *’ with ‘-node localhost’ so we could download the image to each node in turn; this was just to ensure we could tackle any issues with the download. I also removed the ‘-background true’ switch, which would have run the download in the background, to give us maximum visibility.

Our cluster had never been properly configured, but there are still a bunch of checks to do at this point to ensure your node is ready for the reversion. These are all detailed in the above document and should be followed to make sure nothing is amiss. We ran through these checks prior to installing the newly downloaded image.

Once happy, the image can be installed by running:

system node image update -node localhost -package file:///mroot/etc/software/<image_name>

The image name will be the name of the .tgz file you downloaded to the controller earlier (including the extension).

Once the image is installed, you can check the state of the installation with:

 system image show

This should show something like:

[Screenshot: output of system image show]

This shows the images for one controller only, but shows us the image we are reverting to is loaded into the system, and we can move on.

There are some more steps in the document to follow, ensuring the cluster is shut down and failover is disabled before we can revert. Follow these from the same document as above.

Next we would normally run ‘revert_to 8.2’ to revert the firmware. However, we had issues at this point because of ADP (Advanced Drive Partitioning), which seems to mark the disks as being in a shared container. The background to this is covered in Dan Barber’s excellent article. Long story short, we decided to reboot and format the array again to get round this.

  • Re-zeroing the disks and building new vol0:

We rebooted the first controller, and saw that when it came back up it was running in 8.2.3 (yay) Cluster Mode (boo). We tried zeroing the disks and building a new vol0, by interrupting the boot sequence with Ctrl+C to get to the special boot menu, and then running option 4. This was no good for us though, because once built, the controller booted into 8.2.3 Cluster Mode. A new tactic would be required.

We found this blog post on Krish Palamadathil’s blog, which detailed how to get around this. The downloaded image contains both Cluster Mode and 7-Mode images, but boots into Cluster Mode by default when doing this reversion. Cutting to the chase, the only thing we needed to do was to get to the Boot Loader (Ctrl+C during reboot to abort the boot process), and then run the following commands:

 LOADER> set-defaults
 LOADER> boot_ontap

We then saw the controller come up in 8.2.3 7-Mode, interrupted the boot sequence, and ran an option 4 to zero the disks again and build a new vol0.

Happy to say that the array is now at the correct version and in a state where it can now be configured. As usual, the NetApp documentation was great, even if we had to source steps from numerous different places. As this is still a very new version of Data ONTAP I would expect this documentation to get better over time. In the meantime, hopefully this guide can be of use to people.

Storage I/O Control – what to expect

Storage I/O Control, or SIOC, was introduced back in vSphere 4.1. It provides a way for vSphere to combat what is known as the ‘noisy neighbour’ syndrome. This describes the situation where multiple VMs reside on a single datastore, and one or more of these VMs take more than their fair share of bandwidth to the datastore. This could be happening because a VM decides to misbehave, because of poor choices in VM placement, or because workloads have changed.

The reigning principle behind SIOC is one of fairness, allowing all VMs a chance to read and write without being swamped by one or more ‘greedy’ VMs. This is something which, in the past, would have been controlled by disk shares, and indeed this method can still be used to prioritise certain workloads on a datastore over others. The advantage with SIOC is that, other than the couple of configurable settings, described below, no manual tinkering is really required.

Options available for Storage I/O Control

There are only two settings to pick for SIOC:

1) SIOC Enabled/Disabled – either turn SIOC on, or off, at the datastore level. More on considerations for this further down.

2) Congestion Threshold – this is the trigger point at which SIOC will kick in and start doing its thing, throttling I/O to the datastore. This can be configured with one of two types of value:

a) Manual – this is set in milliseconds and defaults to 30ms, but the right value varies depending on your storage. VMware have tables on how to calculate this in their SIOC best practice guide, but the default should be fine for most situations. If in doubt then your storage provider should be able to give guidance on the correct value to choose.

b) Percentage of peak throughput – this is only available through the vSphere Web Client, and was added in vSphere 5.1. This takes the guesswork out of setting the threshold, replacing it with an automated method for vSphere to analyse the datastore I/O capabilities and use this to determine the peak throughput.

My experience of using SIOC is described in the following paragraphs. Improvements were seen, and no negative performance impact was experienced (as expected), although there were some unexpected results.

Repeated latency warnings similar to the following from multiple hosts were seen, for multiple datastores across different storage systems:

Device naa.5000c5000b36354b performance has deteriorated. I/O latency increased from average value of 1832 microseconds to 19403 microseconds

These warnings report the latency time in microseconds, so in the above example, the latency is going from 1.8ms to 19ms. This is still a workable latency, but the rise is flagged due to the large increase (in this case by a factor of ten). The results seen in the logs were much worse than this though: sometimes latency was rising to as much as 20 seconds, and this was happening mostly in the middle of the night.
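When there are hundreds of these warnings to sift through, it helps to pull the numbers out and convert them to milliseconds. Below is a minimal Python sketch that does this for the message format shown above. The function name is hypothetical, and the 30ms default is simply the manual congestion threshold default mentioned earlier, used here as a convenient yardstick rather than anything SIOC itself reports in these messages.

```python
import re

# Matches the latency warning format quoted above.
WARNING_RE = re.compile(
    r"Device (?P<device>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<old>\d+) microseconds "
    r"to (?P<new>\d+) microseconds"
)


def parse_latency_warning(line, threshold_ms=30.0):
    """Extract device, latencies in ms, and the increase factor from a warning.

    Returns None if the line is not a latency warning.
    """
    match = WARNING_RE.search(line)
    if match is None:
        return None
    old_us = int(match["old"])
    new_us = int(match["new"])
    return {
        "device": match["device"],
        "old_ms": old_us / 1000,
        "new_ms": new_us / 1000,
        # Guard against a zero baseline so we never divide by zero.
        "factor": new_us / old_us if old_us else None,
        "over_threshold": new_us / 1000 > threshold_ms,
    }


line = (
    "Device naa.5000c5000b36354b performance has deteriorated. "
    "I/O latency increased from average value of 1832 microseconds "
    "to 19403 microseconds"
)
result = parse_latency_warning(line)
print(result)
```

For the example warning this gives roughly 1.8ms rising to 19.4ms, a little over a tenfold increase, and still under the 30ms yardstick.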

After checking out the storage configuration, it was identified that Storage I/O Control was turned off across the board. This is set to disabled by default for all datastores and as such, had been left as was. Turning SIOC on seemed like a sensible way forward so the decision was taken to proceed in turning it on for some of the worst affected datastores.

After turning on SIOC on a handful of datastores, a good reduction in the number of I/O latency doublings being reported in the ESXi logs was seen. Unfortunately a new message began to flag in the host events logs:

Non-VI workload detected on the datastore

This was repeatedly seen against the LUNs for which SIOC had been enabled. VMware have a knowledge base article for this which describes the issue. In this case, the problem stemmed from the fact that the storage backend providing the LUNs had a single disk pool (or mDisk Group, as this was presented by an IBM SVC) which was shared with unmanaged RDMs, and other storage presented outside the VMware environment.

The impact of this is that, whilst VMware plays nicely, throttling I/O access when the congestion threshold is reached, other workloads such as non-SIOC datastores, RDMs, or other clients of the storage group, will not be so fair in their usage of the available bandwidth. This is due to the spindles presented being shared. One solution to this would be to present dedicated disk groups to VMware workloads, ensuring that all datastores carved out of these disks have SIOC turned on.

We use EMC VNX and IBM SVC as our storage of choice, and the recommendation from both these vendors is to turn SIOC on for all datastores, and to leave it on. I can only imagine that the reason this is still not a default is because it is not suitable for every storage type. As with all these things, checking storage vendor documentation is probably the best option, but SIOC should provide benefit in most use cases, although as described above, you may see some unexpected results. It is worth noting that this feature is Enterprise Plus only, so anyone running a less feature packed version of vSphere will not be able to take advantage of it.