Incorrectly Reported Separated Network Partitions in VSAN Cluster

I've been playing around with VSAN, automating the build of a 3 node Management cluster using ESXi 6.0 Update 1. I came across an issue where I moved one of my hosts to another cluster and then back into the VSAN cluster; when it came back it showed as a separate network partition, and had a separate VSAN datastore.

The VSAN Disk Management page under my cluster in the Web Client showed that the Network Partition Group for this host was different from my other two hosts, despite the network being absolutely fine.

It turned out that the host had not rejoined the VSAN cluster, but had instead created its own single-node cluster. I resolved this by running the following commands:

On the partitioned host:

esxcli vsan cluster get

Cluster Information

   Enabled: true

   Current Local Time: 2016-09-21T10:23:35Z

   Local Node UUID: 57e0040c-83a9-add9-ec1f-0cc47ab46218

   Local Node Type: NORMAL

   Local Node State: MASTER

   Local Node Health State: HEALTHY

   Sub-Cluster Master UUID: 57e0040c-83a9-add9-ec1f-0cc47ab46218

   Sub-Cluster Backup UUID:

   Sub-Cluster UUID: 3451e257-cedd-8772-4b31-0cc47ab460e8

   Sub-Cluster Membership Entry Revision: 0

   Sub-Cluster Member Count: 1

   Sub-Cluster Member UUIDs: 57e0040c-83a9-add9-ec1f-0cc47ab46218

   Sub-Cluster Membership UUID: 9c5fe257-e053-7716-ca0a-0cc47ab46218

This shows the host in a single-node cluster.

On a surviving host:

esxcli vsan cluster get

Cluster Information

   Enabled: true

   Current Local Time: 2016-09-21T11:14:55Z

   Local Node UUID: 57e006b6-71ab-c8f6-7d1d-0cc47ab460e8

   Local Node Type: NORMAL

   Local Node State: MASTER

   Local Node Health State: HEALTHY

   Sub-Cluster Master UUID: 57e006b6-71ab-c8f6-7d1d-0cc47ab460e8

   Sub-Cluster Backup UUID: 57e0f22f-3071-fe1a-fd8e-0cc47ab460ec

   Sub-Cluster UUID: 57e0040c-83a9-add9-ec1f-0cc47ab46218

   Sub-Cluster Membership Entry Revision: 0

   Sub-Cluster Member Count: 2

   Sub-Cluster Member UUIDs: 57e0f22f-3071-fe1a-fd8e-0cc47ab460ec, 57e006b6-71ab-c8f6-7d1d-0cc47ab460e8

   Sub-Cluster Membership UUID: 3451e257-cedd-8772-4b31-0cc47ab460e8

This showed me there were only 2 nodes in the cluster. We will use the Sub-Cluster UUID from this output in a moment.

On the partitioned host:

esxcli vsan cluster leave

esxcli vsan cluster join -u 57e0040c-83a9-add9-ec1f-0cc47ab46218

esxcli vsan cluster get

Cluster Information

   Enabled: true

   Current Local Time: 2016-09-21T10:24:26Z

   Local Node UUID: 57e0040c-83a9-add9-ec1f-0cc47ab46218

   Local Node Type: NORMAL

   Local Node State: AGENT

   Local Node Health State: HEALTHY

   Sub-Cluster Master UUID: 57e006b6-71ab-c8f6-7d1d-0cc47ab460e8

   Sub-Cluster Backup UUID: 57e0f22f-3071-fe1a-fd8e-0cc47ab460ec

   Sub-Cluster UUID: 57e0040c-83a9-add9-ec1f-0cc47ab46218

   Sub-Cluster Membership Entry Revision: 1

   Sub-Cluster Member Count: 3

   Sub-Cluster Member UUIDs: 57e0f22f-3071-fe1a-fd8e-0cc47ab460ec, 57e006b6-71ab-c8f6-7d1d-0cc47ab460e8, 57e0040c-83a9-add9-ec1f-0cc47ab46218

   Sub-Cluster Membership UUID: 3451e257-cedd-8772-4b31-0cc47ab460e8

Now we see all three nodes back in the cluster. The data will take some time to rebuild on this node, but once done, the VSAN health check should show as Healthy, and there should be a single VSAN datastore spanning all hosts.
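For anyone wanting to script this, the same esxcli commands can be driven from PowerCLI via Get-EsxCli. The sketch below is a minimal example assuming an existing Connect-VIServer session; the host name and UUID are placeholders, and the exact argument name for the join call is worth confirming with CreateArgs() on your PowerCLI version.

# Minimal PowerCLI sketch (assumes Connect-VIServer has been run; host name and UUID are placeholders)
$vmhost = Get-VMHost -Name "esxi01.lab.local"
$esxcli = Get-EsxCli -VMHost $vmhost -V2

# Equivalent of 'esxcli vsan cluster get'
$esxcli.vsan.cluster.get.Invoke()

# Equivalent of 'esxcli vsan cluster leave' followed by 'join -u <uuid>'
$esxcli.vsan.cluster.leave.Invoke()
$joinArgs = $esxcli.vsan.cluster.join.CreateArgs()
$joinArgs.clusteruuid = "57e0040c-83a9-add9-ec1f-0cc47ab46218"   # Sub-Cluster UUID from a surviving host
$esxcli.vsan.cluster.join.Invoke($joinArgs)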

FlexPod and UCS – where are we now?

I have been working on FlexPod for about a year now, and on UCS for a couple of years; in my opinion it's great tech that takes a lot of the pain and guesswork out of designing new data center infrastructure solutions, bringing inherent simplicity, reliability, and resilience to the space. Over the last year or so there has been a shift in the Cisco Validated Designs (CVDs) coming out of Cisco, and a noticeable change in the direction FlexPod is taking. I should note that some of what I discuss here is already clear, and some is mere conjecture. I think the direction FlexPod is going in is a symptom of changes in the industry, but it is clear that converged solutions are becoming more and more desirable as the Enterprise technology world moves forward.

So what are the key changes we are seeing?

The SDN elephant in the room 

Cisco's ACI (Application Centric Infrastructure) has taken a while to get moving; replacing an existing three-tier network architecture with a leaf-spine network is not something which is going to happen overnight. The switched fabric is arguably a great solution for modern data centers, where east-west traffic makes up the bulk of network activity and Spanning Tree Protocol continues to be the bane of network admins, but its implementation often requires either a green-field deployment, or forklift replacement of large portions of a data center's existing core networking.

That's not to say ACI is not doing OK in terms of sales; Cisco's figures, and case studies, seem to show that there is uptake, including some large customers. So how does this fit in with FlexPod? Well, the Cisco Validated Designs (CVDs) released over the last 12 months have all included the new Nexus 9000 series switches, rather than the previous stalwart Nexus 5000 series. These are ACI-capable switches which can also operate in the 'legacy' NX-OS mode. Capability wise, in the context of FlexPod, there is not a lot of difference: they now have FCoE support, and can do vPC, QoS, Layer 3, and all the good stuff we have come to expect from Nexus switches.

So from what I can gather, the inclusion of 9K switches in the FlexPod line (outside of the FlexPod with ACI designs) is there to enable FlexPod customers to more easily move into the leaf/spine ACI network architecture at a later date, should they wish to do so. This makes sense, and the pricing on the 9Ks being used looks favourable compared to the 5Ks, so this is a win-win for customers, even if they never decide to go with ACI.

40GbE standard 

Recent announcements around the Gen 3 UCS Fabric Interconnects have revealed that 40GbE is now going to be the standard for UCS connectivity, and the new chassis designs show 4 x 40GbE QSFP connections per I/O module, for a total of 320Gbps of bandwidth per chassis. This is an incredible amount of throughput, and although I can't see 99% of customers going anywhere near these levels, it does help to strengthen the UCS platform's case for even the most high performance environments, and reduces the need for Infiniband-type solutions in high throughput environments.

Another interesting point, following on from the ACI ramblings above, is that the new 6300 series Fabric Interconnects are based on the Nexus 9300 switching line, rather than the Nexus 5K-based 6200 series. This positions them perfectly to act as a leaf in an ACI fabric one day, should this become the eventual outcome of Cisco's strategy.

HTML5 FTW! 

With the announcements about the new UCS generation came the news that, from UCS firmware version 3.1, the software for UCS is now unified across UCS classic, UCS Mini, and the newish M-Series systems. This simplifies things for people looking at version numbers and how they relate to the relevance of a release, and means there should now be relative feature parity across all footprints of UCS systems.

The most exciting part, if you have experienced the long-running pain of Java, is that the new release incorporates the HTML5 interface which has been seen on UCS Mini since its release. I'm sure this will bring new challenges with it, but for now at least, it is something fresh to look forward to for those running UCS classic.

FlexPod Mini – now not so mini 

 

FlexPod Mini is based on UCS Mini, which came out around 18 months ago. In UCS Mini, the I/O Modules (or FEXs) in the UCS 5108 chassis are replaced with UCS 6324 pocket-sized Fabric Interconnects, ultimately cramming a single chassis of UCS, and the attached network equipment, into just 6 rack units. This could be expanded with C-Series servers, but the scale for UCS blades was strictly limited to the 8-blade limit of a standard chassis. With the announcement of the new Fabric Interconnect models came news of a new 'QSFP Scalability Port License', which allows the 40GbE port on the UCS 6324 FI to be used with a 1 x QSFP to 4 x SFP+ cable to add another chassis to the UCS Mini.

Personally I haven't installed a UCS Mini, but the form factor is a great fit for certain use cases, and the more flexible it is, the more desire there will be to use this solution. For FlexPod, this ultimately means more suitable use cases, particularly in the ROBO (remote office/branch office) type scenario.

What’s Next? 

So with FlexPod now having new switches and new UCS hardware, it seems something new from NetApp is next on the list. The FAS8000 series is a couple of years old now, so we will likely see a refresh at some point, probably with 40GbE on board, more flash options, and faster CPUs. The recent purchase of SolidFire by NetApp will also quite probably see some kind of SolidFire-based FlexPod CVD coming out of the Cisco/NetApp partnership in the near future.

We are also seeing some exciting (or at least as exciting as these things can be!) new software releases this year: Windows Server 2016, vSphere 6.5 (assuming this will be revealed at VMworld), and OpenStack Mitaka, all of which will likely bring new CVDs.

In the zone…basic zoning on Cisco Nexus switches

In this post I look to go over some basic FCoE zoning concepts for Cisco Nexus switches. Although FCoE has not really captured the imagination of the industry, it is used in a large number of Cisco infrastructure deployments, particularly around Nexus and UCS technologies. My experience is mostly based on FlexPods, where we have this kind of design (showing FCoE connectivity only, to keep things simple):

(Diagram: FlexPod FCoE connectivity from UCS, through the Nexus switches, to NetApp storage)

Zoning in a FlexPod is simple enough: we may have a largish number of hosts, but we are only zoning on two switches, and only have 4 or 8 FCoE targets, depending on our configuration. In fact the zoning configuration can be fairly simply automated using PowerShell by tapping into the NetApp, Cisco UCS, and NX-OS APIs. The purpose of this post, though, is to describe the configuration steps required to complete the zoning.

The purpose of zoning is to restrict access to a LUN (Logical Unit Number), essentially a Fibre Channel block device on our storage, to one or more accessing devices, or hosts. This is useful in the case of boot disks, where we only ever want a single host accessing that device, and in the case of shared data devices, like cluster shared disks in Microsoft Clustering, or VMFS datastores in the VMware world, where we only want a subset of hosts to be able to access the device.

I found configuring zoning on a Cisco switch took a bit of getting my head around, so hopefully the explanation below will help to make this simpler for someone else.

From the Cisco UCS (or whatever server infrastructure you are running), you will need to gather a list of the WWPNs for the initiators wanting to connect to the storage. These will be in the format 50:02:77:a4:10:0c:4e:21, an 8-byte (16 hex digit) number. Likewise, you will need to gather the WWPNs from your storage (in the case of FlexPod, your NetApp storage system).
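If your hosts are already running ESXi (rather than being stood up fresh for SAN boot), one way to collect the initiator WWPNs is via PowerCLI; a rough sketch is below, assuming an existing vCenter connection. It simply converts the numeric PortWorldWideName property into the colon-separated format used in the tables that follow.

# Rough PowerCLI sketch: list the FC/FCoE initiator WWPNs for every host (assumes Connect-VIServer has been run)
foreach ($vmhost in Get-VMHost) {
    Get-VMHostHba -VMHost $vmhost -Type FibreChannel | ForEach-Object {
        $hex  = "{0:x16}" -f $_.PortWorldWideName          # 64-bit WWPN to 16 hex digits
        $wwpn = ($hex -split '(..)' -ne '') -join ':'      # insert a colon every two digits
        "{0}  {1}  {2}" -f $vmhost.Name, $_.Device, $wwpn
    }
}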

Once we have these, we are ready to do our zoning on the switch. When doing the zoning configuration there are three main elements of configuration we need to understand:

  1. Aliases – these map a WWPN to a friendly name, and sit in the device alias database on the switch. You can get away without using them and just use the raw WWPNs later, but that will make things far more difficult should something go wrong. So basically these just match WWPNs to devices.
  2. Zones – these logically group initiators and targets together, meaning that only the device aliases listed in the zone are able to talk to one another. This provides security and ease of management; a device can exist in more than one zone.
  3. Zonesets – this groups together zones, allowing the administrator to bring all the zones online or offline together. Only one zoneset can be active at a time.

On top of this, there is one more thing to understand when creating zoning on our Nexus switch, and that is the concept of a VSAN. A VSAN, or Virtual Storage Area Network, is the Fibre Channel equivalent of a VLAN. It is a logical collection of ports which together form a single discrete fabric.

So let’s create a fictional scenario, and create the zoning configuration for this. We have a FlexPod with two Nexus 5k switches, with separate fabrics, as shown in the diagram above, meaning that our 0a ports on the servers only go to Nexus on fabric A, and 0b ports only go to Nexus on fabric B. Both of our e0c ports on our NetApp storage go to NexusA, and both our e0d ports go to NexusB:

NAFAS01 – e0c, e0d

NAFAS02 – e0c, e0d

And 3 Cisco UCS service profiles, each with two vHBAs, wanting to access storage on these targets. These are created as follows:

UCSServ01 – 0a, 0b

UCSServ02 – 0a, 0b

UCSServ03 – 0a, 0b

So on NexusA, we need the following aliases in our database:

Device Port WWPN Alias Name
NAFAS01 e0c 35:20:01:0c:11:22:33:44 NAFAS01_e0c
NAFAS02 e0c 35:20:02:0c:11:22:33:44 NAFAS02_e0c
UCSServ01 0a 50:02:77:a4:10:0c:0a:01 UCSServ01_0a
UCSServ02 0a 50:02:77:a4:10:0c:0a:02 UCSServ02_0a
UCSServ03 0a 50:02:77:a4:10:0c:0a:03 UCSServ03_0a

And on NexusB, we need the following:

Device Port WWPN Alias Name
NAFAS01 e0d 35:20:01:0d:11:22:33:44 NAFAS01_e0d
NAFAS02 e0d 35:20:02:0d:11:22:33:44 NAFAS02_e0d
UCSServ01 0b 50:02:77:a4:10:0c:0b:01 UCSServ01_0b
UCSServ02 0b 50:02:77:a4:10:0c:0b:02 UCSServ02_0b
UCSServ03 0b 50:02:77:a4:10:0c:0b:03 UCSServ03_0b

And the zones we need on each switch are, firstly for NexusA:

Zone Name Members
UCSServ01_a NAFAS01_e0c, NAFAS02_e0c, UCSServ01_0a
UCSServ02_a NAFAS01_e0c, NAFAS02_e0c, UCSServ02_0a
UCSServ03_a NAFAS01_e0c, NAFAS02_e0c, UCSServ03_0a

And for Nexus B:

Zone Name Members
UCSServ01_b NAFAS01_e0d, NAFAS02_e0d, UCSServ01_0b
UCSServ02_b NAFAS01_e0d, NAFAS02_e0d, UCSServ02_0b
UCSServ03_b NAFAS01_e0d, NAFAS02_e0d, UCSServ03_0b

This gives us a zone for each server to boot from, allowing that vHBA on the server to boot from either of the NetApp interfaces it can see on its fabric. The boot order itself is controlled from within UCS; by creating zoning that lets the server boot via either fabric we create resilience. All of this is just to demonstrate how we construct the zoning configuration, so things will no doubt be different in your environment.

So now we know what we should have in our populated alias database and our zone configuration, we just need to create our zonesets. We will have one zoneset per fabric, so one for NexusA:

Zoneset Name Members
UCSZonesetA UCSServ01_a, UCSServ02_a, UCSServ03_a

And the zoneset for NexusB:

Zoneset Name Members
UCSZonesetB UCSServ01_b, UCSServ02_b, UCSServ03_b

Now we are ready to put this into some NX-OS CLI and enter it on our switches. The general commands for creating new aliases are:
device-alias database
device-alias name <alias_name> pwwn <device_wwpn>
exit
device-alias commit

So for our NexusA, we do the following:
device-alias database
device-alias name NAFAS01_e0c pwwn 35:20:01:0c:11:22:33:44
device-alias name NAFAS02_e0c pwwn 35:20:02:0c:11:22:33:44
device-alias name UCSServ01_0a pwwn 50:02:77:a4:10:0c:0a:01
device-alias name UCSServ02_0a pwwn 50:02:77:a4:10:0c:0a:02
device-alias name UCSServ03_0a pwwn 50:02:77:a4:10:0c:0a:03
exit
device-alias commit

And for Nexus B, we do:
device-alias database
device-alias name NAFAS01_e0d pwwn 35:20:01:0d:11:22:33:44
device-alias name NAFAS02_e0d pwwn 35:20:02:0d:11:22:33:44
device-alias name UCSServ01_0b pwwn 50:02:77:a4:10:0c:0b:01
device-alias name UCSServ02_0b pwwn 50:02:77:a4:10:0c:0b:02
device-alias name UCSServ03_0b pwwn 50:02:77:a4:10:0c:0b:03
exit
device-alias commit
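As mentioned earlier, this kind of repetitive configuration lends itself to being generated with PowerShell. The snippet below is a minimal sketch of just the templating part: it builds the NexusA device-alias block from a table of names and WWPNs (the fictional values from this post), and the resulting lines can be pasted into the switch or pushed through whatever NX-OS automation you already use.

# Minimal sketch: generate the NexusA device-alias commands from a name-to-WWPN table
$aliases = [ordered]@{
    "NAFAS01_e0c"  = "35:20:01:0c:11:22:33:44"
    "NAFAS02_e0c"  = "35:20:02:0c:11:22:33:44"
    "UCSServ01_0a" = "50:02:77:a4:10:0c:0a:01"
    "UCSServ02_0a" = "50:02:77:a4:10:0c:0a:02"
    "UCSServ03_0a" = "50:02:77:a4:10:0c:0a:03"
}

$commands  = @("device-alias database")
$commands += $aliases.GetEnumerator() | ForEach-Object { "device-alias name $($_.Key) pwwn $($_.Value)" }
$commands += "exit", "device-alias commit"
$commands   # paste into the switch, or feed to your preferred NX-OS automation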

So that’s our alias database taken care of, now we can create our zones. The command set for creating a zone is:
zone name <zone_name> vsan <vsan_id>
member device-alias <device_1_alias>
member device-alias <device_2_alias>
member device-alias <device_3_alias>
exit

I will use VSAN IDs 101 for fabric A, and 102 for fabric B. So here we will create our zones for NexusA:
zone name UCSServ01_a vsan 101
member device-alias NAFAS01_e0c
member device-alias NAFAS02_e0c
member device-alias UCSServ01_0a
exit
zone name UCSServ02_a vsan 101
member device-alias NAFAS01_e0c
member device-alias NAFAS02_e0c
member device-alias UCSServ02_0a
exit
zone name UCSServ03_a vsan 101
member device-alias NAFAS01_e0c
member device-alias NAFAS02_e0c
member device-alias UCSServ03_0a
exit

And for NexusB:
zone name UCSServ01_b vsan 102
member device-alias NAFAS01_e0d
member device-alias NAFAS02_e0d
member device-alias UCSServ01_0b
exit
zone name UCSServ02_b vsan 102
member device-alias NAFAS01_e0d
member device-alias NAFAS02_e0d
member device-alias UCSServ02_0b
exit
zone name UCSServ03_b vsan 102
member device-alias NAFAS01_e0d
member device-alias NAFAS02_e0d
member device-alias UCSServ03_0b
exit

That is all of our zones created; now we just need to create and activate our zonesets and we have our completed zoning configuration. The commands to create and activate a zoneset are:
zoneset name <zoneset_name> vsan <vsan_id>
member <zone_1_name>
member <zone_2_name>
exit
zoneset activate name <zoneset_name> vsan <vsan_id>
exit

So now we have our NexusA configuration:
zoneset name UCSZonesetA vsan 101
member UCSServ01_a
member UCSServ02_a
member UCSServ03_a
exit
zoneset activate name UCSZonesetA vsan 101
exit

And our NexusB configuration:
zoneset name UCSZonesetB vsan 102
member UCSServ01_b
member UCSServ02_b
member UCSServ03_b
exit
zoneset activate name UCSZonesetB vsan 102
exit

So that’s how we compose our zoning configuration, and apply it to our Nexus switch. Hopefully this will be a useful reference on how to do this.

NetApp SnapCenter 1.0 – a new hope…

NetApp recently released version 1.0 of a new software offering going by the name of SnapCenter. It's a long-held tradition that 80% of NetApp's releases contain the word 'snap', continuing to point to their ages-old innovation in storage: snapshot technology providing efficient, speedy backups of your precious data.


So what does SnapCenter bring to the table that we did not have before? Well first we need some context…

SnapDrive is Windows/UNIX software which taps into a NetApp storage system, allowing the provisioning, backup, restoration, and administration of storage resources without having to log directly onto the storage system. This enables application owners to take control of their own backup/restore operations and therefore feel more able to manage their data. For applications or server roles which are not sensitive to backup consistency, the backup/restore features in SnapDrive are fine. For applications which do have this concern, NetApp provide another solution.

With me so far? Good. So SnapDrive is supplemented by the SnapManager suite of products. These have been built up over a long period of time by NetApp, and integrate directly with applications like:

  • SQL Server
  • Oracle
  • VMware
  • Hyper-V
  • Sharepoint
  • Exchange
  • SAP

These applications have vastly different purposes, but have equally unique requirements in terms of backing up their data in an application consistent way. Usually creating a backup/restore strategy which produces application consistent backups requires detailed understanding of the application, and is not integrated with the features presented by the underlying storage.

The SnapManager suite of products fills this gap, delivering a simplified, storage-integrated, application consistent method of easily backing up and restoring data, and providing the features that application owners desire. Further to this, it gives the application owners a simple GUI to take ownership of their own backup and recovery, whilst ensuring nothing in the underlying storage will break.

But this panacea to the challenge of backup and recovery, and its place within the application stack, is not without fault. Many criticisms have been levelled at the SnapManager suite over the years. The main two criticisms which I believe SnapCenter addresses are:

  1. Inconsistent user interfaces – the SnapManager suite was built up over time by NetApp, and many of the products were developed by different internal teams. This means the resulting software looks and feels very different as you transition from one product to another, which complicates administration for infrastructure administrators because they end up with multiple GUIs to learn instead of one
  2. Scalability issues – to be fair to NetApp, this is not just an issue with their solution; a previous workplace of mine were heavy users of IBM's Tivoli Storage Manager and that had a similar problem. As your environment grows, you may end up with tens of SQL servers, which means tens of instances of SnapManager for SQL to install, update, manage, and monitor. That could mean thousands upon thousands of reports and alerts to sift through each day, and without a solution to manage this, issues will go undiscovered for days, weeks, or even months. Once you add in your Exchange environments, vCenter servers, Sharepoint farms, Oracle servers etc., you may be looking at tens of thousands of backups running a day, and potentially hundreds of pieces of installed software to manage and try to keep an eye on

So how does SnapCenter address this? Well, with the release of Clustered Data ONTAP (CDOT) 8.3 at the start of 2015, and the end of NetApp's legacy 7-Mode operating system, there seems to have been a drive to revitalise their software and hardware lines, simplifying the available options, and pushing software interfaces to be web based rather than thick GUIs.

So the value proposition with SnapCenter is a centrally managed point of reference to control your backups programmatically, with a modern web-based interface, and the scalability to provide a workable solution regardless of the size of the estate being backed up. So let's look at these features, and how NetApp have delivered them:

1. Scalability

Scalability is delivered using the Windows NLB and ARR (Application Request Routing, basically a reverse web proxy) features, allowing for the creation of a farm of SnapCenter servers up to the 32-node maximum allowed by Windows NLB.

SnapCenter uses a SQL database as its back end; this can be either a local SQL Server Express instance (for small deployments), or a full SQL Server instance for scalable deployments.

2. Programmability

NetApp have been pretty good at including programmability in their more recent software offerings, and SnapCenter is no exception, providing a PowerShell cmdlet pack and, of course, the now ubiquitous REST API. SnapCenter is also policy-driven, which means that once you have created your backup policy sets you can apply them to new datasets you want to back up going forward; this helps to keep backups manageable as your infrastructure grows.
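As an illustration of what driving it programmatically might look like, the snippet below calls a REST API from PowerShell with Invoke-RestMethod. The server name, port, resource path, and payload here are purely hypothetical placeholders rather than the documented SnapCenter API, so treat this as a sketch of the pattern, not working syntax.

# Hypothetical example only - the URL, port, and payload below are placeholders, not the documented SnapCenter API
$server = "https://snapcenter.lab.local:8146"
$body   = @{ DatasetName = "SQLProd_Daily" } | ConvertTo-Json
Invoke-RestMethod -Uri "$server/api/backups" -Method Post -Body $body -ContentType "application/json"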

3. Interface

A web interface is a beautiful thing: accessing software from any browser on any OS makes life a lot easier for administrators, and not logging onto servers means less chance of breaking said servers. NetApp have chosen HTML5 for this interface, which does away with the pain of having to deal with the Java or Flash that plagues other web interfaces (UCS, VMware, I'm looking at you!). NetApp have raised the bar with the SnapCenter interface, producing a smart and stylish WUI not dissimilar to Microsoft's Azure interface.


Once you have installed the SnapCenter software on your Windows server, you will need to use the software to deploy the Windows and SQL Server plugins to your SQL servers. These plug-ins replace SnapDrive and SnapManager respectively, but this deployment process promises to be quick and painless, and a reboot should not be necessary. SnapCenter utilises the same licenses as SnapManager so if this is already licensed on your storage system then you are good to go. There is a migration feature present to help you move from SnapManager to SnapCenter, although this does not support migration of databases on VMDKs at this time.

The initial release of SnapCenter only interoperates with SQL Server, and with VMware through the Virtual Storage Console (VSC), so it probably won't replace many customers' full SnapManager install bases just yet, but the delivery team are promising rollouts of more plug-ins over the coming months.

There are limitations even in the SQL backup/recovery capabilities, although these will likely not affect many customers; they are detailed in the product Release Notes. The biggest of these, from what I can see, is that SnapCenter does not presently support SQL databases on SMB volumes.

Hopefully NetApp will provide regular, functionality-enhancing updates to this product so that it delivers on its promises. It would also be good to see some functionality enhancements over what is currently delivered by the SnapManager products. Top of the list from my perspective is allowing Exchange databases to reside on VMDK storage; the current restriction to LUNs makes things difficult, especially where customers are not deploying iSCSI, as it means the dreaded RDMs must be used in VMware, which as a VMware admin causes no end of headaches. It would also be nice to see this offered at some point as a virtual appliance, perhaps with an embedded PostgreSQL-type database similar to what VMware offer for the vCenter Server Appliance, but that will be way down the line I would imagine, as providing an appliance that scales well is a difficult thing.

NetApp have promised to continue to deliver SnapManager products for the time being; this is needed because of the lack of 7-Mode support in SnapCenter. Having worked extensively with both CDOT and 7-Mode though, I think there are many compelling reasons to move to CDOT if possible, and this seems like a fair compromise. SnapCenter can be installed quickly and tested out without committing to moving all your databases over to it, so give it a try, it's the future after all!

NetApp Clustered Data ONTAP (CDOT) 8.3 Reversion to 8.2 7-Mode

A project came in at work to build out a couple of new NetApp FAS2552 arrays; this was to replace old FAS2020s for a customer who was using FCP in their Production datacenter, and iSCSI in their DR datacenter, with a semi-synchronous SnapMirror relationship between the two.

The new arrays arrived on site, and we set them up separately from the production network to configure them. We quickly identified that the 2552s were running ONTAP 8.3RC1, which is how they were shipped from the factory. Nobody had any experience with Clustered Data ONTAP, but this didn't seem too much of a challenge, as it did not seem hugely different.

After looking at what to do next, it appeared that transitioning SAN volumes from 7-Mode to Clustered Data ONTAP is not possible, so the decision was taken to downgrade the OS from 8.3RC1 to 8.2 7-Mode to make the transition of the customer's data, and the downtime during the switchover from old arrays to new, as easy and quick as possible.

We got there in the end, but due to the tomes of documentation we had to trawl through and tie together, I decided to document the process, to assist any would-be future CDOT luddites in carrying out this task.

NOTE: This has not been tested on anything other than a FAS2552 with two controllers, and if you are in any way uncertain I would suggest contacting NetApp support for assistance. As this was a brand new array, and there was no risk of data loss, we proceeded regardless. You will need a NetApp support account to access some of the documentation and downloads referenced below. This is the way we completed the downgrade; I am not saying it is the best way, and although I have many years' experience of working with NetApp arrays, this is just a guide.

  • Downloading and updating the boot image:

We decided on 8.2.3 for our boot image, as this was the last release of Data ONTAP to include 7-Mode. If you go to http://mysupport.netapp.com/NOW/cgi-bin/software/ and select your array type you will see the available versions for your array. There are pages of disclaimers to agree to, and documents of pre-requisites and release notes for each version; these are worth reading to ensure there are no known issues with your array type. Eventually you will get the download, which will be a .tgz file.

You will now need a system with IP connectivity to both controllers, and something like FileZilla Server to host the file via FTP. This will allow you to get the file up to the controllers. I am not going to include steps to set up your FTP server, but there are plenty of resources online for this. You could also host the file over HTTP using something like IIS if that is more convenient.

Now to pull the image onto the array. This will need doing on both controllers (nodes); this document was followed, specifically the following command (based on content on page 143):

 system node image get -node localhost -package <location> -replace-package true

I changed the documented command to replace '-node *' with '-node localhost' so we could download the image to each node in turn; this was just to ensure we could tackle any issues with the download. I also removed the '-background true' switch, which would have run the download in the background; this was to give us maximum visibility.

Now, our cluster had never been properly configured, but there are a bunch of checks to do at this point to ensure your node is ready for the reversion; these are all detailed in the document referenced above and should be followed to make sure nothing is amiss. We ran through these checks prior to installing the newly downloaded image.

Once happy, the image can be installed by running:

system node image update -node localhost -package file:///mroot/etc/software/<image_name>

The image name will be the name of the .tgz file you downloaded to the controller earlier (including the extension).

Once the image is installed, you can check the state of the installation with:

 system image show

This should show something like:

(Screenshot: 'system image show' output for one of the controllers)

This output is for one controller only, but it confirms that the image we are reverting to is loaded onto the system, and we can move on.

There are some more steps in the document to follow, ensuring the cluster is shut down and failover is disabled before we can revert; follow these from the same document as above.

Next we would normally run 'revert_to 8.2' to revert the firmware. However, we had issues at this point because of ADP (Advanced Drive Partitioning), which seems to mark the disks as being in a shared container. Dan Barber's excellent article here goes into the background. Long story short, we decided to reboot and format the array again to get around this.

  • Re-zeroing the disks and building new vol0:

We rebooted the first controller, and saw that when it came back up it was running 8.2.3 (yay) in Cluster Mode (boo). We tried zeroing the disks and building a new vol0 by interrupting the boot sequence with Ctrl+C to get to the special boot menu and running option 4. This was no good for us though, because once built, the controller booted into 8.2.3 Cluster Mode; a new tactic would be required.

We found this post on Krish Palamadathil's blog, which detailed how to get around this. The downloaded image contains both Cluster Mode and 7-Mode images, but boots into Cluster Mode by default when doing this reversion. Cutting to the chase, the only thing we needed to do was to get to the Boot Loader (Ctrl+C during reboot to abort the boot process), and then run the following commands:

 LOADER> set-defaults 
 LOADER> boot_ontap

We then saw the controller come up in 8.2.3 7-Mode, interrupted the boot sequence, and ran option 4 to zero the disks again and build a new vol0.

Happy to say that the array is now at the correct version and in a state where it can be configured. As usual, the NetApp documentation was great, even if we had to source steps from numerous different places. As this is still a very new version of Data ONTAP I would expect the documentation to get better over time; in the meantime, hopefully this guide can be of use to people.

VM Swap File location considerations

When a VM is powered on, a .vswp (virtual machine swap) file is created (note there is also a vmx swap file which gets created in the same location as the VM; this is separate from this discussion, but is described here). Its size is the memory allocation for the VM less any reservation configured. If there is not sufficient space in the configured swap file location to create this file, then the VM will not power on. The use of this file for memory pages is a last resort, and will be considerably slower than using normal memory, even if that memory is compressed or shared. It will always be possible to get into a situation where memory contention is occurring and the swap file starts being used, so a system design should consider the location of swap files for VMs. Below I discuss some of the considerations which should be made when placing a VM swap file:

  • The default location for the swap file is the same datastore as the VM; this presents the following problems:
    • Performance – it is unlikely that the datastore the VM sits in is on top-tier storage, or limited to a single VM. This means that the difference in speed between memory IO and swapping IO once contention occurs will be great, and the additional IO this swapping produces could well impact other workloads on the datastore. If there are multiple VMs sharing this datastore, all running on the host with memory contention issues, then this will be further compounded and could see the datastore's performance plummet
    • Capacity – inevitably, administrators will keep chucking workloads into a datastore, and unless Storage DRS with datastore clusters is being used, or the administrators are proactive in balancing storage workloads, there will come a time when a VM will not power on due to insufficient space to create the .vswp file. This is particularly likely after a change to the VM configuration, such as adding more disk or memory

VM swap file location can be changed either at the cluster, or host level. When choosing to move this from the default, the following should be considered:

  • Place swap files on the fastest storage possible – if you can place this on flash storage then fantastic; it will not be as quick as paging to/from memory, but it will be many magnitudes better than placing it on spinning disk
  • Place swap files as close to the host as possible – the latency incurred by traversing your SAN/IP network to get to shared storage will impair guest performance when swapping occurs. Be aware that although the default location can be changed to host-local storage (which will probably give the best performance if the host has internal flash storage), this will impair vMotion performance massively, as the entire .vswp file would need to be copied from the source host to the destination host's disk during the vMotion activity
  • Do not place the .vswp on replicated storage – as with location selection for guest OS swap files, there is no point in placing the file on replicated storage; if the VM is reset or powered off then this file is deleted. If your VMs are on storage which is replicated as part of its standard capability, then the .vswp files should definitely be located elsewhere

 

In terms of configuring the location, as stated above, this is set at either the VM, host, or cluster level. If this is inconsistent across hosts in a cluster then again this may impact vMotion times, as the VM migrates from a host with one configured location to another with a different location. As with most settings which can be made at the cluster level, consistency should be maintained across the cluster unless this is not possible. Bear in mind though, that having the vswp location consistent across the cluster, and defined as a single datastore, could lead to high IOPS on that datastore should cluster-wide memory contention occur, especially with large clusters.
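For reference, the relevant settings can be driven from PowerCLI; the sketch below assumes an existing vCenter connection, with the cluster, host, and datastore names as placeholders. Set-Cluster controls whether swap files live with the VM or in a host-specified datastore, and Set-VMHost nominates that datastore for each host.

# Rough sketch - cluster, host, and datastore names are placeholders for your own environment
# Cluster-level policy: store swap files in the datastore specified by each host
Set-Cluster -Cluster "Prod-Cluster" -VMSwapfilePolicy InHostDatastore -Confirm:$false

# Per host: nominate a fast, non-replicated datastore for swap files
$vmhost        = Get-VMHost -Name "esxi01.lab.local"
$swapDatastore = Get-Datastore -Name "esxi01-local-ssd"
Set-VMHost -VMHost $vmhost -VMSwapfileDatastore $swapDatastore -Confirm:$false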

As stated at the beginning of this article, swap files are sized based on the VM memory allocation less the reservation size. By right-sizing VMs and utilising reservations, swap file sizes, and their usage, can be kept to a minimum, and these planning considerations should take precedence over all others. Hopefully memory contention will never be so bad that swap will be required, but when the day does come it is good to be prepared, by making informed and reasoned decisions early on.
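Since the swap file is simply allocated memory minus the memory reservation, it is easy to estimate how much datastore space your powered-on VMs could consume. A rough PowerCLI sketch, assuming an existing vCenter connection:

# Rough sketch: estimated .vswp size per powered-on VM (allocated memory minus memory reservation)
Get-VM | Where-Object { $_.PowerState -eq "PoweredOn" } | ForEach-Object {
    $resCfg = Get-VMResourceConfiguration -VM $_
    [pscustomobject]@{
        VM            = $_.Name
        MemoryMB      = $_.MemoryMB
        ReservationMB = $resCfg.MemReservationMB
        SwapFileMB    = $_.MemoryMB - $resCfg.MemReservationMB
    }
}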

VMFS Extents – to extend or not to extend, that is the question

Extents allow disks presented to a vSphere system to be added to a VMFS datastore to extend the file system; this aggregates multiple disks together and can be useful in a number of scenarios. Recently I saw problems where extents were being used spanning two storage systems; one of the storage systems had a controller failure which caused SCSI reservation issues on one of the LUNs making up the extent, and this caused the entire datastore to go offline.

In this article I want to discuss some of the benefits and potential pitfalls in using VMFS extents in vSphere environments. Ultimately this is an available, supported, and sometimes useful feature of vSphere but there are some limitations or weaknesses that using this can bring.

Advantages:

  • Using extents allows you to create datastores up to the maximum size supported by vSphere, even on pre-VMFS-5 datastores where a single extent is limited to roughly 2TB. It can be useful to create large datastores for the following reasons:
    • There may be a requirement to natively present a VMDK which is larger than the maximum LUN size available on your storage system. For example, if 2TB is the largest LUN you can present, but you need a 4TB disk for the application your VM is hosting, the aggregation of disks will allow the creation of a VMFS datastore large enough to deliver this without the need to span volumes in the guest OS, or fall back to using something like RDMs, which may impinge on other vSphere functionality
    • Datastore management will be simplified, with fewer VMFS datastores required. The fewer datastores available, the less an administrator has to keep their eyes on. In addition to this, decisions on VM placement are made considerably simpler if there are fewer choices
  • Adding space to a datastore with capacity issues; in a previous role we were constrained by storage space more than any other resource, which meant that on both the storage system (a NetApp FAS2050 with a single shelf of storage) and at the VMFS level, the design left little to no room to extend a VMDK should it be required. If we did need to add space to a VMDK, we had to extend the volume and LUN by the required amount on the filer, and add a small extent to the datastore in vSphere

Disadvantages:

  • Introduces a single point of failure; whether you are aggregating disks from one or multiple storage systems, by adding extents to a volume the head extent in the aggregated datastore (the first LUN added to the datastore) becomes a single point of failure. Further, if any of the member LUNs becomes unavailable, VMs which have any blocks whatsoever on the lost LUN will no longer be available
  • Management from the storage side can become more difficult; given that there may be multiple LUNs, from multiple storage systems, aggregated to form a single datastore, it is harder from the storage side to identify which LUNs relate to which datastores in vSphere. To combat this it is important to document the addition of extents well, and label LUNs accordingly on the storage system
  • If extents are combined which span different storage devices then there may well be a loss in performance

The above is all just based on my experiences, but it seems there are legitimate use cases for choosing to use, or not use, extents. My personal preference would be to present a new, larger LUN where possible, format it with VMFS, and use Storage vMotion to migrate VMs to the new datastore. Given that VMFS-5 introduced GPT as the partitioning method for LUNs, and we can now create single-extent datastores up to 64TB in size, the requirement for using extents should be diminished. There are often legitimate reasons, especially in older environments, why this is not practical or possible however, and in these cases using extents is perfectly valid.
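To illustrate that preferred approach, here is a rough PowerCLI sketch, assuming an existing vCenter connection; the host name, the LUN's canonical name, and the datastore names are placeholders for your own values.

# Rough sketch - host, canonical LUN name, and datastore names are placeholders
$vmhost = Get-VMHost -Name "esxi01.lab.local"

# Format the newly presented, larger LUN as a single-extent VMFS datastore
$newDatastore = New-Datastore -VMHost $vmhost -Name "Datastore-Large01" -Path "naa.60a98000486e5334524a6c4f63624f73" -Vmfs

# Storage vMotion the VMs off the old extent-based datastore
Get-VM -Datastore "Datastore-Extents01" | Move-VM -Datastore $newDatastore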

Storage I/O Control – what to expect

Storage I/O Control, or SIOC, was introduced back in vSphere 4.1; it provides a way for vSphere to combat what is known as the 'noisy neighbour' syndrome. This describes the situation where multiple VMs reside on a single datastore, and one or more of these VMs take more than their fair share of bandwidth to the datastore. This could be happening because a VM decides to misbehave, because of poor choices in VM placement, or because workloads have changed.

The reigning principle behind SIOC is one of fairness, allowing all VMs a chance to read and write without being swamped by one or more ‘greedy’ VMs. This is something which, in the past, would have been controlled by disk shares, and indeed this method can still be used to prioritise certain workloads on a datastore over others. The advantage with SIOC is that, other than the couple of configurable settings, described below, no manual tinkering is really required.

Options available for Storage I/O Control

There are only two settings to pick for SIOC:

1) SIOC Enabled/Disabled – either turn SIOC on, or off, at the datastore level. More on considerations for this further down

2) Congestion Threshold – this is the trigger point at which SIOC will kick in and start doing its thing, throttling I/O to the datastore. This can be configured with one of two types of value:

a) Manual – this is set in milliseconds and defaults to 30ms, but the right value varies depending on your storage. VMware have tables on how to calculate this in their SIOC best practice guide, but the default should be fine for most situations. If in doubt, your storage provider should be able to give guidance on the correct value to choose.

b) Percentage of peak throughput – this is only available through the vSphere Web Client, and was added in vSphere 5.1. It takes the guesswork out of setting the threshold, replacing it with an automated method whereby vSphere analyses the datastore's I/O capabilities and uses this to determine the peak throughput.
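Both of these settings can also be set from PowerCLI; a minimal sketch follows, assuming an existing vCenter connection and placeholder datastore names. Set-Datastore exposes the enable flag and the manual (millisecond) congestion threshold; the percentage-of-peak option is, as noted above, a Web Client setting.

# Minimal sketch - datastore names are placeholders
# Enable SIOC with the default 30ms congestion threshold on a group of datastores
Get-Datastore -Name "Prod-DS-*" | Set-Datastore -StorageIOControlEnabled $true -CongestionThresholdMillisecond 30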

My experience of using SIOC is described in the following paragraphs; improvements were seen, and no negative performance impact was experienced (as expected), although some unexpected results were received.

Repeated latency warnings similar to the following from multiple hosts were seen, for multiple datastores across different storage systems:

Device naa.5000c5000b36354b performance has deteriorated. I/O latency increased from average value of 1832 microseconds to 19403 microseconds

These warnings report the latency time in microseconds, so in the above example the latency is going from 1.8ms to 19ms: still a workable latency, but the rise is flagged due to the large increase (in this case by a factor of ten). The results seen in the logs were much worse than this though; sometimes latency was rising to as much as 20 seconds, and this was happening mostly in the middle of the night.

After checking the storage configuration, it was identified that Storage I/O Control was turned off across the board. This is disabled by default for all datastores and as such had been left as it was. Turning SIOC on seemed like a sensible way forward, so the decision was taken to turn it on for some of the worst-affected datastores.

After turning on SIOC for a handful of datastores, a good reduction in the number of I/O latency doublings being reported in the ESXi logs was seen. Unfortunately a new message began to appear in the host event logs:

Non-VI workload detected on the datastore

This was repeatedly seen against the LUNs for which SIOC had been enabled; VMware have a knowledge base article which describes the issue. In this case, the problem stemmed from the fact that the storage backend providing the LUNs had a single disk pool (or MDisk Group, as this was presented by an IBM SVC) which was shared with unmanaged RDMs, and other storage presented outside the VMware environment.

The impact of this is that, whilst VMware plays nicely, throttling I/O access when the congestion threshold is reached, other workloads such as non-SIOC datastores, RDMs, or other clients of the storage group will not be so fair in their usage of the available bandwidth. This is because the underlying spindles are shared; one solution would be to present dedicated disk groups to VMware workloads, ensuring that all datastores carved out of these disks have SIOC turned on.

We use EMC VNX and IBM SVC as our storage of choice, and the recommendation from both these vendors is to turn SIOC on for all datastores, and to leave it on. I can only imagine that the reason this is still not a default is because it is not suitable for every storage type. As with all these things, checking your storage vendor's documentation is probably the best option, but SIOC should provide benefit in most use cases, although as described above, you may see some unexpected results. It is worth noting that this feature is Enterprise Plus only, so anyone running a less feature-packed version of vSphere will not be able to take advantage of it.