FlexPod and UCS – where are we now?

I have been working on FlexPod for about a year now, and on UCS for a couple of years. In my opinion it’s great tech: it takes a lot of the pain and guesswork out of designing new data center infrastructure solutions, and brings inherent simplicity, reliability and resilience to the space. Over the last year or so there has been a noticeable shift in the Cisco Validated Designs (CVDs) coming out of Cisco, and in the direction FlexPod is heading. I should note that some of what I discuss here is already clear, and some is mere conjecture. I think the direction FlexPod is taking reflects changes in the industry as a whole: converged solutions are becoming more and more desirable as the Enterprise technology world moves forward.

So what are the key changes we are seeing?

The SDN elephant in the room 

Cisco’s ACI (Application Centric Infrastructure) has taken a while to get moving; replacing an existing 3-tier network architecture with a leaf-spine network is not something that happens overnight. The switched fabric is arguably a great solution for modern data centers, where east-west traffic makes up the bulk of network activity and Spanning Tree Protocol continues to be the bane of network admins, but implementing it often requires either a green-field deployment, or a forklift replacement of large portions of a data center’s existing core networking.

That’s not to say ACI is not doing OK in terms of sales; Cisco’s figures, and case studies, seem to show that there is uptake, including some large customers taking it on. So how does this fit in with FlexPod? Well, the CVDs released over the last 12 months have all included the new Nexus 9000 series switches, rather than the previous stalwart, the Nexus 5000 series. These are ACI-capable switches which are also able to operate in the ‘legacy’ NX-OS mode. Capability-wise, in the context of FlexPod, there is not a lot of difference: they now have FCoE support, and can do vPC, QoS, Layer 3, and all the good stuff we have come to expect from Nexus switches.

So from what I can gather, the inclusion of 9K switches in the FlexPod line (outside of the FlexPod with ACI designs) is there to let FlexPod customers move more easily into the leaf/spine ACI network architecture at a later date, should they wish to. This makes sense, and the pricing on the 9Ks being used looks favourable compared to the 5Ks, so it is a win-win for customers, even if they never decide to go with ACI.

40GbE standard 

Recent announcements around the Gen 3 UCS Fabric Interconnects have revealed that 40GbE is now going to be the standard for UCS connectivity. The new chassis I/O module designs show 4 x 40GbE QSFP connections per fabric, for a total of 320Gbps of bandwidth per chassis. This is an incredible amount of throughput, and although I can’t see 99% of customers going anywhere near these levels, it does help to strengthen the UCS platform’s use cases for even the most demanding environments, and reduces the requirement for InfiniBand-type solutions in high-throughput environments.

Another interesting point, and following on from the ACI ramblings above, is that the new 6300 series Fabric Interconnects are now based on the Nexus 9300 switching line, rather than the Nexus 5K based 6200 series. This positions them perfectly to act as a leaf in an ACI fabric one day, should this become the eventual outcome of Cisco’s strategy.

HTML5 FTW! 

With the announcements about the new UCS generation came the news that, from UCS firmware version 3.1, the software for UCS is now unified across UCS classic, UCS Mini, and the newish M-Series systems. This simplifies things for people looking at version numbers and how they relate to each release, and means there should now be relative feature parity across all footprints of UCS systems.

The most exciting part, if you have experienced the long-running pain of Java, is that the new release incorporates the HTML5 interface which has been available on UCS Mini since its release. I’m sure this will bring new challenges of its own, but for now at least it is something fresh to look forward to for those running UCS classic.

FlexPod Mini – now not so mini 

FlexPod Mini is based on the UCS Mini release, which came out around 18 months ago. In UCS Mini, the I/O Modules (or FEXs) in the UCS 5108 chassis are replaced with UCS 6324 pocket-sized Fabric Interconnects, ultimately cramming a single chassis of UCS, and the attached network equipment, into just 6 rack units. This could be expanded with C-Series servers, but the scale for UCS blades was strictly limited to the 8-blade limit of a standard chassis. With the announcement of the new Fabric Interconnect models came the news of a new ‘QSFP Scalability Port License’, which allows the 40GbE port on the UCS 6324 FI to be used with a 1 x QSFP to 4 x SFP+ breakout cable to add another chassis to the UCS Mini.

Personally I haven’t installed a UCS Mini, but the form factor is a great fit for certain use cases, and the more flexible it is, the more desire there will be to use it. For FlexPod, this ultimately means more suitable use cases, particularly in ROBO (remote office/branch office) type scenarios.

What’s Next? 

So with FlexPod now having new switches and new UCS hardware, it seems something new from NetApp is next on the list. The FAS8000 series is a couple of years old now, so we will likely see a refresh at some point, probably with 40GbE on board, more flash options, and faster CPUs. The recent purchase of SolidFire by NetApp will also quite probably see some kind of SolidFire-based FlexPod CVD coming out of the Cisco/NetApp partnership in the near future.

We are also due some exciting (or at least as exciting as these things can be!) new software releases this year: Windows Server 2016, vSphere 6.5 (assuming this is revealed at VMworld), and OpenStack Mitaka, all of which will likely bring new CVDs.

In the zone…basic zoning on Cisco Nexus switches

In this post I go over some basic FCoE zoning concepts for Cisco Nexus switches. Although FCoE has not really captured the imagination of the industry, it is used in a large number of Cisco infrastructure deployments, particularly around Nexus and UCS technologies. My experience is mostly based on FlexPods, where we have this kind of design (this shows FCoE connectivity only, to keep things simple):

[Diagram: FlexPod FCoE connectivity, with each UCS fabric uplinked to its own Nexus switch and the NetApp controllers’ FCoE ports split across the two switches]

Zoning in a FlexPod is simple enough: we may have a largish number of hosts, but we are only zoning on two switches, and only have 4 or 8 FCoE targets, depending on our configuration. In fact the zoning configuration can be fairly easily automated using PowerShell, by tapping into the NetApp, Cisco UCS, and NX-OS APIs. The purpose of this post, though, is to describe the configuration steps required to complete the zoning.

The purpose of zoning is to restrict which initiators (host ports) can communicate with which targets (storage ports) on the fabric. Combined with LUN masking on the storage (igroups in NetApp terms), this controls which hosts can access a given LUN (Logical Unit Number), essentially a Fibre Channel block device. This is useful in the case of boot disks, where we only ever want a single host accessing that device, and in the case of shared data devices, like cluster shared disks in Microsoft clustering, or VMFS datastores in the VMware world, where we only want a subset of hosts to be able to access the device.
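
As an aside, the LUN-masking half of this happens on the storage rather than on the switches. A minimal clustered Data ONTAP sketch of creating an igroup for one host and mapping a boot LUN to it might look something like the lines below; the SVM name, volume path and igroup name are made up for illustration, the ostype assumes ESXi hosts, and the WWPNs match the example initiators used later in this post:

lun igroup create -vserver <svm_name> -igroup UCSServ01 -protocol fcp -ostype vmware -initiator 50:02:77:a4:10:0c:0a:01,50:02:77:a4:10:0c:0b:01
lun map -vserver <svm_name> -path /vol/esx_boot/UCSServ01_boot -igroup UCSServ01 -lun-id 0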

I found configuring zoning on a Cisco switch took a bit of getting my head around, so hopefully the explanation below will help to make this simpler for someone else.

From the Cisco UCS (or whatever server infrastructure you are running), you will need to gather a list of the WWPNs for the initiators wanting to connect to the storage. These will be in the format of 50:02:77:a4:10:0c:4e:21, an 8-byte (16 hex digit) identifier. Likewise, you will need to gather the WWPNs from your storage (in the case of FlexPod, your NetApp storage system).
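
If the hosts and storage are already connected and logged into the fabric, you can also sanity-check the WWPNs you have gathered from the switch side; the following NX-OS commands, run on each Nexus, list the logged-in initiators and targets along with their WWPNs:

show flogi database
show fcns database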

Once we have these, we are ready to do our zoning on the switch. When doing the zoning configuration there are three main elements of configuration we need to understand:

  1. Aliases – these map a WWPN to a friendly name, and sit in the device-alias database on the switch. You can get away without using them and just use the raw WWPNs later, but that will make things far more difficult should something go wrong.
  2. Zones – these logically group initiators and targets together, meaning that only the device aliases listed in the zone are able to talk to one another. This provides security and ease of management; a device can exist in more than one zone.
  3. Zonesets – these group zones together, allowing the administrator to bring all the zones online or offline together. Only one zoneset can be active at a time in a given VSAN.

On top of this, there is one more thing to understand when creating zoning on our Nexus switch, and that is the concept of a VSAN. A VSAN, or Virtual Storage Area Network, is the Fibre Channel equivalent of a VLAN. It is a logical collection of ports which together form a single discrete fabric.
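
As a further aside, the zoning examples below assume the VSAN already exists and the FCoE VLAN-to-VSAN mapping is in place. If you are starting from scratch on a Nexus 5K, the rough flow is: enable the FCoE feature, create the VSAN, map an FCoE VLAN to it, create a virtual Fibre Channel (vfc) interface bound to the relevant uplink, and add that vfc to the VSAN. A minimal sketch is below; VSAN/VLAN 101, vfc11 and Ethernet1/1 are illustrative values only (in a FlexPod the CVD specifies the exact numbering, and the vfc is typically bound to the port-channel facing the UCS Fabric Interconnect):

configure terminal
feature fcoe
vsan database
vsan 101 name Fabric-A
exit
vlan 101
fcoe vsan 101
exit
interface vfc11
bind interface Ethernet1/1
no shutdown
exit
vsan database
vsan 101 interface vfc11
exit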

So let’s build a fictional scenario and create the zoning configuration for it. We have a FlexPod with two Nexus 5K switches, with separate fabrics, as shown in the diagram above, meaning that the 0a ports on our servers only go to the Nexus on fabric A, and the 0b ports only go to the Nexus on fabric B. Both of our e0c ports on the NetApp storage go to NexusA, and both e0d ports go to NexusB:

NAFAS01 – e0c, e0d

NAFAS02 – e0c, e0d

And three Cisco UCS service profiles, each with two vHBAs, wanting to access storage on these targets. These are created as follows:

UCSServ01 – 0a, 0b

UCSServ02 – 0a, 0b

UCSServ03 – 0a, 0b

So on NexusA, we need the following aliases in our database:

Device     Port  WWPN                     Alias Name
NAFAS01    e0c   35:20:01:0c:11:22:33:44  NAFAS01_e0c
NAFAS02    e0c   35:20:02:0c:11:22:33:44  NAFAS02_e0c
UCSServ01  0a    50:02:77:a4:10:0c:0a:01  UCSServ01_0a
UCSServ02  0a    50:02:77:a4:10:0c:0a:02  UCSServ02_0a
UCSServ03  0a    50:02:77:a4:10:0c:0a:03  UCSServ03_0a

And on NexusB, we need the following:

Device     Port  WWPN                     Alias Name
NAFAS01    e0d   35:20:01:0d:11:22:33:44  NAFAS01_e0d
NAFAS02    e0d   35:20:02:0d:11:22:33:44  NAFAS02_e0d
UCSServ01  0b    50:02:77:a4:10:0c:0b:01  UCSServ01_0b
UCSServ02  0b    50:02:77:a4:10:0c:0b:02  UCSServ02_0b
UCSServ03  0b    50:02:77:a4:10:0c:0b:03  UCSServ03_0b

And the zones we need on each switch are, firstly for NexusA:

Zone Name    Members
UCSServ01_a  NAFAS01_e0c, NAFAS02_e0c, UCSServ01_0a
UCSServ02_a  NAFAS01_e0c, NAFAS02_e0c, UCSServ02_0a
UCSServ03_a  NAFAS01_e0c, NAFAS02_e0c, UCSServ03_0a

And for Nexus B:

Zone Name    Members
UCSServ01_b  NAFAS01_e0d, NAFAS02_e0d, UCSServ01_0b
UCSServ02_b  NAFAS01_e0d, NAFAS02_e0d, UCSServ02_0b
UCSServ03_b  NAFAS01_e0d, NAFAS02_e0d, UCSServ03_0b

This gives us a zone for each server vHBA to boot from, allowing that vHBA to boot from either of the NetApp interfaces it can see on its fabric. The boot order itself is controlled from within UCS; by zoning the server to boot on either fabric we create resilience. All of this is just to demonstrate how we construct the zoning configuration, so things will no doubt be different in your environment.

So now that we know what should be in our populated alias database and our zone configuration, we just need to create our zonesets, one per fabric. First, the zoneset for NexusA:

Zoneset Name  Members
UCSZonesetA   UCSServ01_a, UCSServ02_a, UCSServ03_a

And the zoneset for NexusB:

Zoneset Name  Members
UCSZonesetB   UCSServ01_b, UCSServ02_b, UCSServ03_b

Now we are ready to put this into some NX-OS CLI and enter it on our switches. The general commands for creating new aliases are:
device-alias database
device-alias name <alias_name> pwwn <device_wwpn>
exit
device-alias commit

So for our NexusA, we do the following:
device-alias database
device-alias name NAFAS01_e0c pwwn 35:20:01:0c:11:22:33:44
device-alias name NAFAS02_e0c pwwn 35:20:02:0c:11:22:33:44
device-alias name UCSServ01_0a pwwn 50:02:77:a4:10:0c:0a:01
device-alias name UCSServ02_0a pwwn 50:02:77:a4:10:0c:0a:02
device-alias name UCSServ03_0a pwwn 50:02:77:a4:10:0c:0a:03
exit
device-alias commit

And for Nexus B, we do:
device-alias database
device-alias name NAFAS01_e0d pwwn 35:20:01:0d:11:22:33:44
device-alias name NAFAS02_e0d pwwn 35:20:02:0d:11:22:33:44
device-alias name UCSServ01_0b pwwn 50:02:77:a4:10:0c:0b:01
device-alias name UCSServ02_0b pwwn 50:02:77:a4:10:0c:0b:02
device-alias name UCSServ03_0b pwwn 50:02:77:a4:10:0c:0b:03
exit
device-alias commit
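
Before moving on, it is worth confirming the aliases committed as expected; you can check on each switch with:

show device-alias database
show device-alias status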

So that’s our alias database taken care of; now we can create our zones. The command set for creating a zone is:
zone name <zone_name> vsan <vsan_id>
member device-alias <device_1_alias>
member device-alias <device_2_alias>
member device-alias <device_3_alias>
exit

I will use VSAN IDs 101 for fabric A, and 102 for fabric B. So here we will create our zones for NexusA:
zone name UCSServ01_a vsan 101
member device-alias NAFAS01_e0c
member device-alias NAFAS02_e0c
member device-alias UCSServ01_0a
exit
zone name UCSServ02_a vsan 101
member device-alias NAFAS01_e0c
member device-alias NAFAS02_e0c
member device-alias UCSServ02_0a
exit
zone name UCSServ03_a vsan 101
member device-alias NAFAS01_e0c
member device-alias NAFAS02_e0c
member device-alias UCSServ03_0a
exit

And for NexusB:
zone name UCSServ01_b vsan 102
member device-alias NAFAS01_e0d
member device-alias NAFAS02_e0d
member device-alias UCSServ01_0b
exit
zone name UCSServ02_b vsan 102
member device-alias NAFAS01_e0d
member device-alias NAFAS02_e0d
member device-alias UCSServ02_0b
exit
zone name UCSServ03_b vsan 102
member device-alias NAFAS01_e0d
member device-alias NAFAS02_e0d
member device-alias UCSServ03_0b
exit
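
If you want to double-check the zone membership on each fabric before activating anything, you can list the zones for that fabric’s VSAN, for example on NexusA:

show zone vsan 101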

So that is all of our zones created; now we just need to create and activate our zonesets and we have our completed zoning configuration. The commands to create and activate a zoneset are:
zoneset name <zoneset_name> vsan <vsan_id>
member <zone_1_name>
member <zone_2_name>
exit
zoneset activate name <zoneset_name> vsan <vsan_id>
exit

So now we have our NexusA configuration:
zoneset name UCSZonesetA vsan 101
member UCSServ01_a
member UCSServ02_a
member UCSServ03_a
exit
zoneset activate name UCSZonesetA vsan 101
exit

And our NexusB configuration:
zoneset name UCSZonesetB vsan 102
member UCSServ01_b
member UCSServ02_b
member UCSServ03_b
exit
zoneset activate name UCSZonesetB vsan 102
exit
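
Finally, it is worth verifying that the zoneset is active and contains what you expect; on each switch, using that fabric’s VSAN ID:

show zoneset active vsan 101
show zone status vsan 101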

So that’s how we compose our zoning configuration, and apply it to our Nexus switch. Hopefully this will be a useful reference on how to do this.

NetApp SnapCenter 1.0 – a new hope…

NetApp recently released version 1.0 of a new software offering going by the name of SnapCenter. It’s a long-held tradition that 80% of NetApp’s releases contain the word ‘Snap’, a nod to their age-old storage innovation: snapshot technology, which provides efficient, speedy backups of your precious data.

So what does SnapCenter bring to the table that we did not have before? Well first we need some context…

SnapDrive is Windows/UNIX software which taps into a NetApp storage system, allowing the provisioning, backup, restoration, and administration of storage resources without having to log directly onto the storage system. This enables application owners to take control of their own backup/restore operations and therefore feel more able to manage their data. For applications or server roles which are not subject to backup consistency issues, the backup/restore features in SnapDrive are fine. For applications which do have this concern, NetApp have provided another solution.

With me so far? Good. So SnapDrive is supplemented by the SnapManager suite of products. These have been built up over a long period of time by NetApp, and integrate directly with applications like:

  • SQL Server
  • Oracle
  • VMware
  • Hyper-V
  • Sharepoint
  • Exchange
  • SAP

These applications have vastly different purposes, but have equally unique requirements in terms of backing up their data in an application consistent way. Usually creating a backup/restore strategy which produces application consistent backups requires detailed understanding of the application, and is not integrated with the features presented by the underlying storage.

The SnapManager suite of products fills this gap, delivering a simplified, storage-integrated, application consistent method of easily backing up and restoring data, and providing the features that application owners desire. Further to this, it gives the application owners a simple GUI to take ownership of their own backup and recovery, whilst ensuring nothing in the underlying storage will break.

But this panacea to the challenge of backup and recovery, and its place within the application stack, is not without fault. Many criticisms have been levelled at the SnapManager suite over the years. The main two criticisms which I believe SnapCenter addresses are:

  1. Inconsistent user interfaces – the SnapManager suite was built up over time by NetApp, and many of the products were developed by different internal teams. As a result, the software looks and feels very different as you transition from one product to another. This complicates administration for infrastructure administrators, because they end up with multiple GUIs to learn instead of a single GUI.
  2. Scalability issues – to be fair to NetApp, this is not just an issue with their solution; a previous workplace of mine were heavy users of IBM’s Tivoli Storage Manager, which had a similar problem. As your environment grows, you may end up with tens of SQL servers, which means tens of instances of SnapManager for SQL to install, update, manage, and monitor. That could mean thousands upon thousands of reports and alerts to sift through each day, and without a solution to manage this, issues will go undiscovered for days, weeks or even months. Once you add in your Exchange environments, vCenter servers, Sharepoint farms, Oracle servers and so on, you may be looking at tens of thousands of backups running a day, and potentially hundreds of pieces of installed software to manage and try to keep an eye on.

So how does SnapCenter address this problem? Well, with the release of Clustered Data ONTAP (CDOT) 8.3 at the start of 2015, and the end of NetApp’s legacy 7-Mode operating system, there seems to have been a drive to revitalise their software and hardware lines, simplifying the available options, and pushing software interfaces to be web based, rather than thick GUIs.

So the value proposition with SnapCenter is a centrally managed point of reference to control your backups programmatically, with a modern web-based interface, and scalability to provide a workable solution regardless of the size of the estate being backed up. So let’s look at these features, and how NetApp have delivered them:

1. Scalability

Scalability comes from the Windows NLB and ARR (Application Request Routing, basically a reverse web proxy) features, which allow the creation of a farm of SnapCenter servers up to the 32-node maximum allowed by Windows NLB.

SnapCenter utilises a SQL Server database as its back end; this can be either a local SQL Server Express instance (for small deployments) or a full SQL Server instance for scalable deployments.

2. Programmability

NetApp have been pretty good at including programmability in their more recent software offerings, and SnapCenter is no exception, providing a PowerShell cmdlet pack and the now ubiquitous REST API. SnapCenter is also policy-driven: once you have created your backup policies you can apply them to new datasets you want to back up going forward, which helps keep the manageability of backups under control as your infrastructure grows.

3. Interface

A web interface is a beautiful thing: accessing software from any browser on any OS makes life a lot easier for administrators, and not logging onto servers means less chance of breaking said servers. NetApp have chosen HTML5 for this interface, which does away with the pain of having to deal with the Java or Flash that plagues other web interfaces (UCS and VMware, I’m looking at you!). NetApp have raised the bar with the SnapCenter interface, producing a smart and stylish WUI not dissimilar to Microsoft’s Azure interface.

Once you have installed the SnapCenter software on your Windows server, you will need to use the software to deploy the Windows and SQL Server plugins to your SQL servers. These plug-ins replace SnapDrive and SnapManager respectively, but this deployment process promises to be quick and painless, and a reboot should not be necessary. SnapCenter utilises the same licenses as SnapManager so if this is already licensed on your storage system then you are good to go. There is a migration feature present to help you move from SnapManager to SnapCenter, although this does not support migration of databases on VMDKs at this time.

The initial release of SnapCenter only interoperates with SQL Server, and with VMware through the Virtual Storage Console (VSC), so it probably won’t replace many customers’ full SnapManager install bases just yet, but the delivery team are promising to roll out more plug-ins over the coming months.

There are limitations even in the SQL backup/recovery capabilities, although these will likely not affect many customers; they are detailed in the product Release Notes. The biggest, from what I can see, is that SnapCenter does not presently support SQL databases on SMB volumes.

Hopefully NetApp will provide regular, functionality-enhancing updates to this product so that it delivers on its promises. It would also be good to see some functionality enhancements over what is currently delivered by the SnapManager products. Top of the list from my perspective is allowing Exchange databases to reside on VMDK storage: the current LUN-only restriction makes things difficult, especially where customers are not deploying iSCSI, as it means the dreaded RDMs must be used in VMware, which as a VMware admin causes no end of headaches. It would also be nice to see this offered at some point as a virtual appliance, perhaps with an embedded PostgreSQL-type database similar to what VMware offer with the vCenter Server Appliance, but I imagine that is way down the line, as providing an appliance that scales well is a difficult thing.

NetApp have promised to continue delivering the SnapManager products for the time being; this is needed because of the lack of 7-Mode support in SnapCenter. Having worked extensively with both CDOT and 7-Mode, though, I think there are many compelling reasons to move to CDOT if possible, and this seems like a fair compromise. SnapCenter can be installed quickly and tested out without committing to moving all your databases over to it, so give it a try; it’s the future after all!

FlexPod 101 – What is a FlexPod?

I haven’t posted for a while; I started a new job, getting out of IT support and into the area I want to be in: designing and implementing infrastructure solutions as a FlexPod consultant. I had not worked with FlexPod as a concept before, but I have worked with the technology stack which comprises it. So far so good; it seems like a robust solution which provides a balance between scalability, performance and cost. I have decided to do a set of blog posts going through the concept and technology behind FlexPod, hopefully highlighting what sets it apart from the competition.

FlexPod: what the hell is that?

Over the last few years, the IT industry has moved from disparate silos for storage, compute and network towards the converged (and later hyper-converged) dream. One such player in this market is FlexPod.

A collaboration between NetApp and Cisco, FlexPod at a basic level comprises the following enterprise-class hardware:

  • Cisco Unified Computing System (UCS)
  • Cisco Nexus switching/routing
  • NetApp FAS Storage Arrays

This forms the hardware basis, and as is the industry’s wont, there is a swathe of virtualisation solutions and business-critical applications which can be used on top of the hardware:

  • VMware vSphere
  • Microsoft Hyper-V
  • Openstack
  • Citrix XenDesktop
  • SAP
  • Oracle
  • VMware Horizon View

I have been a fan of NetApp storage, and Cisco UCS compute for a while. They both offer simplicity and power in equal measure. My preference for hypervisor is ESXi but Hyper-V is becoming a more compelling solution with every release.

You throw an automation product like vRealize Automation or UCS Director on top of the stack and you have a powerful and modern private cloud solution which takes you beyond what a standard virtualised solution will deliver.

Why not just buy <enter converged/hyper-converged vendor here>?

But you can run this on any hardware, right? So what sets FlexPod apart from VCE’s Vblock, hyper-converged systems like Nutanix, SimpliVity, or just rolling your own infrastructure?

The answer is the Cisco Validated Design (CVD). This is, as the name suggests, a validated and documented build blueprint, which details proven hardware and software configurations for a growing number of solutions. This gives you confidence when implementing the solution that it will work, and goes towards putting the ‘Flex’ in FlexPod.

The other advantage of FlexPod over other converged/hyper-converged solutions is that you can tweak the scale of the hardware components (compute/storage/network) to make the solution larger in the areas where you need a capacity boost, while keeping it the same in the areas you don’t. Need 100TB of storage? Just buy more shelves. Need 100 hosts? Just buy more UCS chassis and blades. This non-linear scalability and flexibility separates FlexPod from rival solutions.

As far as software and general protocol usage go, FlexPod is fairly agnostic. You can use FCoE, NFS, FC or iSCSI as your storage protocols, and you can use whatever hypervisor you want as long as there is a CVD for it; chances are you can find one to suit your use case.

Where can I find more information on FlexPod?

The NetApp and Cisco sites have information about what a FlexPod consists of:

http://www.netapp.com/ca/solutions/cloud/flexpod/

http://www.cisco.com/c/en/us/solutions/data-center-virtualization/flexpod/index.html

The Cisco site also has links to the CVDs; these give a good overview of what FlexPod is about.

What’s next?

Part 2 of my FlexPod 101 series will go over the physical components of a FlexPod.

NetApp Cluster Mode Data ONTAP (CDOT) 8.3 Reversion to 8.2 7-Mode

A project came in at work to build out a couple of new NetApp FAS2552 arrays; this was to replace old FAS2020s for a customer who was using FCP in their Production datacenter, and iSCSI in their DR datacenter, with a semi-synchronous Snapmirror relationship between the two.

The new arrays arrived on site, and we set them up separately from the production network to configure them. We quickly identified that the 2552s were running Data ONTAP 8.3RC1, which is how they were shipped from the factory. Nobody had any experience with Cluster Mode Data ONTAP, but this didn’t seem too much of a challenge, as it did not appear hugely different.

After looking at what to do next, it appeared that transitioning SAN volumes from 7-Mode to Cluster Mode Data ONTAP is not possible, so the decision was taken to downgrade the OS from 8.3RC1 to 8.2 7-Mode, to make the transition of the customer’s data, and the downtime during the switchover from old arrays to new, as easy and quick as possible.

We got there in the end, but due to the tomes of documentation we had to trawl through and tie together, I decided to document the process to assist any would-be future CDOT luddites in carrying out this task.

NOTE: This has not been tested on anything other than a FAS2552 with two controllers, and if you are in any way uncertain I would suggest contacting NetApp support for assistance. As this was a brand new array and there was no risk of data loss, we proceeded regardless. You will need a NetApp support account to access some of the documentation and downloads referenced below. This is the way we completed the downgrade; I am not saying it is the best way, and although I have many years’ experience of working with NetApp arrays, this is just a guide.

  • Downloading and updating the boot image:

We decided on 8.2.3 for our boot image; this was the latest edition of Data ONTAP with 7-Mode included at the time. If you go to http://mysupport.netapp.com/NOW/cgi-bin/software/ and select your array type, you will see the available versions for your array. There are pages of disclaimers to agree to, and documents of prerequisites and release notes for each version; these are worth reading to ensure there are no known issues with your array type. Eventually you will get to the download, which will be a .tgz file.

You will now need a system with IP connectivity to both controllers, running something like FileZilla Server to host the file via FTP. This will allow you to get the file up to the controllers. I am not going to include steps to set up your FTP server, but there are plenty of resources online for this. You could also host the file via HTTP using something like IIS if that is more convenient.

Now to pull the image onto the array. This needs doing on both controllers (nodes); we followed this document, running a variation of the command from page 143:

 system node image get -node localhost -package <location> -replace-package true

Compared to the documented command, I replaced ‘-node *’ with ‘-node localhost’ so we could download the image to each node in turn; this was just to ensure we could tackle any issues with the download. I also removed the ‘-background true’ switch, which would have run the download in the background, to give us maximum visibility of progress.

Now, our cluster had never been properly configured, but there are a bunch of checks to do at this point to ensure your node is ready for the reversion; these are all detailed in the above document and should be followed to make sure nothing is amiss. We ran through these checks prior to installing the newly downloaded image. They include things like confirming the cluster and nodes are healthy and that storage failover is disabled (more on that below).

Once happy, the image can be installed by running:

system node image update -node localhost -package file:///mroot/etc/software/<image_name>

The image name will be the name of the .tgz file you downloaded to the controller earlier (including the extension).

Once the image is installed, you can check the state of the installation with:

 system image show

This should show something like:

[Output of ‘system image show’ for one node, listing the existing 8.3RC1 image and the newly installed 8.2.3 image]

This shows the images for one controller only, but shows us the image we are reverting to is loaded into the system, and we can move on.

There are some more steps in the document to follow, ensuring the cluster services are shut down and failover is disabled before we can revert; follow these from the same document as above.
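
As a rough illustration of the kind of commands involved at this stage (a sketch only, assuming a two-node cluster like ours; the reversion steps in the document above remain the authority):

storage failover modify -node <node_name> -enabled false
cluster ha modify -configured false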

Next we would normally run ‘revert_to 8.2’ to revert the software. However, we had issues at this point because of ADP (Advanced Drive Partitioning), which marks the disks as being in a shared container. Dan Barber’s excellent article goes into the background on this. Long story short, we decided to reboot and format the array again to get around it.

  • Re-zeroing the disks and building new vol0:

We rebooted the first controller, and saw that when it came back up it was running 8.2.3 (yay) in Cluster Mode (boo). We tried zeroing the disks and building a new vol0 by interrupting the boot sequence with Ctrl+C to get to the special boot menu, and then running option 4. This was no good for us though: once built, the controller booted back into 8.2.3 Cluster Mode, so a new tactic was required.

We found a post on Krish Palamadathil’s blog which detailed how to get around this. The downloaded image contains both Cluster Mode and 7-Mode images, but boots into Cluster Mode by default after this kind of reversion. Cutting to the chase, the only thing we needed to do was get to the boot loader (Ctrl+C during reboot to abort the boot process), and then run the following commands:

 LOADER> set-defaults 
 LOADER> boot_ontap

We then saw the controller come up in 8.2.3 7-Mode, interrupted the boot sequence, and ran option 4 to zero the disks again and build a new vol0.

Happy to say that the array is now at the correct version and in a state where it can be configured. As usual, the NetApp documentation was great, even if we had to source steps from numerous different places. As this is still a very new version of Data ONTAP I would expect the documentation to get better over time; in the meantime, hopefully this guide can be of use to people.