Integrating Platform Services Controller/vCSA 6.0 with Active Directory using CLI

I am currently automating the build and configuration of a VMware vCenter environment, and the internet has been a great source of material in helping me with this, particularly William Lam and Brian Graf's websites. It seems VMware have done a great job with the changes in vSphere 6.0 of enabling automated deployments; this follows the general trend in the industry of driving automation and orchestration into everything we do in the Systems Administration world.

One thing I needed to do which I could not find any information on was to join my standalone Platform Services Controller (PSC) to an AD domain. This is easy enough in the GUI, and is documented here. It was important for me to automate this, however, so I trawled through the CLI on my PSC to figure out how to do it.

I stumbled across the following command, which joins the PSC to the AD domain of your choosing:

/usr/lib/vmware-vmafd/bin/vmafd-cli join-ad --server-name <server name> --user-name <user-name> --password <password> --machine-name <machine name> --domain-name <domain name>

Once this is completed, the PSC will need restarting to enable the change; this adds the PSC to Active Directory. The next challenge was finding a scripted method to add the identity source. Once the identity source is added, permissions can be set up as normal in vCenter using this identity source.

Again, I had to trawl through the PSC OS to find this; the script is as follows:

/usr/lib/vmidentity/tools/scripts/sso-add-native-ad-idp.sh <Native-Active-Dir-Domain-Name>

Both of these commands can be carried out through an SSH session to your PSC (or embedded PSC/VCSA server). Assuming you have Bash enabled on your PSC, you can also invoke them remotely using the PowerCLI 'Invoke-VMScript' cmdlet, as sketched below. This should help in fully automating the deployment of a vCenter environment.
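To give an idea of how this could slot into an automated build, here is a rough PowerCLI sketch. The vCenter address, VM name, credentials and domain values are placeholders for illustration only, and the mapping of values to the join-ad parameters should be checked against VMware's documentation; the sketch simply pushes both commands into the appliance with Invoke-VMScript and reboots it in between:

# Connect to the vCenter (or ESXi host) that runs the PSC VM - placeholder name/credentials
Connect-VIServer -Server "vcenter.lab.local" -User "administrator@vsphere.local" -Password "VMware1!"

$pscVM    = Get-VM -Name "psc01"
$rootUser = "root"
$rootPass = "RootPassword1!"

# Build the join-ad command from the parameters shown above - substitute your own values;
# check the VMware documentation for exactly what each parameter expects
$joinCmd = "/usr/lib/vmware-vmafd/bin/vmafd-cli join-ad " +
           "--server-name psc01.lab.local --user-name aduser --password 'AdPassword1!' " +
           "--machine-name psc01 --domain-name lab.local"
Invoke-VMScript -VM $pscVM -ScriptText $joinCmd -GuestUser $rootUser -GuestPassword $rootPass -ScriptType Bash

# Restart the PSC to complete the domain join
Restart-VMGuest -VM $pscVM -Confirm:$false

# Once the appliance is back up, add the native AD identity source
$idpCmd = "/usr/lib/vmidentity/tools/scripts/sso-add-native-ad-idp.sh lab.local"
Invoke-VMScript -VM $pscVM -ScriptText $idpCmd -GuestUser $rootUser -GuestPassword $rootPass -ScriptType Bash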

As an aside, one issue I did have, which is discussed in the VMware forums, is that I was getting the error 'Error while extracting local SSO users' when enumerating users/groups from my AD in the VMware GUI. This was fixed by creating a PTR record in DNS in my domain for the Domain Controller; it seems this is needed by the new VMware SSO at some point.

I hope this is useful to people, and hopefully VMware will document this sort of automation in the future. In the meantime, as I said above, William Lam and Brian Graf's sites are a good source of information.


NetApp SnapCenter 1.0 – a new hope…

NetApp recently released version 1.0 of a new software offering going by the name of SnapCenter. It's a long-held tradition that 80% of NetApp's releases contain the word 'snap', a nod to their ages-old storage innovation: snapshot technology, which provides efficient, speedy backups of your precious data.


So what does SnapCenter bring to the table that we did not have before? Well first we need some context…

SnapDrive is Windows/UNIX software which taps into a NetApp storage system, allowing the provisioning, backup, restoration, and administration of storage resources without having to log onto the storage system directly. This enables application owners to take control of their own backup/restore operations and therefore feel more able to manage their data. For applications or server roles which are not affected by consistency issues in backups, the backup/restore features in SnapDrive are fine. For applications which do have this concern, NetApp provide another solution.

With me so far? Good. So SnapDrive is supplemented by the SnapManager suite of products. These have been built up over a long period of time by NetApp, and integrate directly with applications like:

  • SQL Server
  • Oracle
  • VMware
  • Hyper-V
  • Sharepoint
  • Exchange
  • SAP

These applications have vastly different purposes, but have equally unique requirements in terms of backing up their data in an application consistent way. Usually creating a backup/restore strategy which produces application consistent backups requires detailed understanding of the application, and is not integrated with the features presented by the underlying storage.

The SnapManager suite of products fills this gap, delivering a simplified, storage-integrated, application consistent method of easily backing up and restoring data, and providing the features that application owners desire. Further to this, it gives the application owners a simple GUI to take ownership of their own backup and recovery, whilst ensuring nothing in the underlying storage will break.

But this panacea to the challenge of backup and recovery, and its place within the application stack, is not without fault. Many criticisms have been levelled at the SnapManager suite over the years. The main two criticisms which I believe SnapCenter addresses are:

  1. Inconsistent user interfaces – the SnapManager suite was built up over time by NetApp, and many of the products were developed by different internal teams. This meant that the resultant software has very different looks and feels as you transition from one product to another. This complicates administration for infrastructure administrators, because they end up with multiple GUIs to learn instead of a single one.
  2. Scalability issues – to be fair to NetApp, this is not just an issue with their solution; a previous workplace of mine were heavy users of IBM's Tivoli Storage Manager, which had a similar problem. As your environment grows, you may end up with tens of SQL servers, which means tens of instances of SnapManager for SQL to install, update, manage, and monitor. This could mean thousands upon thousands of reports and alerts to sift through each day, and without a solution to manage this, issues will go undiscovered for days, weeks or even months. Once you add in your Exchange environments, vCenter servers, SharePoint farms, Oracle servers etc., you may be looking at tens of thousands of backups running a day, and potentially hundreds of pieces of installed software to manage and try to keep an eye on.

So how does SnapCenter address this problem? Well, with the release of Clustered Data ONTAP (CDOT) 8.3 at the start of 2015, and the end of NetApp’s legacy 7-Mode operating system, there seems to have been a drive to revitalise their software and hardware lines, simplifying the available options, and pushing software interfaces to be web based, rather than thick GUIs.

So the value proposition with SnapCenter is a centrally managed point of reference to control your backups programmatically, with a modern web-based interface, and scalability to provide a workable solution regardless of the size of the estate being backed up. Let's look at these features, and how NetApp have delivered them:

1. Scalability

Scalability is provided by the Windows NLB and ARR (Application Request Routing, basically a reverse web proxy) features, which allow the creation of a farm of SnapCenter servers up to the maximum size allowed by Windows NLB of 32 nodes.

SnapCenter uses a SQL database as its back end; this can be either a local SQL Server Express instance (for small deployments) or a full SQL Server instance for scalable deployments.
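The Windows side of that scale-out is standard Microsoft tooling rather than anything SnapCenter-specific. As a rough illustration (the server and interface names below are made up), the NLB feature and cluster could be stood up with PowerShell along these lines, with ARR then installed and configured on top through IIS:

# Install the Network Load Balancing feature on the SnapCenter server (hypothetical names)
Install-WindowsFeature -Name NLB -IncludeManagementTools

# Create a new NLB cluster on the first node, then join a second node to it
New-NlbCluster -InterfaceName "Ethernet0" -ClusterName "SnapCenterFarm" `
    -ClusterPrimaryIP 10.0.0.50 -OperationMode Multicast

Get-NlbCluster | Add-NlbClusterNode -NewNodeName "SNAPCTR02" -NewNodeInterface "Ethernet0"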

2. Programmability

NetApp have been pretty good at including programmability in their more recent software offerings, and SnapCenter is no exception, providing a PowerShell cmdlet pack and the now ubiquitous REST API. SnapCenter is also policy-driven, which means that once you have created your backup policy sets you can apply them to new datasets you want to back up going forward; this helps to keep backups manageable as your infrastructure grows.
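I haven't dug deeply into the API surface yet, but to give a flavour of what policy-driven, programmable backups look like in practice, here is a hypothetical PowerShell sketch; the URI, port, endpoints and JSON fields are illustrative placeholders, not the documented SnapCenter API:

# Hypothetical example only - the endpoint paths and payload fields are placeholders,
# not the documented SnapCenter REST API
$snapCenter = "https://snapcenter.lab.local:8146"

# Authenticate and capture a session token (illustrative)
$cred = @{ UserName = "lab\svc-snapcenter"; Password = "Password1!" } | ConvertTo-Json
$session = Invoke-RestMethod -Uri "$snapCenter/api/auth/login" -Method Post -Body $cred -ContentType "application/json"

# Trigger a backup of a dataset using an existing policy (illustrative)
$backupRequest = @{ DatasetName = "SQL-Prod-Databases"; PolicyName = "Daily-Full" } | ConvertTo-Json
Invoke-RestMethod -Uri "$snapCenter/api/backups" -Method Post -Body $backupRequest `
    -ContentType "application/json" -Headers @{ Token = $session.Token }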

3. Interface

A web interface is a beautiful thing: accessing software from any browser on any OS makes life a lot easier for administrators, and not logging onto servers means less chance of breaking said servers. NetApp have chosen HTML5 for this interface, which does away with the pain of having to deal with the Java or Flash that plagues other web interfaces (UCS, VMware, I'm looking at you!). NetApp have raised the bar with the SnapCenter interface, producing a smart and stylish WUI not dissimilar to Microsoft's Azure interface.


Once you have installed the SnapCenter software on your Windows server, you will need to use it to deploy the Windows and SQL Server plug-ins to your SQL servers. These plug-ins replace SnapDrive and SnapManager respectively, but this deployment process promises to be quick and painless, and a reboot should not be necessary. SnapCenter uses the same licenses as SnapManager, so if this is already licensed on your storage system then you are good to go. There is a migration feature present to help you move from SnapManager to SnapCenter, although this does not support migration of databases on VMDKs at this time.

The initial release of SnapCenter only interoperates with SQL Server, and with VMware through the Virtual Storage Console (VSC), so it probably won't replace many customers' full SnapManager install bases just yet, but the delivery team are promising rollouts of more plug-ins over the coming months.

There are limitations even in the SQL backup/recovery capabilities, although these will likely not affect many customers; they are detailed in the product Release Notes. The biggest of these, from what I can see, is that SnapCenter does not presently support SQL databases on SMB volumes.

Hopefully NetApp will provide regular, functionality-enhancing updates to this product so that it delivers on its promises. It would also be good to see some enhancements over what is currently delivered by the SnapManager products. Top of the list from my perspective is allowing Exchange databases to reside on VMDK storage, as the current restriction to LUNs makes things difficult, especially where customers are not deploying iSCSI; it means the dreaded RDMs must be used in VMware, which as a VMware admin causes no end of headaches. It would also be nice to see this offered at some point as a virtual appliance, perhaps with an embedded PostgreSQL-type database similar to what VMware offer for the vCenter Server Appliance, but I imagine that is way down the line, as providing an appliance that scales well is a difficult thing.

NetApp have promised to continue to deliver the SnapManager products for the time being; this is needed because of the lack of 7-Mode support in SnapCenter. Having worked extensively with both CDOT and 7-Mode though, I think there are many compelling reasons to move to CDOT if possible, and this seems like a fair compromise. SnapCenter can be installed quickly and tested out without committing to moving all your databases over to it, so give it a try, it's the future after all!

FlexPod 101 – What is a FlexPod?

I haven’t posted for a while, I started a new job, getting out of IT support, and into the area I want to be, designing and implementing infrastructure solutions as a FlexPod consultant. I have not worked with FlexPod as a concept before, but I have worked with the integral technology stack which comprises it. So far so good, it seems like a robust solution which provides a balance between scalability, performance and cost. I have decided to do a set of blog posts going through the concept and technology behind FlexPod, hopefully highlighting what sets this apart from the competition.

FlexPod: what the hell is that?

Over the last few years, the IT industry has moved from disparate silos for storage, compute and network towards the converged (and later hyper-converged) dream. One player in this market is the FlexPod.

A collaboration between NetApp and Cisco, at a basic level a FlexPod comprises the following enterprise-class hardware:

  • Cisco Unified Computing System (UCS)
  • Cisco Nexus switching/routing
  • NetApp FAS Storage Arrays

This forms the hardware basis, and as is the industry's wont, there is a swathe of virtualisation solutions and business-critical applications which can be used on top of the hardware:

  • VMware vSphere
  • Microsoft Hyper-V
  • OpenStack
  • Citrix XenDesktop
  • SAP
  • Oracle
  • VMware Horizon View

I have been a fan of NetApp storage, and Cisco UCS compute for a while. They both offer simplicity and power in equal measure. My preference for hypervisor is ESXi but Hyper-V is becoming a more compelling solution with every release.

You throw an automation product like vRealize Automation or UCS Director on top of the stack and you have a powerful and modern private cloud solution which takes you beyond what a standard virtualised solution will deliver.

Why not just buy <enter converged/hyper-converged vendor here>?

But you can run this on any hardware, right? So what sets FlexPod apart from VCE’s Vblock, hyper-converged systems like Nutanix, SimpliVity, or just rolling your own infrastructure?

The answer is the Cisco Validated Design (CVD). This is, as the name suggests, a validated and documented build blueprint, which details proven hardware and software configurations for a growing number of solutions. This gives confidence, when implementing the solution, that it will work, and goes towards putting the 'Flex' in FlexPod.

The other advantage of FlexPod over other converged/hyper-converged solutions is that you can tweak the scales of the hardware components (compute/storage/network) independently, making the solution larger in the areas where you need a capacity boost while keeping it the same in the areas you don't. Need 100TB of storage? Just buy more shelves. Need 100 hosts? Just buy more UCS chassis and blades. This non-linear scalability, and flexibility, separates FlexPod from rival solutions.

As far as the software, and general protocol usage, goes, FlexPod is fairly agnostic. You can use FCoE, NFS, FC or iSCSI as your storage protocols, and you can use whatever hypervisor you want as long as there is a CVD for it; chances are you can find one to suit your use case.

Where can I find more information on FlexPod?

The NetApp and Cisco sites have information about what a FlexPod consists of:

http://www.netapp.com/ca/solutions/cloud/flexpod/

http://www.cisco.com/c/en/us/solutions/data-center-virtualization/flexpod/index.html

The Cisco site also has links to the CVDs; these give a good overview of what the FlexPod is about.

What’s next?

Part 2 of my FlexPod 101 series will go over the physical components of a FlexPod.

A gap in the clouds – what does the future hold?


I had the pleasure of attending the TECH.unplugged event in London this week, hosted by Enrico Signoretti. This was an independent event with a number of technology analysts from around the world providing many interesting opinions on a wide variety of subjects.

One session in particular which fired up my thinking about the future of the tech industry was from Stephen Foskett. For those who don't know, Stephen organises and hosts the Tech Field Day events in the US (techfieldday.com). As an aside, these events are worth keeping an eye out for if you want deep-dive discussions with vendors over their latest technological offerings. This talk at TECH.unplugged was one which really struck a chord with me, and over the last few days I have pulled the thoughts below together.

Stephen's talk was around the idea that Cloud Computing, everyone's favourite buzzword, will become just a part of computing. This was illustrated with numerous technologies from the development of computing through the ages which never went away, despite falling out of favour and becoming unsexy in the tech world. These include:

The mainframe – still in use today, and still very much alive. See the new IBM z13 mainframe: this was only released in January 2015, but shows that there is still development and iteration in mainframe products today. Mainframes are still used across many industries, and although the IT industry is not focussed on them, they remain a crucial component of what we deliver.

Tape backup – people would love this to go away, I am sure. Tape has been around for decades, and remains a cornerstone of backup and data retention strategies, from small to large organisations. In a time when we can fit many terabytes of information on a chip the size of a fingernail, why are we still using tape? The answer is that tape continues to offer the same two advantages it always has: it is cheap, and it is reliable. As much as we see $/GB prices fall and fall with magnetic disk, and now with SSD, tape still beats them on price.

Physical servers – sure, virtualisation is king in the data centre today, and has been for the last 5-10 years; anything left running on bare-metal tin without a hypervisor had better have a good reason for it, or it is going to get P2V'd at some point. Yet physical servers are still with us.

So how does Cloud fit in here? The industry heralds crow their message over and over again: “The data centre is dead, long live the Cloud”, and those of us still supporting physical tin are hoping and praying that the prophecy does not come true. We hear tales of the developers rallying under the DevOps banner, with Docker, Vagrant and Puppet as their drawn weapons, ready to rise up against the slow traditional infrastructure and move our business’ most critical workloads to AWS to allow them the freedom of rapid, agile development they need.

But realistically, will this spell the end of all we have built in our data centres? For all of AWS' £1bn quarterly earnings, and their 100% year-on-year growth reported this week, will this actually kill the data centre? Well, no doubt there are applications, many of them being written today, which are what VMware calls 'Cloud Native Apps' (CNA): applications designed to run at scale, built around the micro-services model, and able to scale up and down on demand; applications utilising continuous integration and delivery models to allow hundreds of code releases each day, delivering what the business needs in near-real time.

Well it seems that this is a new part of the puzzle, filling a hole which we in the data centre business did not realise was there, but one which developers leapt to utilise. We have tried to deliver, but alas we were too late. We tried to deliver solutions like vCloud Director, like OpenStack, but it seems these products were not quick enough, or too clunky, or just too damned difficult to install. So our businesses are already using cloud services, and they are seeing just how great it is. We are losing control of our IT.

But all is not lost, as far as our workloads go, there are no doubt applications perfectly suited to the cloud (for some businesses, at least). We use ServiceNow at my current workplace as our CRM software, and since moving to the cloud the application is faster, more available, and updated far more often.

There are no doubt applications that some businesses do not want to put in the cloud though, and despite arguments from both sides, the feeling that data is safer in your on-premises data center, than it is in Amazon’s or Microsoft’s, is still commonplace. Cloud advocates insist that one day all applications will be in the Cloud, I don’t think this will be the case. As Stephen illustrated in his presentation, Cloud Computing will be but one tool that the IT industry has to deliver the services end users need.

VMware's release of Photon is a good indication of the way I feel software delivery will go, with private cloud getting software and hardware offerings which mirror what is going on in public cloud. This will give those IT organisations who don't want, or are not allowed, to put their applications and data in the cloud the same tools in-house that developers are crying out for. So we should embrace tools like this, encouraging vendors to release more tools like Photon/Lightwave, which give us a secure, on-premises approximation of what our developers cry out for, and most importantly, learn to implement, use and support these tools, so that we can be part of the Cloud revolution and be ready to bring them into our toolset.

The overarching message from Stephen’s presentation though, was that Cloud is coming, and soon it will just be another part of what we do. As technology advocates, we learn new tech all the time, and the DevOps movement, Cloud, containerisation, whatever it may be, it is all just more new tech to learn, so don’t be afraid; learn it, do it, master it. We are IT people, that’s what we do.

Transparent Page Sharing – for better or worse

Transparent Page Sharing (TPS) is one of the cornerstones of memory management in the ESXi hypervisor; it is one of the many technologies VMware have developed which allow higher VM consolidation ratios on hosts, through intelligent analysis of VM memory utilisation and deduplication based on this analysis.

There are two types of TPS: intra-VM and inter-VM. The first scans the memory in use by a single VM and deduplicates common block patterns; this lowers the total host memory consumption of a VM without significantly impacting VM performance. The second type, inter-VM TPS, does the same, looking for common blocks across the memory usage of all VMs on a given host.

Historically this has been a successful reclamation technique, and has led to decent savings in host memory consumption. However, most modern operating systems seen in virtualised environments (most Linux distributions, and Windows Server 2008 onwards) now randomise and protect their memory layouts by default, so the chance of the TPS scan finding common blocks becomes less and less likely.

If CPU-integrated hardware-assisted memory virtualisation features (AMD Rapid Virtualisation Indexing (RVI) or Intel Extended Page Tables (EPT)) are in use on an ESXi host, then the hypervisor will use a 2MB page size for its TPS calculations, rather than the normal 4KB size. Attempting to deduplicate in 2MB chunks is far more resource intensive, and far less successful, than running the same process in 4KB chunks, so ESXi will not attempt to deduplicate and share large memory pages by default.

The upshot of this is that 2MB pages are scanned, and the 4KB blocks within these large pages are hashed, in preparation for inducing memory sharing should the host come under memory contention, in an effort to prevent swapping. Pre-hashing these 4KB chunks means that TPS is able to react quickly, deduplicating and sharing the pages should the need arise.

All good so far: this technique should help us to save a bit of memory, although since memory virtualisation features in modern CPUs are widespread, and larger amounts of host memory more common, the potential TPS savings should hopefully never be needed or seen.

At the back end of last year, VMware announced that they would be disabling TPS by default in ESXi, following an academic paper which showed a potential security vulnerability in TPS which, if exploited, could result in sensitive data being made available from one VM to another via memory sharing. It should be noted that the technique used to exploit this works only under highly controlled, laboratory-style conditions and requires physical access to the host in question, and that the researcher highlighting it never actually managed to glean any data using this method.

Despite the theoretical nature of the published vulnerability, VMware took the precautionary approach, and so in ESXi 5.0, 5.1, 5.5 and now 6.0, with the latest updates, inter-VM TPS is now disabled by default. What does this mean for the enterprise though, and what choices do we have?

1. Turn TPS on regardless – if you have a dedicated internal-only infrastructure, then it may be that you do not care about the risks exposed by the research. If you and your team of system administrators are the only ones with access to your ESXi servers, and the VMs within, as is common in many internal IT departments, then there are likely far easier ways to get access to sensitive data than by utilising this theoretical technique anyway.

2. Turn off TPS – if you are in a shared Service Provider style infrastructure, or in an environment requiring the highest security, then this should be a no-brainer. The integrity and security of the client data you hold on your systems should be foremost in your organisation's mind, and in the interests of precaution, and good practice, you should disable TPS on existing systems and leave it off.

These were the options presented by VMware until about a week ago, when an article was published which described a third option being introduced in the next ESXi updates:

3. Use TPS with guest VM salting – this allows selective inter-VM TPS enablement among sets of selected VMs, allowing you to limit the potential areas of vulnerability, using a combination of edits to .vmx files and advanced host settings. This may be a good middle ground if you are reliant on the benefits provided by TPS, but security policy demands that it not be left in the previous default mode. A rough PowerCLI sketch of these settings follows below.
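To give an idea of what the salting option looks like in practice, here is a rough PowerCLI sketch; the host and VM names are placeholders, and this is based on my reading of VMware's article on salting rather than something I have rolled out in production. The Mem.ShareForceSalting host setting controls the behaviour, and VMs given the same sched.mem.pshare.salt value are allowed to share pages with each other:

# Placeholder host/VM names - adjust for your environment
$esxHost = Get-VMHost -Name "esx01.lab.local"

# Mem.ShareForceSalting = 2 (the post-update default) salts every VM uniquely, so no
# inter-VM sharing happens unless VMs are explicitly given a matching salt value
Get-AdvancedSetting -Entity $esxHost -Name "Mem.ShareForceSalting" |
    Set-AdvancedSetting -Value 2 -Confirm:$false

# Give two VMs the same salt so inter-VM TPS can still operate between them
foreach ($vmName in "app01","app02") {
    New-AdvancedSetting -Entity (Get-VM -Name $vmName) -Name "sched.mem.pshare.salt" `
        -Value "app-group-1" -Confirm:$false
}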

So these are our options, and regardless of which one you choose, you need to know what the difference in your environment will be if you turn off TPS; this is going to be different for everyone. The savings currently being delivered by TPS can be calculated, and this should give you some idea of what the change in host memory utilisation will be once TPS is disabled.

The quickest way to see this for an individual host is via the Performance tab in the vSphere Client. If you look at real-time memory usage and select the 'Consumed', 'Used by VMkernel', 'Shared' and 'Shared common' counters, you will be able to see how much memory is consumed in total by your host, and how much of this is being saved through TPS:

[Screenshot: vSphere Client real-time memory performance chart]

Here we can see:

TPS %age saving = (Shared - Shared common) / (Consumed - Used by VMkernel) * 100%
                = (5743984 - 112448) / (160286356 - 3111764) * 100%
                = 5631536 / 157174592 * 100%
                = 3.58%

(All of the values above are in KB, as reported by the performance chart.)

So TPS is saving around 5.6GB, or 3.6%, of the total memory being consumed by VMs on this host. This is a marker of the efficiency of TPS.

The same figures can be taken from esxtop if you SSH to an ESXi host, run esxtop, and press ‘m’ to get to memory view.

[Screenshot: esxtop memory view showing the PSHARE and PMEM lines]

Here we are looking at the PSHARE value; we can see the saving is 5607MB (which ties up with the vSphere Client figure above), and the memory consumed by VMs can be seen under PMEM/other, in this case 153104MB. Again we can calculate the percentage saving TPS is giving us by dividing the saving by the memory consumed by VMs and multiplying by 100%:

TPS %age saving = PSHARE saving / PMEM other * 100%
                = 5607 / 153104 * 100%
                = 3.66%

So this is how we can calculate the saving for each host, but what if you have dozens, or hundreds, of hosts in your environment? Wouldn't it be great to get these stats for all of them? The easiest way to get this kind of information is usually through PowerCLI, so I put the following script together:


# Ask for connection details, then connect using these
$vcenter = Read-Host "Enter vCenter Name or IP"


# Set up our constants for logging
$datetime = get-date -uformat "%C%y%m%d-%H%M"
$OutputFile = ".\" + $datetime + "_" + $vcenter + "_TPS_Report.csv"

# Connect to vCenter
$Connection = Connect-VIServer $vcenter

$myArray = @()

forEach ($Cluster in Get-Cluster) {
foreach($esxhost in ($Cluster | Get-VMHost | Where { ($_.ConnectionState -eq "Connected") -or ($_.ConnectionState -eq "Maintenance")} | Sort Name)) {
$vmdetails = "" | select hostname,clustername,memsizegb,memshavg,memshcom,tpssaving,percenttotalmemsaved,tpsefficiencypercent
$vmdetails.hostname = $esxhost.name
$vmdetails.clustername = $cluster.name
$hostmem = Get-VMHost $esxhost | Select -exp memorytotalgb
$vmdetails.memsizegb = "{0:N0}" -f $hostmem
$vmdetails.memshavg = [math]::Round((Get-VMhost $esxhost | Get-Stat -Stat mem.shared.average -MaxSamples 1 -Realtime | Select -exp value),2)
$vmdetails.memshcom = [math]::Round((Get-VMhost $esxhost | Get-Stat -Stat mem.sharedcommon.average -MaxSamples 1 -Realtime | Select -exp value),2)
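# TPS saving is shared memory minus shared-common memory; efficiency is that saving as a
# percentage of the memory consumed by VMs (consumed minus VMkernel usage)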
$vmdetails.tpssaving = $vmdetails.memshavg-$vmdetails.memshcom
$vmdetails.percenttotalmemsaved = [math]::Round(([int]$vmdetails.tpssaving/([int]$vmdetails.memsizegb*1024*1024))*100,2)
$consumedmemvm = [math]::Round(((Get-VMhost $esxhost | Get-Stat -Stat mem.consumed.average -MaxSamples 1 -Realtime | Select -exp value)-(Get-VMhost $esxhost | Get-Stat -Stat mem.sysUsage.average -MaxSamples 1 -Realtime | Select -exp value)),2)
$vmdetails.tpsefficiencypercent = [math]::Round(([int]$vmdetails.tpssaving/$consumedmemvm)*100,2)
$myArray += $vmdetails
}
}
Disconnect-VIServer * -Confirm:$false

$myArray | Sort hostname | Export-Csv -Path $OutputFile -NoTypeInformation

This script will dump out a CSV with every host in your vCenter, and tell you the percentage of total host memory saved by TPS, and the efficiency of TPS in your environment. This should help to provide some idea of what the impacts of TPS being turned off will be.

Ultimately, your organisation's security policies should define what to do after the next ESXi updates, and how you should act in the meantime. TPS is definitely a useful feature, and does allow for higher consolidation ratios, but security vulnerabilities should not be ignored. Hopefully this post will give you an idea of how TPS is currently impacting your infrastructure.

NetApp Cluster Mode Data ONTAP (CDOT) 8.3 Reversion to 8.2 7-Mode

A project came in at work to build out a couple of new NetApp FAS2552 arrays; these were to replace old FAS2020s for a customer who was using FCP in their production datacenter and iSCSI in their DR datacenter, with a semi-synchronous SnapMirror relationship between the two.

The new arrays arrived on site, and we set them up separate from the production network to configure them. We quickly identified that the 2552s were running Data ONTAP 8.3RC1, which is how they were sent to us from the factory. Nobody had any experience with Cluster Mode Data ONTAP, but this didn't seem too much of a challenge, as it did not appear hugely different.

After looking at what to do next, it appeared that transitioning SAN volumes from 7-Mode to Cluster Mode Data ONTAP is not possible, so the decision was taken to downgrade the OS from 8.3RC1 to 8.2 7-Mode, to make the transition of the customer's data, and the downtime during switchover from old arrays to new, as easy and quick as possible.

We got there in the end, but due to the tomes of documentation we had to trawl through and tie together, I decided to document the process, to assist any would-be future CDOT luddites in carrying out this task.

NOTE: This has not been tested on anything other than a FAS2552 with two controllers, and if you are in any way uncertain I would suggest contacting NetApp support for assistance. As this was a brand new array, and there was no risk of data loss, we proceeded regardless. You will need a NetApp support account to access some of the documentation and downloads referenced below. This is the way we completed the downgrade; I am not saying it is the best way, and although I have many years' experience of working with NetApp arrays, this is just a guide.

  • Downloading and updating the boot image:

We decided on 8.2.3 for our boot image; this was the last edition of Data ONTAP with 7-Mode included. If you go to http://mysupport.netapp.com/NOW/cgi-bin/software/ and select your array type you will see the available versions for your array. There are pages of disclaimers to agree to, and documents of pre-requisites and release notes for each version; these are worth reading to ensure there are no known issues with your array type. Eventually you will get the download, which will be a .tgz file.

You will now need a system with IP connectivity to both controllers, using something like FileZilla Server to host the file via FTP. This will allow you to get the file up to the controller. I am not going to include steps to set up your FTP server, as there are plenty of resources online for this. You could also host the file over HTTP using something like IIS if that is more convenient.

Now to pull the image onto the array. This will need doing on both controllers (nodes); this document was followed, specifically the following command (based on content on page 143):

system node image get -node localhost -package <location> -replace-package true

Compared with the documented command, I replaced '-node *' with '-node localhost' so we could download the image to each node in turn; this was just to ensure we could tackle any issues with the download. I also removed the '-background true' switch, which would have run the download in the background; this was to give us maximum visibility.

Now, our cluster had never been properly configured, and there are a bunch of checks to do at this point to ensure your node is ready for the reversion; these are all detailed in the above document and should be followed to make sure nothing is amiss, including checks on the health and configuration of the cluster. We ran through these checks prior to installing the newly downloaded image.

Once happy, the image can be installed by running:

system node image update -node localhost -package file:///mroot/etc/software/<image_name>

The image name will be the name of the .tgz file you downloaded to the controller earlier (including the extension).

Once the image is installed, you can check the state of the installation with:

 system image show

This should show something like:

[Screenshot: 'system image show' output for one controller]

This output is for one controller only, but it confirms that the image we are reverting to is loaded onto the system, and we can move on.

There are some more steps to follow, ensuring the cluster is shut down and failover is disabled before we can revert; follow these from the same document as above.

Next we would normally run 'revert_to 8.2' to revert the software. However, we had issues at this point because of ADP (Advanced Drive Partitioning), which seems to mark the disks as being in a shared container. The background to this is covered here, in Dan Barber's excellent article. Long story short, we decided to reboot and format the array again to get round this.

  • Re-zeroing the disks and building new vol0:

We rebooted the first controller, and saw that when it came back up it was running 8.2.3 (yay) in Cluster Mode (boo). We tried zeroing the disks and building a new vol0 by interrupting the boot sequence with Ctrl+C to get to the special boot menu and then running option 4. This was no good for us though, because once built, the controller booted into 8.2.3 Cluster Mode again; a new tactic would be required.

We then found a post on Krish Palamadathil's blog which detailed how to get around this. The downloaded image contains both Cluster Mode and 7-Mode images, but boots into Cluster Mode by default when doing this reversion. Cutting to the chase, the only thing we needed to do was to get to the Boot Loader (Ctrl+C during reboot to abort the boot process) and then run the following commands:

 LOADER> set-defaults 
 LOADER> boot_ontap

We then saw the controller come up in 8.2.3 7-Mode, interrupted the boot sequence, and ran option 4 to zero the disks again and build a new vol0.

Happy to say that the array is now at the correct version and in a state where it can be configured. As usual, the NetApp documentation was great, even if we had to source steps from numerous different places. As this is still a very new version of Data ONTAP I would expect the documentation to improve over time; in the meantime, hopefully this guide can be of use to people.

Ballooning – the lowdown

VMware Tools does many things for us as administrators, automating much of the resource management and monitoring we need so we can have confidence in managing the sprawling clusters of a private cloud.

Below are some of the benefits VMware Tools affords us:

  • Driver optimisation
  • Power settings management
  • Time synchronisation
  • Advanced memory management
  • VM Heartbeating

VMware Tools is available for every supported guest Operating System, and with the exception of certain appliances (which often come with 3rd Party tools installed anyway), installation of VMTools should be common practice.

In this article I am going to talk about one specific memory management technique: ballooning. Ballooning is a method for reclaiming host memory in times of contention; this allows more workloads to run on a host without resorting to swapping.

The VMTools service/daemon runs as any other process does in an Operating System, and can request resources the same as any other process. When memory contention is high on a host, and VM-allocated memory would likely need to be swapped to disk (the .vswp file, see here for more information on that), the hypervisor will send a request to the VMTools process running in its guest VMs to try and reclaim memory.

Often, software running in an OS will not release memory when it is done with it; this can leave memory tied up in the OS which cannot be released to the hypervisor and returned to the pool of available host memory.

When the VMTools process receives this signal, it will request that up to 65% of the guest's allocated memory be relinquished to it, at which point it releases that memory back to the hypervisor. Because of the way we allocate memory to VMs (vRAM), a VM often has more memory available to it than it requires, and as such, most VMs will have memory which can be returned.

Ballooning is a normal part of the automated memory management which ESXi provides; ballooning activity does not usually indicate a problem, although large amounts of it could tell you that memory is overcommitted.
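If you are curious how much ballooning is actually going on in your environment, a quick way to check (a rough sketch, assuming an existing PowerCLI connection to vCenter) is to query the mem.vmmemctl.average counter, which reports the amount of memory currently reclaimed by the balloon driver for each VM:

# List powered-on VMs with a non-zero balloon (mem.vmmemctl.average is reported in KB)
Get-VM | Where-Object { $_.PowerState -eq "PoweredOn" } | ForEach-Object {
    $balloonKB = $_ | Get-Stat -Stat mem.vmmemctl.average -Realtime -MaxSamples 1 |
        Select-Object -ExpandProperty Value -First 1
    if ($balloonKB -gt 0) {
        [PSCustomObject]@{ VM = $_.Name; BalloonedMB = [math]::Round($balloonKB / 1024, 0) }
    }
}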

The only time ballooning can cause a problem is when the balloon driver reclaims memory pages the guest OS actually needs; in this case it can lead to swapping inside the guest, which could cause performance degradation.

tl;dr – ballooning is a normal part of vSphere’s memory management, assisting in pushing up consolidation ratios. Don’t worry about it, but it is good to be aware of what it does, and how it works.