vSphere PowerCLI 6.3 Release 1 – in the wild…

Yesterday VMware released PowerCLI 6.3 Release 1, this follows yesterday’s fairly exciting release of new products across the VMware portfolio including:

  • vSphere 6.0 Update 2
  • vRealize Automation 7.0.1
  • vCloud Director 8.0.1

While I am not rushing to update to the new version of vCenter or ESXi in production environments, upgrading to this latest version of PowerCLI is far less risky, so I immediately upgraded and checked out the new features.

The latest release adds the following new support:

  • Support for Windows 10 and PowerShell 5.0 – as a Windows 10 user (for my personal laptop and home PC at least), this is a welcome addition. Windows Server 2016 is just around the corner as well, so this should ensure that PowerCLI 6.3R1 works here too. Not seen any problems with running the previous version of PowerCLI on my Windows 10 machines, but at least this is officially tested and supported now anyway
  • Support for vCloud Director 8.0 – VMware are driving vCD forward again, so if you are using the latest versions, and use PowerCLI to help make your life easier (and if you’re not, then why not?), this will be a welcome addition
  • Support for vRealize Operations Manager 6.2 – there are still only 12 cmdlets available in the VMware.VimAutomation.vROps module, but this bumps up support for the latest version anyway

And adds the following new features:

  • Added Content Library support – I haven’t really got into the whole Content Library thing just yet, but this feature introduced in vSphere 6.0, and was previously only automatable through the new vSphere REST API. This release of PowerCLI includes cmdlets to let you work with the Content Library, will probably do a follow up post on configuring the content library at a later date
  • Get-EsxCli functionality updated – for those that don’t know, Get-EsxCli lets you run esxcli commands via PowerShell on a target host. This is useful for certain things which are not really possible through the standard PowerCLI host management cmdlets. This release brings in advanced functionality in this area
  • Get-VM command – this command has been streamlined to more quickly return results, which should help in larger environments

So all in all, some minor improvements, some new features, and some updates to support for newer VMware products. A solid release which will keep PowerCLI relevant as a tool in a vSphere admin’s arsenal. If you’re not already using PowerCLI, then get on the bandwagon, there are some great books and videos out there, and a fantastic community to help you along.

Getting started with vRealize APIs

I have been doing automation with VMware products for a while, this has been mostly using PowerCLI, which I have blogged about in the past. In the last few weeks I have started to work with configuring and working with vRealize Orchestrator and vRealize Automation, and the hands off automation in these products is carried out using their respective REST APIs.

REST is something I’ve pretty much had zero experience of, so I have had to pick this up from nothing and figure out how to work in this fashion. My plan is to do a series of blog posts around how to get started with working with these interfaces, configuring and working with these products, and with tips on where to start. I’ve had some great support from my fellow automation people at work, but think I have a relatively good grip on how to do things now.

REST stands for ‘Representational State Transfer’ and uses the HTTP protocol as its basic interface, with this being a standard way to do things now, and with ports 80 and 443 (and the security around using them) generally being well understood, this basically means you should not have any issues with applications requiring complex numbers of ports being open to your application servers. REST is becoming somewhat ubiquitous now, with seemingly every new of hardware and software being released coming with its own REST interface.

This is great for people wanting to automate because we do not need any special software to do REST calls, and we can access the API from pretty much any OS and from anywhere with HTTP/HTTPS access to the endpoint. A REST call basically comprises of the following components:

  • URI (Uniform Resource Indicator) – basically a URL which exposes a function of the API, if you send a correctly crafted message to this address then you will get a standard HTTP response which should indicate whether your call was successful.
  • Body – this is also known as the payload. This contains instructions on what we want to do, and will usually (in my experience) be in either JSON (JavaScript Object Notation), or XML (eXtensible Markup Language). These are both markup languages which allow you to define the parameters for what you want to do. If you are sending file type data then the body may also be in multipart MIME format.
  • Headers – these carry information about the message you are sending such as data type, authentication information, desired response type, and are carried as a hash table of names and values.
  • Method – REST uses standard HTTP methods and which method you use will depend on the API and function you are accessing. Primarily this will probably be one of the following:
    • GET – this makes no changes, and is used for retrieving information. When you access a web page, the HTTP request uses the GET method, and by using this method with an API, you can generally expect to be returned an XML/JSON response with information about how a component is configured.
    • POST – this is used for sending information to an API to be processed. Again, you should expect a return code for this, and even if this is an error, there should (depending on the call, and the API) be something meaningful telling you what went wrong
    • PUT – often interchangeable with POST, this is generally used for replacing configuration in its entirety. Again you can reasonably respect some feedback when using this method.
    • DELETE – removes configuration from a specific element or component.

The exact way in which the call is sent depends on the exact API being used, and this is where the API being well documented is of crucial importance, because if documentation is lacking, or just wrong (as is the case much more than it should be), then getting the call to be successful can be challenging.

To get started you will need a tool which can craft calls for you, and show the response received. When starting out, it is easiest for this to be a graphical tool which gives instant and meaningful feedback on the response, although this is not great for doing things in an automated and programmable fashion, this does make things simpler, and makes the learning experience easier.

If you use Chrome or Firefox then you should be able to easily find some REST clients, and it may well be worth trying a few different ones until you find one which works best for you. Postman was recommended to me, and this has a nice graphical UI which will do code highlighting on the response, and will help you to build headers and the like.

Ultimately, if you are looking to automate, then you will be using a built in way of sending REST calls from your chosen automation language. My experience of doing this is either from BASH scripts (I know, I know) where I use curl, or through PowerShell, where you have the Invoke-RestMethod or Invoke-WebRequest cmdlets.

When using these kind of tools (or indeed, doing any kind of programmatic REST call), you will need to form the header yourself. As I said above, this is basically a hash table of key-value pairs.

So let’s do an example REST call. For the example below, I will be using the vRealize Orchestrator 7.0 API. This is a pretty new release, but it can be downloaded from the VMware website, and is deployed from OVA, so should be quick to download and spin up. The new version has two distinct APIs: one for the new Orchestrator Control Center, which is used to configure the appliance, and one for the vRO application itself, used for orchestration of a vSphere (and beyond!) environment. I will show this using both PowerShell code, and the Postman GUI.

vRealize Orchestrator has a fairly straight forward API, you can access the documentation by opening your browser and going to ‘https://<vro_ip&gt;:8281/vco/api/docs’, you will be presented with this screen:

1

From here we can explore any area exposed by the API, for the sake of this example we’re going to do something simple, we are going to return a list of the available workflows in the vRO system. So select ‘Workflow Service’, and click ‘GET /workflows’, and we can see a bit of information about the REST call to list workflows. This is what it shows:

1

This being a ‘GET’ call, we don’t see a lot here, but we will run the call anyway, and see what we get back, in later articles we will go through changing configuration. First we will make the call using PowerShell, the script is as follows, this is pretty simple:

# Ignore SSL certificates
# NOTE: This should not really be required, if you are using
# proper certificates, and have working DNS
Add-type @"
using System.Net;
using System.Security.Cryptography.X509Certificates;
public class TrustAllCertsPolicy : ICertificatePolicy {
public bool CheckValidationResult(
ServicePoint srvPoint, X509Certificate certificate,
WebRequest request, int certificateProblem) {
return true;
}
}
"@
[System.Net.ServicePointManager]::CertificatePolicy = New-Object TrustAllCertsPolicy
# Define our credentials
$user_name = "vcoadmin";
$password = "vcoadmin";
$vro_server_name = "192.168.1.201"
# Convert the credentials to a format we can use in the
# header.
$auth = $user_name+":"+$password;
$Encoded = [System.Text.Encoding]::UTF8.GetBytes($auth);
$EncodedPassword = [System.Convert]::ToBase64String($Encoded)
$header = @{"Authorization"="Basic $($EncodedPassword)"};
# Define the URI for our REST call
$vro_getplugin_uri = "https://"+$vro_server_name+":8281/vco/api/plugins"
# Run the call and store the result in a variable
$get_plugins = Invoke-WebRequest -Uri $vro_getplugin_uri -Method Get -Headers $header -ContentType "application/json"

This returns the result of the call to a variable, unfortunately, although this is in JSON format, it will be horribly formatted if you try to output it:

1.png

If we run:

$get_plugins.content | ConvertFrom-Json | ConvertTo-Json

Then the built in JSON formatting will reformat it to make it readable:

1

This gives us a readable list that we can refer to, filter on, and do all the cool stuff that PowerShell makes simple.

We will now look at doing the same thing with Postman, so first you will need to install the Postman plugin from the Chrome App Store, then open it, you will see this:

1

We know our credentials and URI, so we’re just going to go ahead and step through making the request. First, enter the URI in the box:

1

Now we need to add our authentication method, so change ‘No Auth’ to ‘Basic Auth’, enter the credentials for the ‘vcoadmin’ user, and click ‘Update request’:

1

You will notice that ‘Headers (0)’ changed to ‘Headers(1)’, if you click on that you can see the header for the request, with the encoded credential:

1

Now click ‘Send’ and our REST call will be submitted, the same JSON we got earlier will be nicely formatted and displayed in the lower pane:

1

And that’s a REST call completed, with the result returned, and we know how to do it using PowerShell, to make programmatic calls, and with Postman, to explore and understand APIs.

PowerCLI – where to start

I began using PowerShell around 18 months ago while working for a small UK based Managed Service provider. Prior to this, my coding/scripting experience consisted of an A-Level in Computing, which introduced me to Visual Basic 6.0 and databases, a void of around 7 years, and then some sysadmin VBScript and batch file type goodness for a few years.

Screen Shot 2015-11-02 at 21.34.14

Until I started at said company, I had only been exposed to systems running Windows Server 2003, and with a look to security über alles, no access to PowerShell, or any other exciting languages was available, so VBScript became our automation tool of choice.

I have posted before about good resources to use to learn PowerShell, this is more a rundown of how I learned, and the joy and knowledge it gave me to do this.

My first taste of PowerShell was working with Exchange 2010 servers, doing stuff like this to report on mailbox items over a certain age.

Get-Mailbox "username" | New-MailboxSearch -Name search123 -SearchQuery "Received:<01/01/2014" -estimateonly

Were it not for the necessity to use PowerShell to do anything remotely useful in Exchange 2010, I would have been happy to continue to use batch files and VBScript to automate some of the things, I was confident in using these tools, and could achieve time savings, albeit fairly slowly. But PowerShell I must, so PowerShell I did.

Around this time, I became more keen on working with infrastructure, than applications, and got transferred to a role solely looking after our fairly sizeable Cisco UCS and VMware estate. I had plenty of years of experience of VMware, and none of Cisco UCS, but was excited by the new challenge.

I was quickly steered by the senior engineers, towards Cisco PowerTool, and VMware’s PowerCLI, to help to automate some of the administrative, and reporting type tasks I would soon be inundated with, so I picked them up and learned as I went.

I started small, and Google was my friend. Scripting small tasks to save incrementally larger amounts of time. Stuff like this:


$podcsv = import-csv .\UCS_Pods.csv
$credcsv = import-csv .\UCS_Credentials.csv
$ucsuser = $credcsv.username
$ucspasswd = $credcsv.password
$secpasswd = convertto-securestring $ucspasswd -asplaintext -force
$ucscreds = new-object system.management.automation.pscredential ($ucsuser,$secpasswd)
$datetime = get-date -uformat “%C%y%m%d-%H%M”
foreach($pod in $podcsv)
{
$podname=$pod.name
$podip=$pod.ip
connect-ucs -credential $ucscreds $podip
get-ucsfault | select ucs,id,lasttransition,descr,ack,severity | export-csv -path .\$datetime-$podname-errors.csv
disconnect-ucs
}

To dump out the alerts we had in multiple UCS systems, to CSV files. This would save 20-30 minutes a day, nothing major, but clicking buttons is boring, and I can always find better things to do with my time.

On the VMware side of things, I started really small, with stuff like this which would tell you the version of VMTools on all of your virtual machines:


# Ask for connection details, then connect using these
$vcenter = Read-Host "Enter vCenter Name or IP"
$username = Read-Host "Enter your username"
$password = Read-Host "Enter your password"
# Set up our constants for logging
$datetime = get-date -uformat "%C%y%m%d-%H%M"
$outfilepsp = $(".\" + $datetime + "_" + $vcenter + "_PSPList_Log.txt")
$outfilerdm = $(".\" + $datetime + "_" + $vcenter + "_RDMList_Log.txt")
$OutputFile = ".\" + $datetime + "_" + $vcenter + "_VMTools_Report.txt"
# Connect to vCenter
$Connection = Connect-VIServer $vcenter #-User $username -Password $password
foreach($Cluster in Get-Cluster) {
foreach($esxhost in ($Cluster | Get-VMHost | Where { ($_.ConnectionState -eq "Connected") -or ($_.ConnectionState -eq "Maintenance")} | Sort Name)) {
Get-Cluster | Get-VMhost $esxhost | get-vm | % { get-view $_.id } | select Name, @{ Name="ToolsVersion"; Expression={$_.config.tools.toolsVersion}}, @{ Name="ToolStatus"; Expression={$_.Guest.ToolsVersionStatus}}, @{Name="Host";Expression={$esxhost}}, @{Name="Cluster";Expression={$cluster.name}} | Format-Table | Out-File -FilePath $OutputFile -Append
}
}
Disconnect-VIServer * -Confirm:$false

This is a real time saver, and great for getting quick figures out of your environment. As I wrote these scripts, I learned more and more what I could do, picking up ways of doing different things here and there: for/next loops, do/while loops, arrays. As I picked up these concepts again, concepts I had learned years earlier and not used to great effect, my scripts became more complex, and delivered more value in the output they gave, and the time saved. Scripts like this which reports on any datastores over 90% utilisation, these soon became a part of our daily reporting regime:


$datetime = get-date -uformat "%C%y%m%d-%H%M"
$vcentercsv = import-csv .\VCenter_Servers.csv
# Configure connection settings using Read Only account
$credcsv = import-csv .\VMware_Credentials.csv
$vmuser = $credcsv.username
$vmpasswd = $credcsv.password
$secpasswd = convertto-securestring $vmpasswd -asplaintext -force
$vmcreds = new-object system.management.automation.pscredential ($vmuser,$secpasswd)
$report = @()
foreach($vcenter in $vcentercsv)
{
$vcentername=$vcenter.name
connect-viserver $vcenter.ip -credential $vmcreds
foreach ($datastore in (get-datastore | where {$_.name -notlike "*local*" -and [math]::Round(100-($_.freespacegb/$_.capacitygb)*100) -gt 90}))
{
$row = '' | select Name,FreeSpaceGB,CapacityGB,vCenter,PercentUsed
$row.Name = $datastore.name
$row.FreeSpaceGB = $datastore.freespacegb
$row.CapacityGB = $datastore.capacitygb
$row.vCenter = $vcenter.name
$row.PercentUsed = [math]::Round(100-($datastore.freespacegb/$datastore.capacitygb)*100)
$report += $row
}
Disconnect-VIServer * -Confirm:$false
}
$report | Sort PercentUsed | export-csv -path .\$datetime-datastore-overuse.csv

My knowledge of how to do things, and confidence in what I was doing grew rapidly, and the old thing of ‘the more I know, the more I realise I don’t know’ came to pass. I am still learning at a rapid rate how better to put these things together, and new cmdlets, new modules, new ways to do things. It’s a fun journey though, one which leaves you with extremely useful and admired skills, and one which will continue to develop you as an IT technician throughout your career.

I am now doing the biggest PowerShell datacenter automation project I have ever done, it is around 5000 lines now, and growing every day. I feel like anything can be achieved with PowerShell, and the various modules released by vendors, and finding ways of solving the constant puzzles which hit me in the face is exciting and rewarding in equal measure.

Everywhere you look in IT now, it is automation and DevOps. It has been said many times that IT engineers who do not learn some form of automation are going to be automated out of a job, and to some extent I agree with this. The advent of software defined storage, networking, everything, shows that automation, and policy driven configuration, is really changing the world of IT infrastructure. If you’re in IT then you probably got in because you love technology, well get out there and learn new skills, whatever those may be, you will enjoy it more than you think.

Integrating Platform Services Controller/vCSA 6.0 with Active Directory using CLI

I am currently automating the build and configuration of a VMware vCenter environment, and the internet has been a great source of material in helping me with this, particularly William Lam and Brian Graf‘s websites. It seems VMware have done a great job with the changes in vSphere 6.0 of enabling automated deployments, this follows the general trends of the industry in driving automation and orchestration with everything we do in the Systems Administration world.

One thing I needed to do which I could not find any information on, was to join my standalone Platform Services Controller (PSC) to an AD domain, this is easy enough in the GUI, and is documented here. It was important for me to automate this however, so I trawled through the CLI on my PSC to figure out how to do this.

I stumbled across the following command which joins you to the AD domain of your choosing.

/usr/lib/vmware-vmafd/bin/vmafd-cli join-ad --server-name <server name> --user-name <user-name> --password <password> --machine-name <machine name> --domain-name <domain name>

Once this is completed the PSC will need restarting, to enable the change. This will add the PSC to Active Directory. The next challenge was finding a scripted method to add the identity source. Once the identity source is added, permissions can be set up as normal in vCenter using this identity source.

Again, I had to trawl through the PSC OS to find this, the script is as follows:

/usr/lib/vmidentity/tools/scripts/sso-add-native-ad-idp.sh <Native-Active-Dir-Domain-Name>

Both of these can be carried out through an SSH session to your PSC (or embedded PSC/VCSA server). Assuming you have BASH enabled on your PSC, you can invoke this remotely using the PowerCLI ‘Invoke-VMScript’ cmdlet. This should help in the process of fully automating the process of deploying a vCenter environment.

As an aside, one issue I did have, which is discussed in the VMware forums is that I was getting the error ‘Error while extracting local SSO users’ when enumerating users/groups from my AD in the VMware GUI, this was fixed by creating a PTR record in DNS in my domain for the Domain Controller, it seems this is needed by the new VMware SSO at some point.

I hope this is useful to people, and hopefully VMware will document this sort of automation in the future, in the meantime, as I said above, William Lam and Brian Graf’s sites are a good source of information.

Transparent Page Sharing – for better or worse

Transparent Page Sharing (TPS) is one of the cornerstones in memory management in the ESXi hypervisor, this is one of the many technologies VMware have developed, which allows higher VM consolidation ratios on hosts through intelligent analysis of VM memory utilisation, and deduplication based on this analysis.

There are two types of TPS, intra-VM and inter-VM. The first scans the memory in use by a single VM and deduplicates common block patterns, this will lower the total host memory consumption of a VM, without significantly impacting VM performance. The second type; inter-VM TPS, does the same process, looking for common blocks across memory usage for all VMs on a given host.

Historically this has been a successful reclamation technique, and has led to decent savings in host memory consumption, however, most modern Operating Systems seen in virtualised environments (most Linux distributions, Windows Server 2008 onwards) now use memory encryption by default so the chance of the TPS daemon finding common blocks becomes less and less likely.

If CPU integrated hardware-assisted memory virtualisation features (AMD Rapid Virtualisation Indexing (RVI) or Intel Extended Page Tables (EPT)) are utilised for an ESXi host, then the hypervisor will use 2MB block size for its TPS calculations, rather than the normal 4KB block size. Attempting to deduplicate in 2MB chunks is far more resource intensive, and far less successful, than running the same process in 4KB chunks, thus ESXi will not attempt to deduplicate and share large memory pages by default.

The upshot of this is that 2MB pages are scanned, and 4KB blocks within these large pages  hashed in preparation for inducing memory sharing should the host come under memory contention, in an effort to prevent sharing. Pre-hashing these 4KB chunks will mean that TPS is able to quickly react, deduplicating and sharing the pages should the need arise.

All good so far, this technique should help us to save a bit of memory, although since memory virtualisation features in modern CPUs are widespread, and larger amounts of host memory more common, the potential TPS savings should hopefully never be needed or seen.

At the back end of last year, VMware announced that they would be disabling TPS by default in ESXi following an academic paper which showed a potential security vulnerability in TPS which, if exploited, could result in sensitive data being made available from one VM to another utilising memory sharing. It should be noted that the technique used to exploit this is in a highly controlled laboratory style condition, and requires physical access to the host in question, and that the researcher highlighting it never actually managed to glean any data in this method.

Despite the theoretical nature of the published vulnerability, VMware took the precautionary approach, and so in ESXi 5.0, 5.1, 5.5 and now 6.0, with the latest updates, TPS is now disabled by default. What does this mean for the enterprise though, and what choices do we have?

1.Turn TPS on regardless – if you have a dedicated internal-only infrastructure, then it may be that you do not care about the risks exposed by the research. If you and your team of system administrators are the only ones with access to your ESXi servers, and the VMs within, as is common in many internal IT departments, then there are likely far easier ways to get access to sensitive data than by utilising this theoretical technique anyway

2.Turn off TPS – if you are in a shared Service Provider style infrastructure, or in an environment requiring the highest security, then this should be a no-brainer. The integrity and security of the client data you have on your systems should be foremost in your organisations mind, and in the interests of precaution, and good practice, you should disable TPS on existing systems and leave it off

This were the options presented by VMware until about a week ago, when an article was published which described a third option being introduced in the next ESXi updates:

3. Use TPS with guest VM salting – this allows selective inter-VM TPS enablement among sets of selected VMs, allowing you to limit the potential areas of vulnerability, using a combination of edits to .vmx files, and advanced host settings. This may be a good middle ground if you are reliant on the benefits provided by TPS, but security policy demands that it not be in the previous default mode

So these are our options, and regardless of which one you choose, you need to know what the difference in your environment will be if you turn off TPS, this is going to be different for everyone. Your current savings being delivered by TPS can be calculated, and this should give you some idea of what the change in host memory utilisation will be following the disabling of TPS.

The quickest way to see this for an individual host is via the Performance tab in vSphere Client, if you look at real time memory usage, and select the ‘Active’, ‘Shared’ and ‘Shared Common’ counters then you will be able to see how much memory is consumed in total by your host, and how much of this is being saved through TPS:

TPS_vSphere_Client

Here we can see:

TPS %age saving = (Shared – Shared common) / (Consumed – Used by VMkernel) * 100% = (5743984 – 112448) / (160286356 – 3111764) * 100% = 5631536 / 157174592 * 100% = 3.58%

So TPS is saving around 5.6GB or 3.6% of total memory on the host being consumed by VMs. This is a marker of the efficiency of TPS.

The same figures can be taken from esxtop if you SSH to an ESXi host, run esxtop, and press ‘m’ to get to memory view.

TPS_esxtop

Here we are looking at the PSHARE value, we can see the saving is 5607MB (ties up with above from the vSphere Client), and the memory consumed by VMs can be seen under PMEM/other, in this case 153104MB. Again we can calculate the percentage saving TPS is giving us by dividing the saving by the active memory and multiplying by 100%:

TPS %age saving = PSHARE saving / PMEM other * 100% = 5607 / 153104 * 100% = 3.66%

So this is how we can calculate the saving for each host, but what if you have dozens, or hundreds of hosts in your environment, wouldn’t it be great to get these stats for all your hosts? Well, the easiest way to get this kind of information is usually through PowerCLI so I put the following script together:


# Ask for connection details, then connect using these
$vcenter = Read-Host "Enter vCenter Name or IP"


# Set up our constants for logging
$datetime = get-date -uformat "%C%y%m%d-%H%M"
$OutputFile = ".\" + $datetime + "_" + $vcenter + "_TPS_Report.csv"

# Connect to vCenter
$Connection = Connect-VIServer $vcenter

$myArray = @()

forEach ($Cluster in Get-Cluster) {
foreach($esxhost in ($Cluster | Get-VMHost | Where { ($_.ConnectionState -eq "Connected") -or ($_.ConnectionState -eq "Maintenance")} | Sort Name)) {
$vmdetails = "" | select hostname,clustername,memsizegb,memshavg,memshcom,tpssaving,percenttotalmemsaved,tpsefficiencypercent
$vmdetails.hostname = $esxhost.name
$vmdetails.clustername = $cluster.name
$hostmem = Get-VMHost $esxhost | Select -exp memorytotalgb
$vmdetails.memsizegb = "{0:N0}" -f $hostmem
$vmdetails.memshavg = [math]::Round((Get-VMhost $esxhost | Get-Stat -Stat mem.shared.average -MaxSamples 1 -Realtime | Select -exp value),2)
$vmdetails.memshcom = [math]::Round((Get-VMhost $esxhost | Get-Stat -Stat mem.sharedcommon.average -MaxSamples 1 -Realtime | Select -exp value),2)
$vmdetails.tpssaving = $vmdetails.memshavg-$vmdetails.memshcom
$vmdetails.percenttotalmemsaved = [math]::Round(([int]$vmdetails.tpssaving/([int]$vmdetails.memsizegb*1024*1024))*100,2)
$consumedmemvm = [math]::Round(((Get-VMhost $esxhost | Get-Stat -Stat mem.consumed.average -MaxSamples 1 -Realtime | Select -exp value)-(Get-VMhost $esxhost | Get-Stat -Stat mem.sysUsage.average -MaxSamples 1 -Realtime | Select -exp value)),2)
$vmdetails.tpsefficiencypercent = [math]::Round(([int]$vmdetails.tpssaving/$consumedmemvm)*100,2)
$myArray += $vmdetails
}
}
Disconnect-VIServer * -Confirm:$false

$myArray | Sort Name | Export-Csv -Path $Outputfile

This script will dump out a CSV with every host in your vCenter, and tell you the percentage of total host memory saved by TPS, and the efficiency of TPS in your environment. This should help to provide some idea of what the impacts of TPS being turned off will be.

Ultimately, your organisation’s security policies should define what to do after the next ESXi updates, and how you should act in the meantime, TPS is definitely a useful feature, and does allow for higher consolidation ratios, but security vulnerabilities should not be ignored. Hopefully this post will give you an idea of how TPS is currently impacting your infrastructure.

Ballooning – the lowdown

VMware Tools does many things for us as administrators, automating much of the resource management and monitoring we need so we can have confidence in managing the sprawling clusters of a private cloud.

Below are some of the benefits VMware Tools affords us:

  • Driver optimisation
  • Power settings management
  • Time synchronisation
  • Advanced memory management
  • VM Heartbeating

VMware Tools is available for every supported guest Operating System, and with the exception of certain appliances (which often come with 3rd Party tools installed anyway), installation of VMTools should be common practice.

In this article I am going to talk about one specific memory management technique: ballooning. Ballooning is a method for reclaiming host memory in times of contention, this allows for more workloads to run on a host without resorting to using swapping.

The VMTools service/daemon runs as any other process does in an Operating System, and can request resources the same as any other processes. When memory contention is high on a host, and VM allocated memory would likely need to be swapped to disk (the .vswp file, see here for more information on that), the hypervisor will send a request to the VMTools process running on its guest VMs to try and reclaim memory.

Often, software running in an OS will not release memory when it is done with it, this can lead to memory being tied up in the OS, and can therefore not be released to the hypervisor to return to the pool of available host memory.

When the VMTools process receives the signal, it will request up to 65% of the host memory be relinquished to it, at which time it will release the memory back to the hypervisor. Because of the way we allocate memory to VMs (vRAM), the VM often has more memory available to it than it requires and as such, most VMs will have memory which can be returned.

Ballooning is a normal part of the automated memory management processes which ESXi provides, high amounts of ballooning do not usually indicate a problem, although large amounts of ballooning activity could tell you that memory is overcommitted.

The only time ballooning can cause a problem is when the ballooning driver reclaims memory pages required by the guest OS, in this case it can lead to swapping which could lead to performance degradation.

tl;dr – ballooning is a normal part of vSphere’s memory management, assisting in pushing up consolidation ratios. Don’t worry about it, but it is good to be aware of what it does, and how it works.

Storage I/O Control – what to expect

Storage I/O Control, or SIOC, was introduced into vSphere back in vSphere 4.1, it provides a way for vSphere to combat what is known as the ‘noisy neighbour’ syndrome. This describes the situation where multiple VMs reside on a single datastore, and one or more of these VMs take more than their fair share of bandwidth to the datastore. This could be happening because a VM decides to misbehave, because of poor choices in VM placement, or because workloads have changed.

The reigning principle behind SIOC is one of fairness, allowing all VMs a chance to read and write without being swamped by one or more ‘greedy’ VMs. This is something which, in the past, would have been controlled by disk shares, and indeed this method can still be used to prioritise certain workloads on a datastore over others. The advantage with SIOC is that, other than the couple of configurable settings, described below, no manual tinkering is really required.

Options available for Storage I/O Control
Options available for Storage I/O Control

There are only two settings to pick for SIOC:

1) SIOC Enabled/Disabled – either turn SIOC on, or off, at the datastore level. More on considerations for this further down

2) Congestion Threshold – this is the trigger point at which SIOC will kick in and start doing its thing, throttling I/O to the datastore. This can be configured with one of two types of value:

a) Manual – this is set in milliseconds and this defaults at 30ms, but is variable depending on your storage. VMware have tables on how to calculate this in their SIOC best practice guide, but the default should be fine for most situations. If in doubt then your storage provider should be able to give guidance on the correct value to choose.

b) Percentage of peak throughput – this is only available through the vSphere Web Client, and was added in vSphere 5.1, this takes the guess work out of setting the threshold, replacing it with an automated method for vSphere to analyse the datastore I/O capabilities and use this to determine the peak throughput.

My experience of using SIOC is described in the following paragraphs, improvements were seen, and no negative performance experienced (as expected), although some unexpected results were received.

Repeated latency warnings similar to the following from multiple hosts were seen, for multiple datastores across different storage systems:

Device naa.5000c5000b36354b performance has deteriorated. I/O latency increased from average value of 1832 microseconds to 19403 microseconds

These warnings report the latency time in microseconds, so in the above example, the latency is going from 1.8ms to 19ms, still a workable latency, but the rise is flagged due to the large increase (in this case by a factor of ten). The results seen in the logs were much worse than this though, sometimes latency was rising to as much as 20 seconds, this was happening mostly in the middle of the night

After checking out the storage configuration, it was identified that Storage I/O Control was turned off across the board. This is set to disabled by default for all datastores and as such, had been left as was. Turning SIOC on seemed like a sensible way forward so the decision was taken to proceed in turning it on for some of the worst affected datastores.

After turning on SIOC on a handful of datastores, a good reduction in the number of I/O latency doublings being reported in the ESXi logs was seen. Unfortunately a new message began to flag in the host events logs:

Non-VI workload detected on the datastore

This was repeatedly seen against the LUNs for which SIOC had been enabled, VMware have a knowledge base article for this which describes the issue. In this case, the problem stemmed from the fact that the storage backend providing the LUNs had a single disk pool (or mDisk Group, as this was presented by an IBM SVC) which was shared with unmanaged RDMs, and other storage presented outside the VMware environment.

The impact of this is that, whilst VMware plays nicely, throttling I/O access when threshold congestion is reached, other workloads such as non-SIOC datastores, RDMs, or other clients of the storage group, will not be so fair in their usage of the available bandwidth. This is due to the spindles presented being shared, one solution to this would be to present dedicated disk groups to VMware workloads, ensuring that all datastore carved out of these disks have SIOC turned on.

We use EMC VNX, and IBM SVC as our storage of choice, recommendations from both these vendors is to turn SIOC on for all datastores, and to leave it on. I can only imagine that the reason this is still not a default is because it is not suitable for every storage type. As with all these things, checking storage vendor documentation is probably the best option, but SIOC should provide benefit in most use cases, although as described above, you may see some unexpected results. It is worth noting that this feature is Enterprise Plus only, so anyone running a less feature packed version of vSphere will not be able to take advantage of this feature.

vNUMA CPU Alignment – doing it right

As I’ve said before, the majority of our VMware environment is running on Cisco UCS blade servers, and the majority of these are running dual hex-core CPUs. With the broad spectrum of Operating Systems and applications running across our many hundreds of VMs, there are inevitably many, many VMs with multiple vCPUs.

This shows the CPU Ready/Usage stats before and after re-aligning vCPU configuration from single core-multiple sockets, to multiple core-single socket

NUMA, in a nutshell, utilises the host CPU’s local memory bus to allow faster memory access time for workloads on that specific CPU. This is particularly important for latency sensitive workloads. vNUMA is VMware’s implementation of utilising NUMA to reduce memory latency for VMs running on ESXi.

When looking at CPU performance issues with a guest VM, no host level contention for CPU resources existed. Through the VM performance graphs in the vSphere client, CPU Ready times could be seen to be spiking often, this was on a VM with multiple CPUs set to 2 x socket and 8 x cores (16 vCPU).

This shows the CPU Ready/Usage stats before and after re-aligning vCPU configuration from single core-multiple sockets, to multiple core-single socket; a marked difference

 

This problem is fairly well documented, and is detailed in VMware KB 1026063; when utilising the NUMA features of VMware, it is important to configure the vCPU layouts for your VMs to align with the physical characteristics of your hosts if possible; this will help to guarantee that the VM guest can be vMotioned to other hosts in the cluster, with different physical CPU configurations. A better solution, where identical CPU configurations can not be guaranteed across all your hosts, is to define all vCPUs as x sockets with a single core.

In this case our physical CPU architecture is 2 sockets x 6 cores, and the VM was configured with 1 socket x 8 cores. This prevents the hypervisor and guest OS from completely utilising either one, or both sockets in our physical host, and is therefore missing out on the speed benefits which NUMA can bring. And can be exhibited as high or consistently high ready States for this VM guest.

There is a great VMware blog article about the performance implications of vNUMA design selection here, which echoes the VMware Best Practices guide, stating that the best way to approach this is to set your vCPU configuration ‘flat and wide’. this means that if your VM requires 8 cores, then configure your vCPU with 8 sockets with 1 core each, rather than 2 quad-core or 4 dual-core sockets.

This allows the vNUMA technology to balance the load as it best sees fit and should prevent CPU Ready spike issues. There are of course edge cases, where specific software licensing may force you to use as few sockets as possible, in which case alignment to physical host CPU should always be attempted, in the case above, the VM could have been configured to 1 socket x 6 cores, or 2 sockets x 6 cores. Be aware, that stepping outside of the ‘flat and wide’ model will prevent vNUMA from doing its job, and will bow to your judgement of vCPU configuration; this means you had better have got it right!