Unable to see identity providers in vRA 7.0.x

I have seen a weird issue which seems to have appeared in vRA 7.0.1, relating to roles and authorization. In my environment I have delegated the Tenant Administrator role to an Active Directory group, named ‘vRA-TenantAdmins’, of which my user account is a member. This shows when I look at my user account through ‘Users and Groups’ (the square, rather than a tick, indicates the permission is implicit):

[Image: Roles_1]

Now, I can do the stuff a Tenant Administrator should be able to do, with some weird exceptions. For example, when I try to look at what directories have been added to vIDM, the interface just hangs at refreshing the list of directories:

[Image: Dir_Hanging]

And the same when I look at identity providers:

[Image: Provider_hanging]

And I can’t do login screen branding (although header and footer branding works fine!):

[Image: branding_fail]

I smashed my face off this problem for a few hours, but it turns out the fix was fairly simple (although it should be unnecessary). If I go to my account again, under ‘Users and Groups’, and add my account explicitly to the ‘Tenant Administrator’ role, then the functionality all mysteriously works.

[Image: Roles_2]

This is pretty annoying, as I want to use Role Based Access Control (RBAC), with Active Directory controlling access for user accounts. Hopefully this will be fixed in the next release of vRealize Automation, and I hope this post helps anyone seeing the same obscure behaviour I did.

vSphere HTML5 Client Fling Deployment Script

So yesterday VMware released the HTML5 vSphere Client as a fling, available for download here. I have put together a PowerShell script to deploy it to your vSphere environment.

It seems unusual for this to take the form of an OVA, but at least this means it does not touch your existing vCenter, so it should be deployable with less apprehension.

The client itself is issued an IP address from an IP Pool, and is therefore accessed on a different IP address from vCenter. Deployment of the OVA is pretty straightforward, and instructions for setup and use are in the link above.

There are already a tonne of posts about the features present, and not present, in the vSphere HTML5 Client, so I am not going to go over that here; suffice to say, it is a fling for a reason.

This script (at first release) assumes that a valid, enabled IP Pool already exists in vCenter for the IP you allocate to the VM; I will add functionality in the next release to create an IP Pool if one is not already present.

Other than that, you should just need to replace the variables at the top of the script to use it for deployment. The script is available on my GitHub repository at this link.
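For reference, the core of the deployment is just a standard PowerCLI OVA import. The snippet below is a minimal sketch of that part only, not the full script: the file path, host, datastore, and port group names are placeholders, and the exact OVF property names depend on the appliance's OVF descriptor, so inspect the object returned by Get-OvfConfiguration before relying on them.

# Placeholders: path, host, datastore and port group names are illustrative
$ovaPath   = "C:\Downloads\VMware-HTML5-Client.ova"
$vmHost    = Get-VMHost -Name "esx01.lab.local"
$datastore = Get-Datastore -Name "Datastore01"

# Read the OVF configuration so properties such as the network mapping can be set
# (the exact property path depends on the OVF descriptor)
$ovfConfig = Get-OvfConfiguration -Ovf $ovaPath
$ovfConfig.NetworkMapping.Network.Value = "VM Network"

# Deploy the appliance
Import-VApp -Source $ovaPath -OvfConfiguration $ovfConfig -Name "vSphere-HTML5-Client" `
    -VMHost $vmHost -Datastore $datastore -DiskStorageFormat Thin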


PowerShell – Could not create SSL/TLS secure channel

I have spent a considerable amount of time in my life battling with the above error message when running PowerShell scripts. The long and short of it is that this can be caused by a few things, but most of the time I have experienced it, the reason is that the endpoint you are trying to connect to is using self-signed certificates, which causes the Invoke-WebRequest and Invoke-RestMethod cmdlets to throw an error stating:

The underlying connection was closed: Could not establish trust relationship for the SSL/TLS secure channel.

If you hit this, you will know about it, as your web requests via the standard REST cmdlets will simply refuse to give you anything back.

I had a bunch of scripts written to automate the configuration of vRealize Orchestrator and vRealize Automation 7.0, and these had been heavily tested and confirmed as working. The way to avoid the above error is to use the following PowerShell function:

function Ignore-SelfSignedCerts
{
    try
    {
        Write-Host "Adding TrustAllCertsPolicy type." -ForegroundColor White
        Add-Type -TypeDefinition @"
using System.Net;
using System.Security.Cryptography.X509Certificates;
public class TrustAllCertsPolicy : ICertificatePolicy
{
    public bool CheckValidationResult(
        ServicePoint srvPoint, X509Certificate certificate,
        WebRequest request, int certificateProblem)
    {
        return true;
    }
}
"@
        Write-Host "TrustAllCertsPolicy type added." -ForegroundColor White
    }
    catch
    {
        Write-Host $_ -ForegroundColor "Yellow"
    }
    [System.Net.ServicePointManager]::CertificatePolicy = New-Object TrustAllCertsPolicy
}
Ignore-SelfSignedCerts;

So not a great start to my Sunday when I found that my scripts no longer worked after a fresh install of the recently released vRealize Orchestrator and vRealize Automation 7.0.1.

After much messing about, I worked out the cause: SSLv3 and TLSv1.0 are both disabled in the new releases. As a result we need to either:

a) Enable SSLv3 or TLSv1.0 – probably not the best idea; these have been disabled due to the growing number of security risks in these protocols, and will (presumably) remain disabled in every new version of the products going forward

b) Change the way we issue requests, to use TLSv1.2 – this is the way to do it in my opinion, and the code to do this is a simple one-liner:

[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12;

So if you hit this problem (and if you are a PowerShell scripter interacting with REST APIs from your scripts, then you probably will!), then this is how to fix it.
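Putting the pieces together, a typical script preamble ends up looking something like the sketch below; the endpoint URI is a placeholder for illustration, and the Ignore-SelfSignedCerts function is the one shown earlier in this post, only needed if the endpoint still uses self-signed certificates.

# Force .NET to negotiate TLS 1.2 for all web requests in this session
[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12;

# Still required for self-signed certificates (function defined earlier in this post)
Ignore-SelfSignedCerts;

# Example call - the URI below is a placeholder for your own endpoint
$uri = "https://appliance.lab.local/api/endpoint";
$response = Invoke-RestMethod -Uri $uri -Method Get -ContentType "application/json";
$response;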

vSphere PowerCLI 6.3 Release 1 – in the wild…

Yesterday VMware released PowerCLI 6.3 Release 1; this follows the fairly exciting release of new products across the VMware portfolio, including:

  • vSphere 6.0 Update 2
  • vRealize Automation 7.0.1
  • vCloud Director 8.0.1

While I am not rushing to update to the new version of vCenter or ESXi in production environments, upgrading to this latest version of PowerCLI is far less risky, so I immediately upgraded and checked out the new features.

The latest release adds the following new support:

  • Support for Windows 10 and PowerShell 5.0 – as a Windows 10 user (for my personal laptop and home PC at least), this is a welcome addition. Windows Server 2016 is just around the corner as well, so this should ensure that PowerCLI 6.3 R1 works there too. I have not seen any problems running the previous version of PowerCLI on my Windows 10 machines, but at least this is officially tested and supported now
  • Support for vCloud Director 8.0 – VMware are driving vCD forward again, so if you are using the latest versions, and use PowerCLI to help make your life easier (and if you’re not, then why not?), this will be a welcome addition
  • Support for vRealize Operations Manager 6.2 – there are still only 12 cmdlets available in the VMware.VimAutomation.vROps module, but this bumps up support for the latest version anyway

And adds the following new features:

  • Added Content Library support – I haven’t really got into the whole Content Library thing just yet, but this feature was introduced in vSphere 6.0, and was previously only automatable through the new vSphere REST API. This release of PowerCLI includes cmdlets to let you work with the Content Library; I will probably do a follow-up post on configuring it at a later date
  • Get-EsxCli functionality updated – for those that don’t know, Get-EsxCli lets you run esxcli commands via PowerShell against a target host. This is useful for certain things which are not really possible through the standard PowerCLI host management cmdlets, and this release brings in advanced functionality in this area (see the sketch at the end of this post)
  • Get-VM command – this cmdlet has been streamlined to return results more quickly, which should help in larger environments

So all in all, some minor improvements, some new features, and some updates to support for newer VMware products. A solid release which will keep PowerCLI relevant as a tool in a vSphere admin’s arsenal. If you’re not already using PowerCLI, then get on the bandwagon, there are some great books and videos out there, and a fantastic community to help you along.
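As a quick taste of the Get-EsxCli functionality mentioned above, the sketch below lists the physical NICs of a host, the equivalent of running ‘esxcli network nic list’ on the host itself. The vCenter and host names are placeholders; the object returned by Get-EsxCli simply mirrors the esxcli namespaces available on that host.

# Placeholders: vCenter and host names are illustrative
Connect-VIServer -Server "vcenter.lab.local"
$esxhost = Get-VMHost -Name "esx01.lab.local"

# Get an esxcli object for the host and call into its namespaces,
# e.g. the equivalent of 'esxcli network nic list'
$esxcli = Get-EsxCli -VMHost $esxhost
$esxcli.network.nic.list()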

Getting started with vRealize APIs

I have been doing automation with VMware products for a while, mostly using PowerCLI, which I have blogged about in the past. In the last few weeks I have started configuring and working with vRealize Orchestrator and vRealize Automation, and hands-off automation in these products is carried out using their respective REST APIs.

REST is something I’d pretty much had zero experience of, so I have had to pick it up from nothing and figure out how to work in this fashion. My plan is to do a series of blog posts on how to get started with these interfaces, configuring and working with these products, with tips on where to start. I’ve had some great support from my fellow automation people at work, but I think I have a relatively good grip on how to do things now.

REST stands for ‘Representational State Transfer’ and uses the HTTP protocol as its basic interface. With this being a standard way to do things now, and with ports 80 and 443 (and the security around using them) generally being well understood, you should not have any issues with applications requiring complex numbers of ports to be open to your application servers. REST is becoming somewhat ubiquitous, with seemingly every new piece of hardware and software being released coming with its own REST interface.

This is great for people wanting to automate, because we do not need any special software to make REST calls, and we can access the API from pretty much any OS, and from anywhere with HTTP/HTTPS access to the endpoint. A REST call basically comprises the following components:

  • URI (Uniform Resource Identifier) – basically a URL which exposes a function of the API; if you send a correctly crafted message to this address then you will get a standard HTTP response, which should indicate whether your call was successful.
  • Body – also known as the payload, this contains instructions on what we want to do, and will usually (in my experience) be in either JSON (JavaScript Object Notation) or XML (eXtensible Markup Language). These are both structured text formats which allow you to define the parameters for what you want to do. If you are sending file-type data then the body may also be in multipart MIME format.
  • Headers – these carry information about the message you are sending, such as the data type, authentication information, and desired response type, and are carried as a hash table of names and values.
  • Method – REST uses standard HTTP methods and which method you use will depend on the API and function you are accessing. Primarily this will probably be one of the following:
    • GET – this makes no changes, and is used for retrieving information. When you access a web page, the HTTP request uses the GET method, and by using this method with an API, you can generally expect to be returned an XML/JSON response with information about how a component is configured.
    • POST – this is used for sending information to an API to be processed. Again, you should expect a return code for this, and even if this is an error, there should (depending on the call, and the API) be something meaningful telling you what went wrong
    • PUT – often interchangeable with POST, this is generally used for replacing configuration in its entirety. Again, you can reasonably expect some feedback when using this method.
    • DELETE – removes configuration from a specific element or component.

The exact way in which a call is sent depends on the API being used, and this is where good documentation is of crucial importance; if documentation is lacking, or just wrong (as is the case far more often than it should be), then getting the call to succeed can be challenging.

To get started you will need a tool which can craft calls for you, and show the response received. When starting out, it is easiest for this to be a graphical tool which gives instant and meaningful feedback on the response; although this is not great for doing things in an automated and programmable fashion, it does make things simpler, and makes the learning experience easier.

If you use Chrome or Firefox then you should be able to easily find some REST clients, and it may well be worth trying a few different ones until you find one which works best for you. Postman was recommended to me, and this has a nice graphical UI which will do code highlighting on the response, and will help you to build headers and the like.

Ultimately, if you are looking to automate, then you will be using a built-in way of sending REST calls from your chosen automation language. My experience of doing this is either from BASH scripts (I know, I know), where I use curl, or through PowerShell, where you have the Invoke-RestMethod and Invoke-WebRequest cmdlets.

When using these kinds of tools (or indeed, doing any kind of programmatic REST call), you will need to form the header yourself. As I said above, this is basically a hash table of key-value pairs.
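In PowerShell, for example, that header is literally just a hashtable. The snippet below shows a typical set of headers for a JSON REST call using HTTP Basic authentication; the credentials are placeholders for illustration:

# Build a Base64-encoded credential for HTTP Basic authentication
# (placeholder credentials for illustration)
$pair    = "someuser:somepassword"
$encoded = [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($pair))

# A typical header hashtable for a JSON REST call
$headers = @{
    "Authorization" = "Basic $encoded";
    "Accept"        = "application/json";
}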

So let’s do an example REST call. For the example below, I will be using the vRealize Orchestrator 7.0 API. This is a pretty new release, but it can be downloaded from the VMware website, and is deployed from OVA, so should be quick to download and spin up. The new version has two distinct APIs: one for the new Orchestrator Control Center, which is used to configure the appliance, and one for the vRO application itself, used for orchestration of a vSphere (and beyond!) environment. I will show this using both PowerShell code, and the Postman GUI.

vRealize Orchestrator has a fairly straightforward API; you can access the documentation by opening your browser and going to ‘https://<vro_ip>:8281/vco/api/docs’, where you will be presented with this screen:

[Screenshot: vRO API documentation landing page]

From here we can explore any area exposed by the API. For the sake of this example we’re going to do something simple: return a list of the available workflows in the vRO system. So select ‘Workflow Service’, click ‘GET /workflows’, and we can see a bit of information about the REST call to list workflows. This is what it shows:

[Screenshot: GET /workflows documentation]

This being a ‘GET’ call, we don’t see a lot here, but we will run the call anyway and see what we get back; in later articles we will go through changing configuration. First we will make the call using PowerShell. The script is as follows, and is pretty simple:

# Ignore SSL certificates
# NOTE: This should not really be required if you are using
# proper certificates and have working DNS
Add-Type @"
using System.Net;
using System.Security.Cryptography.X509Certificates;
public class TrustAllCertsPolicy : ICertificatePolicy {
    public bool CheckValidationResult(
        ServicePoint srvPoint, X509Certificate certificate,
        WebRequest request, int certificateProblem) {
        return true;
    }
}
"@
[System.Net.ServicePointManager]::CertificatePolicy = New-Object TrustAllCertsPolicy

# Define our credentials
$user_name = "vcoadmin";
$password = "vcoadmin";
$vro_server_name = "192.168.1.201";

# Convert the credentials to a Base64 string we can use in a
# Basic Authorization header
$auth = $user_name + ":" + $password;
$Encoded = [System.Text.Encoding]::UTF8.GetBytes($auth);
$EncodedPassword = [System.Convert]::ToBase64String($Encoded);
$header = @{"Authorization" = "Basic $($EncodedPassword)"};

# Define the URI for our REST call - the workflow list
$vro_getworkflows_uri = "https://" + $vro_server_name + ":8281/vco/api/workflows";

# Run the call and store the result in a variable
$get_workflows = Invoke-WebRequest -Uri $vro_getworkflows_uri -Method Get -Headers $header -ContentType "application/json"

This returns the result of the call to a variable. Unfortunately, although the content is JSON, it will be horribly formatted if you try to output it:

[Screenshot: raw, unformatted JSON output]

If we run:

$get_workflows.Content | ConvertFrom-Json | ConvertTo-Json

Then the built-in JSON formatting will reformat it to make it readable:

[Screenshot: formatted JSON output]

This gives us a readable list that we can refer to, filter on, and do all the cool stuff that PowerShell makes simple.
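As a quick example of that filtering, the snippet below pulls out just the workflow names from the response. This assumes the response follows the usual vRO REST convention of a ‘link’ collection where each entry has an ‘attributes’ list of name/value pairs; check the structure of your own output before relying on it.

# Convert the raw JSON into PowerShell objects
$workflows = $get_workflows.Content | ConvertFrom-Json

# Assuming a 'link' collection with 'attributes' name/value pairs,
# list just the workflow names
$workflows.link | ForEach-Object {
    ($_.attributes | Where-Object { $_.name -eq "name" }).value
}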

We will now look at doing the same thing with Postman. First you will need to install the Postman app from the Chrome Web Store, then open it, and you will see this:

[Screenshot: Postman main window]

We know our credentials and URI, so we’re just going to go ahead and step through making the request. First, enter the URI in the box:

[Screenshot: request URI entered in Postman]

Now we need to add our authentication method, so change ‘No Auth’ to ‘Basic Auth’, enter the credentials for the ‘vcoadmin’ user, and click ‘Update request’:

[Screenshot: Basic Auth credentials entered in Postman]

You will notice that ‘Headers (0)’ has changed to ‘Headers (1)’; if you click on that, you can see the header for the request, with the encoded credential:

[Screenshot: generated Authorization header in Postman]

Now click ‘Send’ and our REST call will be submitted; the same JSON we got earlier will be nicely formatted and displayed in the lower pane:

[Screenshot: formatted JSON response in Postman]

And that’s a REST call completed, with the result returned; we now know how to do it using PowerShell, to make programmatic calls, and with Postman, to explore and understand APIs.

Integrating Platform Services Controller/vCSA 6.0 with Active Directory using CLI

I am currently automating the build and configuration of a VMware vCenter environment, and the internet has been a great source of material in helping me with this, particularly William Lam and Brian Graf’s websites. It seems VMware have done a great job with the changes in vSphere 6.0 in enabling automated deployments; this follows the general trend of the industry in driving automation and orchestration into everything we do in the Systems Administration world.

One thing I needed to do, which I could not find any information on, was joining my standalone Platform Services Controller (PSC) to an AD domain. This is easy enough in the GUI, and is documented here. It was important for me to automate this however, so I trawled through the CLI on my PSC to figure out how to do it.

I stumbled across the following command which joins you to the AD domain of your choosing.

/usr/lib/vmware-vmafd/bin/vmafd-cli join-ad --server-name <server name> --user-name <user-name> --password <password> --machine-name <machine name> --domain-name <domain name>

Once this is completed the PSC will need restarting to enable the change; this will add the PSC to Active Directory. The next challenge was finding a scripted method to add the identity source. Once the identity source is added, permissions can be set up as normal in vCenter using it.

Again, I had to trawl through the PSC OS to find this; the script is as follows:

/usr/lib/vmidentity/tools/scripts/sso-add-native-ad-idp.sh <Native-Active-Dir-Domain-Name>

Both of these commands can be carried out through an SSH session to your PSC (or embedded PSC/vCSA server). Assuming you have BASH enabled on your PSC, you can also invoke them remotely using the PowerCLI ‘Invoke-VMScript’ cmdlet, as sketched below; this should help in fully automating the deployment of a vCenter environment.
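A minimal sketch of that remote invocation is below. The VM name, guest credentials, and vmafd-cli parameter values are placeholders for illustration (map them onto the parameters of the command shown above), and it assumes VMware Tools is running and BASH is available in the appliance:

# Placeholders: VM name, guest credentials and domain details are illustrative
$pscVM  = Get-VM -Name "psc01"
$adJoin = "/usr/lib/vmware-vmafd/bin/vmafd-cli join-ad --server-name psc01 --user-name svc_adjoin --password 'P@ssw0rd!' --machine-name psc01 --domain-name lab.local"

# Run the domain join inside the PSC appliance via VMware Tools
Invoke-VMScript -VM $pscVM -ScriptText $adJoin -ScriptType Bash -GuestUser root -GuestPassword 'RootP@ss!'

# Restart the appliance to complete the join, then add the identity source
Restart-VMGuest -VM $pscVM -Confirm:$false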

As an aside, one issue I did have, which is discussed in the VMware forums, is that I was getting the error ‘Error while extracting local SSO users’ when enumerating users/groups from my AD in the VMware GUI. This was fixed by creating a PTR record in DNS for the Domain Controller; it seems this is needed by the new VMware SSO at some point.

I hope this is useful to people, and hopefully VMware will document this sort of automation in the future, in the meantime, as I said above, William Lam and Brian Graf’s sites are a good source of information.

Transparent Page Sharing – for better or worse

Transparent Page Sharing (TPS) is one of the cornerstones of memory management in the ESXi hypervisor. It is one of the many technologies VMware have developed which allow higher VM consolidation ratios on hosts, through intelligent analysis of VM memory utilisation and deduplication based on that analysis.

There are two types of TPS: intra-VM and inter-VM. The first scans the memory in use by a single VM and deduplicates common block patterns; this lowers the total host memory consumption of the VM without significantly impacting its performance. The second type, inter-VM TPS, does the same thing, but looks for common blocks across the memory of all VMs on a given host.

Historically this has been a successful reclamation technique, and has led to decent savings in host memory consumption; however, most modern Operating Systems seen in virtualised environments (most Linux distributions, and Windows Server 2008 onwards) now use memory encryption by default, so the chance of the TPS daemon finding common blocks becomes less and less likely.

If CPU integrated hardware-assisted memory virtualisation features (AMD Rapid Virtualisation Indexing (RVI) or Intel Extended Page Tables (EPT)) are utilised for an ESXi host, then the hypervisor will use 2MB block size for its TPS calculations, rather than the normal 4KB block size. Attempting to deduplicate in 2MB chunks is far more resource intensive, and far less successful, than running the same process in 4KB chunks, thus ESXi will not attempt to deduplicate and share large memory pages by default.

The upshot of this is that 2MB pages are scanned, and the 4KB blocks within these large pages are hashed, in preparation for inducing memory sharing should the host come under memory contention, in an effort to prevent swapping. Pre-hashing these 4KB chunks means that TPS can react quickly, deduplicating and sharing the pages should the need arise.

All good so far: this technique should help us to save a bit of memory, although since memory virtualisation features in modern CPUs are widespread, and larger amounts of host memory are more common, the potential TPS savings should hopefully never be needed or seen.

At the back end of last year, VMware announced that they would be disabling TPS by default in ESXi, following an academic paper which showed a potential security vulnerability in TPS which, if exploited, could result in sensitive data being made available from one VM to another utilising memory sharing. It should be noted that the technique used to exploit this was demonstrated under highly controlled, laboratory-style conditions, requires physical access to the host in question, and that the researcher highlighting it never actually managed to glean any data using this method.

Despite the theoretical nature of the published vulnerability, VMware took the precautionary approach, and so in ESXi 5.0, 5.1, 5.5 and now 6.0, with the latest updates, TPS is now disabled by default. What does this mean for the enterprise though, and what choices do we have?

1. Turn TPS on regardless – if you have a dedicated, internal-only infrastructure, then it may be that you do not care about the risks exposed by the research. If you and your team of system administrators are the only ones with access to your ESXi servers, and the VMs within, as is common in many internal IT departments, then there are likely far easier ways to get access to sensitive data than this theoretical technique anyway

2. Turn off TPS – if you are in a shared, Service Provider style infrastructure, or in an environment requiring the highest security, then this should be a no-brainer. The integrity and security of the client data you have on your systems should be foremost in your organisation’s mind, and in the interests of precaution, and good practice, you should disable TPS on existing systems and leave it off

These were the options presented by VMware until about a week ago, when an article was published describing a third option being introduced in the next ESXi updates:

3. Use TPS with guest VM salting – this allows selective inter-VM TPS enablement among sets of selected VMs, allowing you to limit the potential areas of vulnerability, using a combination of edits to .vmx files, and advanced host settings. This may be a good middle ground if you are reliant on the benefits provided by TPS, but security policy demands that it not be in the previous default mode

So these are our options, and regardless of which one you choose, you need to know what the difference in your environment will be if you turn off TPS; this is going to be different for everyone. The savings currently being delivered by TPS can be calculated, and this should give you some idea of what the change in host memory utilisation will be once TPS is disabled.

The quickest way to see this for an individual host is via the Performance tab in the vSphere Client; if you look at real-time memory usage and select the ‘Consumed’, ‘Shared’, ‘Shared common’ and ‘Used by VMkernel’ counters, then you will be able to see how much memory is consumed in total by your host, and how much of this is being saved through TPS:

[Image: TPS_vSphere_Client]

Here we can see:

TPS % saving = (Shared – Shared common) / (Consumed – Used by VMkernel) * 100% = (5743984 – 112448) / (160286356 – 3111764) * 100% = 5631536 / 157174592 * 100% = 3.58%

So TPS is saving around 5.6GB or 3.6% of total memory on the host being consumed by VMs. This is a marker of the efficiency of TPS.

The same figures can be taken from esxtop if you SSH to an ESXi host, run esxtop, and press ‘m’ to get to memory view.

[Image: TPS_esxtop]

Here we are looking at the PSHARE value; we can see the saving is 5607MB (which ties up with the figure from the vSphere Client above), and the memory consumed by VMs can be seen under PMEM/other, in this case 153104MB. Again we can calculate the percentage saving TPS is giving us by dividing the saving by the consumed VM memory and multiplying by 100%:

TPS % saving = PSHARE saving / PMEM other * 100% = 5607 / 153104 * 100% = 3.66%

So this is how we can calculate the saving for each host, but what if you have dozens, or hundreds, of hosts in your environment? Wouldn’t it be great to get these stats for all of them? Well, the easiest way to get this kind of information is usually through PowerCLI, so I put the following script together:


# Ask for the vCenter to connect to
$vcenter = Read-Host "Enter vCenter Name or IP"

# Set up our constants for logging
$datetime = Get-Date -uformat "%C%y%m%d-%H%M"
$OutputFile = ".\" + $datetime + "_" + $vcenter + "_TPS_Report.csv"

# Connect to vCenter
$Connection = Connect-VIServer $vcenter

$myArray = @()

foreach ($Cluster in Get-Cluster) {
    foreach ($esxhost in ($Cluster | Get-VMHost | Where { ($_.ConnectionState -eq "Connected") -or ($_.ConnectionState -eq "Maintenance") } | Sort Name)) {
        $vmdetails = "" | Select hostname,clustername,memsizegb,memshavg,memshcom,tpssaving,percenttotalmemsaved,tpsefficiencypercent
        $vmdetails.hostname = $esxhost.Name
        $vmdetails.clustername = $Cluster.Name
        $hostmem = Get-VMHost $esxhost | Select -exp MemoryTotalGB
        $vmdetails.memsizegb = "{0:N0}" -f $hostmem
        # Shared and shared common memory (KB)
        $vmdetails.memshavg = [math]::Round((Get-VMHost $esxhost | Get-Stat -Stat mem.shared.average -MaxSamples 1 -Realtime | Select -exp value),2)
        $vmdetails.memshcom = [math]::Round((Get-VMHost $esxhost | Get-Stat -Stat mem.sharedcommon.average -MaxSamples 1 -Realtime | Select -exp value),2)
        # TPS saving = shared minus shared common
        $vmdetails.tpssaving = $vmdetails.memshavg - $vmdetails.memshcom
        $vmdetails.percenttotalmemsaved = [math]::Round(([int]$vmdetails.tpssaving / ([int]$vmdetails.memsizegb * 1024 * 1024)) * 100,2)
        # Memory consumed by VMs = consumed minus VMkernel usage (KB)
        $consumedmemvm = [math]::Round(((Get-VMHost $esxhost | Get-Stat -Stat mem.consumed.average -MaxSamples 1 -Realtime | Select -exp value) - (Get-VMHost $esxhost | Get-Stat -Stat mem.sysUsage.average -MaxSamples 1 -Realtime | Select -exp value)),2)
        $vmdetails.tpsefficiencypercent = [math]::Round(([int]$vmdetails.tpssaving / $consumedmemvm) * 100,2)
        $myArray += $vmdetails
    }
}
Disconnect-VIServer * -Confirm:$false

$myArray | Sort hostname | Export-Csv -Path $OutputFile -NoTypeInformation

This script will dump out a CSV with every host in your vCenter, telling you the percentage of total host memory saved by TPS, and the efficiency of TPS in your environment. This should help provide some idea of what the impact of turning TPS off will be.

Ultimately, your organisation’s security policies should define what to do after the next ESXi updates, and how you should act in the meantime. TPS is definitely a useful feature, and does allow for higher consolidation ratios, but security vulnerabilities should not be ignored. Hopefully this post gives you an idea of how TPS is currently impacting your infrastructure.

Ballooning – the lowdown

VMware Tools does many things for us as administrators, automating much of the resource management and monitoring we need so we can have confidence in managing the sprawling clusters of a private cloud.

Below are some of the benefits VMware Tools affords us:

  • Driver optimisation
  • Power settings management
  • Time synchronisation
  • Advanced memory management
  • VM Heartbeating

VMware Tools is available for every supported guest Operating System, and with the exception of certain appliances (which often come with 3rd Party tools installed anyway), installation of VMTools should be common practice.

In this article I am going to talk about one specific memory management technique: ballooning. Ballooning is a method for reclaiming host memory in times of contention, allowing more workloads to run on a host without resorting to swapping.

The VMware Tools service/daemon runs as any other process does in an Operating System, and can request resources the same as any other process. When memory contention is high on a host, and VM-allocated memory would otherwise need to be swapped to disk (the .vswp file, see here for more information on that), the hypervisor will send a request to the VMware Tools process running in its guest VMs to try and reclaim memory.

Often, software running in an OS will not release memory when it is done with it, this can lead to memory being tied up in the OS, and can therefore not be released to the hypervisor to return to the pool of available host memory.

When the VMware Tools process receives the signal, it can request that up to 65% of the guest’s memory be relinquished to it, at which point it releases that memory back to the hypervisor. Because of the way we allocate memory to VMs (vRAM), a VM often has more memory available to it than it requires, and as such most VMs will have memory which can be returned.

Ballooning is a normal part of the automated memory management which ESXi provides; high amounts of ballooning do not usually indicate a problem, although a large amount of ballooning activity could tell you that memory is overcommitted.

The only time ballooning can cause a problem is when the balloon driver reclaims memory pages required by the guest OS; in this case it can lead to swapping inside the guest, which could lead to performance degradation.

tl;dr – ballooning is a normal part of vSphere’s memory management, assisting in pushing up consolidation ratios. Don’t worry about it, but it is good to be aware of what it does, and how it works.
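If you want to see how much ballooning is actually going on, the mem.vmmemctl.average statistic shows the amount of memory currently reclaimed by the balloon driver per VM. The sketch below reports this for all powered-on VMs; the vCenter name is a placeholder:

# Placeholder: vCenter name is illustrative
Connect-VIServer -Server "vcenter.lab.local"

# mem.vmmemctl.average = memory currently reclaimed by the balloon driver (KB)
Get-VM | Where-Object { $_.PowerState -eq "PoweredOn" } | ForEach-Object {
    $balloonKB = ($_ | Get-Stat -Stat mem.vmmemctl.average -Realtime -MaxSamples 1).Value
    [pscustomobject]@{
        VM          = $_.Name
        BalloonedMB = [math]::Round($balloonKB / 1024, 0)
    }
}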

VM Swap File location considerations

When a VM is powered on, a .vswp (Virtual Machine swap) file is created (note there is also a vmx swap file which gets created in the same location as the VM; this is separate from this discussion, but is described here). Its size is the memory allocation for the VM less any reservation configured. If there is not sufficient space in the configured swap file location to create this file then the VM will not power on.

The use of this file for memory pages is a last resort, and will be considerably slower than using normal memory, even if that memory is compressed or shared. It will always be possible to get into a situation where memory contention is occurring and use of the swap file begins; to prepare for this, a system design should consider the location of swap files for VMs. Below I discuss some of the considerations which should be made when placing a VM swap file:

  • Default location for swap file is to store it in the same datastore as the VM, this presents the following problems:
    • Performance – it is unlikely that the datastore the VM sits in is on top tier storage, or limited to a single VM. This means that the difference in speed between memory IO and the swapping IO once contention occurs will be great, and that the additional IO this swapping produces could well impact other workloads on the datastore. If there are multiple VMs sharing this datastore, and all running on the host with memory contention issues, then this will be further compounded and could see the datastore performance plummet
    • Capacity – inevitably, administrators will keep chucking workloads into a datastore, and unless Storage DRS with datastore clusters is being used, or the administrators are pro-active in balancing storage workloads, there will come a time when a VM will not power up due to insufficient space to create the .vswp file. This is particularly likely after a change to the VM configuration such as adding more disk or memory

VM swap file location can be changed either at the cluster, or host level. When choosing to move this from the default, the following should be considered:

  • Place swap files on the fastest storage possible – if you can place this on flash storage then fantastic; this will not be as quick as paging to/from memory, but it will be many orders of magnitude better than placing it on spinning disk
  • Place swap files as close to the host as possible – the latency incurred by traversing your SAN/IP network to get to shared storage will all impair guest performance when swapping occurs. Be aware that although the default location can be changed to host local storage (which will probably give the best performance if the host has internal flash storage), this will impair vMotion performance massively, as the entire .vswp file would need to be copied from the source host’s disk to the destination host’s disk during the vMotion activity
  • Do not place the .vswp on replicated storage – as with location selection for guest OS swap files, there is no point in placing the file on replicated storage; if the VM is reset or powered off then this file is deleted. If your VMs are on storage which is replicated as part of its standard capability then the .vswp files should definitely be located elsewhere


In terms of configuring the location, as stated above, this is set at either the VM, host, or cluster level. If this is inconsistent across hosts in a cluster then again this may impact vMotion times as a VM migrates from a host with one configured location to another with a different location. As with most settings which can be made at the cluster level, consistency should be maintained across the cluster unless this is not possible. Bear in mind, though, that having the .vswp location consistent across the cluster, and defined as a single datastore, could lead to high IOPS on that datastore should cluster-wide memory contention occur, especially with large clusters.

As stated at the beginning of this article, swap files are sized based on VM memory allocation less the reservation size. By right-sizing VMs, and utilising reservations, swap file sizes, and usage, can be kept to a minimum, and these planning considerations should take precedence over all others. Hopefully memory contention will never be so bad that swap is required, but when the day does come it is good to be prepared, having made informed and reasoned decisions early on.
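If you do decide to move swap files off the default location, both the cluster policy and the per-host datastore can be set from PowerCLI. A minimal sketch is below; the cluster and datastore names are placeholders, and it assumes each host has a suitably named local datastore to use:

# Placeholders: cluster and datastore names are illustrative
# Tell the cluster to use the swap file datastore configured on each host
Set-Cluster -Cluster "Prod-Cluster" -VMSwapfilePolicy InHostDatastore -Confirm:$false

# Point each host at a fast, local (non-replicated) datastore for .vswp files,
# assuming a naming convention of <hostname>-local-ssd
Get-Cluster "Prod-Cluster" | Get-VMHost | ForEach-Object {
    Set-VMHost -VMHost $_ -VMSwapfileDatastore (Get-Datastore -Name "$($_.Name)-local-ssd") -Confirm:$false
}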

VMFS Extents – to extend or not to extend, that is the question

Extents allow disks presented to a vSphere system to be added to a VMFS datastore to extend the file system; this aggregates multiple disks together and can be useful in a number of scenarios. Recently I saw problems where extents were being used spanning two storage systems; one of the storage systems had a controller failure which caused SCSI reservation issues on one of the LUNs making up the extent, and this caused the entire datastore to go offline.

In this article I want to discuss some of the benefits and potential pitfalls in using VMFS extents in vSphere environments. Ultimately this is an available, supported, and sometimes useful feature of vSphere but there are some limitations or weaknesses that using this can bring.

Advantages:

  • Using extents allows you to create datastores up to the maximum size supported by vSphere; for pre-VMFS-5 datastores, where each LUN is limited to 2TB, this is the only way to do so. It can be useful to create large datastores for the following reasons:
    • There may be a requirement to natively present a VMDK which is larger than the maximum LUN size available on your storage system. For example, if 2TB is the largest LUN you can present, but you need a 4TB disk for the application your VM is hosting, the aggregation of disks will allow the creation of a VMFS datastore large enough to deliver this without the need to span volumes in the guest OS, or to fall back to using something like RDMs, which may impinge on other vSphere functionality
    • Datastore management is simplified with fewer VMFS datastores required. The fewer datastores there are, the less an administrator has to keep their eyes on; in addition, decisions about VM placement become considerably simpler when there are fewer choices
  • Adding space to a datastore with capacity issues; in a previous role we were constrained by storage space more than any other resource. This meant that on both the storage system (a NetApp FAS2050 with a single shelf of storage) and at the VMFS level, the design left little to no room to extend a VMDK should it be required. If we did need to add space to a VMDK, we had to extend the volume and LUN by the required amount on the filer, and add a small extent to the datastore in vSphere

Disadvantages:

  • Introduces a single point of failure; whether you are aggregating disks from one or multiple storage systems, adding extents to a volume makes every LUN critical: if the head extent (the first LUN added to the datastore) is lost then the whole datastore becomes unavailable, and if any other LUN becomes unavailable then VMs which have any blocks whatsoever on the lost LUN will no longer be available
  • Management from the storage side can become more difficult; given that there may be multiple LUNs, potentially from multiple storage systems, aggregated to form a single datastore, it is harder on the storage side to identify which LUNs relate to which datastores in vSphere. To combat this it is important to document the addition of extents well, and to label LUNs accordingly on the storage system
  • If extents are combined which span different storage devices then there may well be a loss in performance

The above is all just based on my experiences, but it seems there are legitimate use cases both for and against extents. My personal preference would be to present a new, larger LUN where possible, format it with VMFS, and use Storage vMotion to migrate VMs to the new datastore (a sketch of this approach is below). Given that VMFS-5 introduced GPT as the partitioning method for LUNs, and we can now create single-extent datastores up to 64TB in size, the requirement for using extents should be diminished. There are often legitimate reasons, especially in older environments, why this is not practical or possible however, and in these cases using extents is perfectly valid.
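For completeness, a minimal PowerCLI sketch of that migrate-to-a-new-LUN approach is below. The host, LUN canonical name, datastore, and VM names are placeholders for illustration; check the canonical name of your newly presented LUN with Get-ScsiLun before formatting anything.

# Placeholders: host, LUN canonical name and datastore names are illustrative
$esxHost = Get-VMHost -Name "esx01.lab.local"

# Format the newly presented, larger LUN as a VMFS datastore
$newDS = New-Datastore -VMHost $esxHost -Name "Datastore-Large01" `
    -Path "naa.60003ff44dc75adc0000000000000001" -Vmfs

# Storage vMotion all VMs off the old, extent-based datastore
Get-Datastore -Name "Datastore-Old01" | Get-VM | Move-VM -Datastore $newDS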