When using the VMware vSphere High Availability (HA) feature we can set a number of advanced options, including the failure detection time and additional isolation addresses. There are also some official recommendations from VMware on this topic that, surprisingly, might not be entirely correct.
This article assumes that you are reasonably familiar with HA, and we will only look at some of the advanced settings related to the “Isolation Response”. As you probably know, the ESX/ESXi hosts use the management network for HA heartbeat traffic and to detect failures of partner hosts. If the management network breaks (for any reason) a host could still be alive, but isolated from the rest of the HA hosts. After a certain time of not hearing from any partners the host will try to verify the suspected isolation by pinging a pre-defined test address. If no answer is received, it will perform the pre-configured Isolation Response action. The choice of Isolation Response is very important, but we will not discuss the options here and will only look at the timing of the isolation tests.
So when does this isolation control ping take place? The value that governs this is an advanced setting called das.failuredetectiontime, which by default is 15000 ms, i.e. 15 seconds.
If we assume that a host has suffered from a network failure, but continues to run, then this will happen:
Time: das.failuredetectiontime – 3 (default at the 12th second): A host which hears from no other partners during this number of seconds will suspect it is isolated.
Time: das.failuredetectiontime – 2 (default at 13th second): The host will ping the isolation test address, which by default is the gateway of the management network.
Time: das.failuredetectiontime – 1 (default at 14th second): If the host gets no response from the isolation test address the Isolation Response action will take place. The default Isolation Response on vSphere 4.x is to shut down all local virtual machines.
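The three steps above are simple offsets from das.failuredetectiontime, which a small Python sketch can make explicit. The function name isolation_timeline and the step labels are made up for this illustration; only the offsets (minus 3, 2 and 1 seconds) come from the description above.

```python
def isolation_timeline(failure_detection_ms=15000):
    """Return the second (counted from the last heartbeat) of each step.

    Hypothetical helper: it only encodes the offsets described in the text.
    """
    x = failure_detection_ms // 1000  # das.failuredetectiontime in seconds
    return {
        "suspect_isolation": x - 3,       # host suspects it is isolated
        "ping_isolation_address": x - 2,  # host pings the isolation address
        "isolation_response": x - 1,      # Isolation Response is triggered
    }


print(isolation_timeline())
# {'suspect_isolation': 12, 'ping_isolation_address': 13, 'isolation_response': 14}
```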
If needed you can change the isolation test address (by default the gateway of the management network) to some other IP address, and you can also add more than one IP address to ping. This is done through the advanced options das.usedefaultisolationaddress, das.isolationaddress1 and das.isolationaddress2. With multiple isolation addresses all of them will be pinged, and if any of them responds the host will not declare itself isolated.
What is a bit curious is that VMware recommends that, when setting multiple isolation test addresses, das.failuredetectiontime should be increased from 15000 to at least 20000 ms. The stated reason is to give more room for the ping tests to take place:
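As a rough illustration of the "any address responds" rule, here is a minimal Python sketch. The option names are the real advanced options mentioned above, but the IP addresses (taken from the 192.0.2.0/24 documentation range), the dictionary layout and the function host_is_isolated are assumptions made for this example, not how HA is implemented internally.

```python
# Example advanced options; the addresses are placeholders, not recommendations.
advanced_options = {
    "das.usedefaultisolationaddress": "false",
    "das.isolationaddress1": "192.0.2.1",
    "das.isolationaddress2": "192.0.2.2",
}


def host_is_isolated(ping_replies):
    """ping_replies maps each isolation address to True if it answered.

    The host only declares itself isolated when no address replied.
    """
    return not any(ping_replies.values())


# One of the two addresses answers, so the host is not isolated.
print(host_is_isolated({"192.0.2.1": False, "192.0.2.2": True}))   # False
# Neither address answers, so the Isolation Response would follow.
print(host_is_isolated({"192.0.2.1": False, "192.0.2.2": False}))  # True
```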
“Change the default failure detection time to 20 seconds or greater. In general the more isolation response addresses configured, the longer you should make the timeout to ensure that proper failure detection can occur.”
This sounds reasonable at first. However, will this increase actually be of any use?
If we call das.failuredetectiontime X, then the following always holds:
X – 2 seconds: host will ping the isolation address, one or several
X – 1 second: host will do the isolation response action if no ICMP echo reply arrives
This means that it does not matter what X is set to. Changing X (das.failuredetectiontime) from 15000 ms to 20000, 30000 or 60000 will never give the host “more time” to ping more isolation addresses, since the pings are scheduled relative to das.failuredetectiontime and always happen at X – 2 seconds. We basically just wait longer before testing the isolation addresses, but do not get any more time for the actual tests.
This means there is actually no point in increasing this value (for the reason of having multiple isolation addresses).
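The same argument can be checked with a few lines of Python. This is only a sketch of the arithmetic under the timing described above; the function name ping_window_seconds is invented for the example.

```python
def ping_window_seconds(failure_detection_ms):
    """Seconds between the isolation ping and the Isolation Response."""
    x = failure_detection_ms // 1000  # das.failuredetectiontime in seconds
    ping_at = x - 2                   # isolation addresses are pinged
    response_at = x - 1               # Isolation Response is triggered
    return response_at - ping_at


for ms in (15000, 20000, 30000, 60000):
    print(ms, ping_window_seconds(ms))
# 15000 1
# 20000 1
# 30000 1
# 60000 1
```

Whatever value is chosen, the window for the ping tests stays at one second; only the starting point moves.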
Update: After I raised a question about this contradictory recommendation on the VMTN, HA expert Duncan Epping from VMware replied that he would verify this with the engineering team, and he later confirmed that the recommendation was indeed incorrect. Probably several different situations, each with its own suitable settings, had been mixed up at some point in time.