Start Time (UTC): October 24, 2019, 22:20 hours UTC
End Time (UTC): October 25, 2019, 22:25 hours UTC
Duration: 24hrs 5mins
VMware Cloud on AWS (VMC) experienced an operational issue that caused us to inadvertently remediate hosts that had not actually failed. Customers may have noticed an unusual amount of host replacement activity due to this error.
Users may have seen multiple hosts added to their SDDCs, followed by an equal number of hosts being removed, because hosts are always added and removed in pairs when remediation is performed. The net effect is that each SDDC eventually returned to its original size, and existing workloads should not have been impacted. All hosts were placed into maintenance mode prior to their removal, and no hosts with running VMs were removed from an SDDC. However, an SDDC may have experienced higher than normal levels of vMotion traffic as workloads were rebalanced across the new hosts. As per our normal policy, customers are not billed for host maintenance, and this activity will not affect customer bills.
VMC has a monitoring system that monitors the health of an SDDC and sends health events to a host remediation service in VMC. The remediation service decides how to react to each event based on the health of the underlying host. On October 24, 2019, a monitoring agent service update was deployed to the fleet. This resulted in a very large number of events being sent to the remediation service in a short span of time. With the high volume of events, the service in some cases could not determine the host health in time, so it erred on the side of caution and added a new host to maintain the customer SLA. In the majority of cases, the service was then able to correctly determine the host health and removed the new host, leaving the original host intact.
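
The remediation behavior described above can be pictured as two decisions: one made when a health event arrives, and one made after a new host has been added. The following Python sketch is illustrative only and is not the actual VMC implementation. The names `Health`, `Action`, `initial_decision`, and `follow_up_decision` are hypothetical, a health check that does not complete in time is modelled as `None`, and the handling of a confirmed failure or a second timeout is an assumption that this report does not describe.

```python
from enum import Enum
from typing import Optional


class Health(Enum):
    """Result of a host health check (hypothetical model)."""
    HEALTHY = "healthy"
    FAILED = "failed"


class Action(Enum):
    """Actions the illustrative remediation service can take."""
    NO_ACTION = "no_action"
    ADD_HOST = "add_host"
    REMOVE_NEW_HOST = "remove_new_host"
    REMOVE_ORIGINAL_HOST = "remove_original_host"
    WAIT = "wait"


def initial_decision(health: Optional[Health]) -> Action:
    """Decide how to react to a health event.

    `health` is None when the check could not finish in time,
    for example because the service is flooded with events.
    """
    if health is Health.HEALTHY:
        return Action.NO_ACTION
    # Failed or undetermined: err on the side of caution and add
    # a host so that capacity is preserved.
    return Action.ADD_HOST


def follow_up_decision(health: Optional[Health]) -> Action:
    """Decide what to do once the new host has been added."""
    if health is Health.HEALTHY:
        # The original host was fine after all, so the extra host
        # is removed and the SDDC returns to its original size.
        return Action.REMOVE_NEW_HOST
    if health is Health.FAILED:
        # Assumption: a confirmed failure replaces the original host.
        return Action.REMOVE_ORIGINAL_HOST
    # Assumption: health still unknown, so keep both hosts and retry.
    return Action.WAIT


if __name__ == "__main__":
    # A timed-out check on a host that is later found to be healthy.
    print(initial_decision(None).value)
    print(follow_up_decision(Health.HEALTHY).value)
```

In this model, a burst of events that causes many health checks to time out produces many `ADD_HOST` actions, each later paired with a `REMOVE_NEW_HOST` action once the original host is confirmed healthy. That matches the add-then-remove pattern customers may have observed.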