VMware Cloud on AWS: Host Remediation Degradation
Incident Report for VMware Cloud Services
Postmortem

Start Time (UTC): October 24, 2019, 22:20 hours UTC
End Time (UTC): October 25, 2019, 22:25 hours UTC
Duration: 24hrs 5mins

Incident Summary:

VMware Cloud on AWS (VMC) experienced an operational issue that caused us to inadvertently remediate hosts that had not actually failed. Customers may have noticed an unusual amount of host replacement activity due to this error.

Impact Summary:
Users may have seen multiple hosts added to their SDDCs, followed by an equal number of hosts being removed. Because hosts are always added and removed in pairs when remediation is performed, the net effect is that each SDDC eventually returned to its original size, and existing workloads should not have been impacted. All hosts were placed into maintenance mode prior to their removal, and no hosts with running VMs were removed from the SDDC. However, the SDDC may have experienced higher than normal levels of vMotion traffic as workloads were rebalanced across the new hosts. As per our normal policy, customers are not billed for host maintenance, and this activity will not affect customer bills.

Root Cause:
VMC has a monitoring system that monitors the health of an SDDC and sends health events to a host remediation service in VMC. The remediation service decides whether to react to each event based on the health of the underlying host. On 10/24/2019, a monitoring agent service update was rolled out to the fleet. This resulted in an unusually large number of events being sent to the service in a short span of time. With the high volume of events, the service in some cases could not determine the host health in time, so it erred on the side of caution and added a new host to maintain the customer SLA. In the majority of cases, the service was then able to correctly determine the host health and removed the new host, leaving the original host intact.
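
The decision flow described above can be illustrated with a minimal sketch. This is not the actual VMC remediation service; the names `Health`, `on_health_event`, `after_host_added`, and the `check_health` callable are hypothetical and exist only to show how a health check that times out under load can lead a fail-safe service to add a host that is later found to be unnecessary.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    FAILED = "failed"
    UNKNOWN = "unknown"  # the health check did not finish in time


def on_health_event(check_health, timeout_s):
    """Return the first action taken for one monitoring event.

    check_health is a hypothetical callable that returns a Health value,
    or raises TimeoutError if it cannot answer within timeout_s seconds.
    """
    try:
        health = check_health(timeout_s)
    except TimeoutError:
        health = Health.UNKNOWN

    if health is Health.HEALTHY:
        return "no_action"
    # FAILED or UNKNOWN: err on the side of caution and add capacity first.
    return "add_host"


def after_host_added(health_now):
    """Once the new host is in place, decide which host to remove."""
    if health_now is Health.HEALTHY:
        return "remove_new_host"       # false alarm: the original host stays
    if health_now is Health.FAILED:
        return "remove_original_host"  # confirmed failure: replace the host
    return "recheck"                   # still unknown: keep both and retry
```

In this sketch, a burst of events that slows `check_health` past its timeout makes every affected event take the `add_host` path, which matches the add-then-remove pattern customers observed once the health of the original host was confirmed.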

Posted Oct 31, 2019 - 17:15 UTC

Resolved
VMware engineering has finished removing all hosts added by erroneous host remediation, and the incident has been resolved.

Impact : None

Start Time: October 24, 2019, 22:20 hours UTC
End Time: October 25, 2019, 22:25 hours UTC
Posted Oct 25, 2019 - 23:09 UTC
Update
We are continuing to work on a fix for this issue.

Impact : Users may have seen multiple hosts added to their SDDCs unnecessarily. Existing workloads are not impacted.

Start Time: October 24, 2019, 22:20 hours UTC
End Time: N/A
Posted Oct 25, 2019 - 16:53 UTC
Identified
VMware Engineering teams are in the process of removing inadvertently added hosts.

Impact : Users may have seen multiple hosts added to their SDDCs unnecessarily. Existing workloads are not impacted.

Start Time: October 24, 2019, 22:20 hours UTC
End Time: N/A
Posted Oct 25, 2019 - 16:35 UTC
Investigating
Please be aware that we have experienced an operational issue that caused us to inadvertently remediate hosts that had not actually failed. A small subset of SDDCs may have seen hosts being added unnecessarily. We are aware of this issue and have stopped it from occurring. As per our normal policy, customers are not billed for host maintenance, and this activity will not affect customer bills.

Impact : Users may have seen multiple hosts added to their SDDCs unnecessarily. Existing workloads are not impacted.

Start Time: October 24, 2019, 22:20 hours UTC
End Time: N/A
Posted Oct 25, 2019 - 16:32 UTC
This incident affected: VMware Cloud on AWS (VMware Cloud on AWS).