Incident Report
Summary
On Tuesday, 29 October 2024 at 16:21 US/Pacific, a power voltage swell event impacted the Network Point of Presence (PoP) infrastructure in one of the campuses supporting the australia-southeast2 region, causing networking devices to unexpectedly reboot. The impacted network paths were recovered after the reboot and connectivity was restored at 16:43 US/Pacific.
Incident began at 2024-10-29 16:21 and ended at 2024-10-29 16:43 (all times are US/Pacific).
Impact
GCP operates multiple PoPs in the australia-southeast2 region to connect Cloud zones to each other and to Google’s Global Network. Google maintains sufficient spare capacity to ensure that occasional failures of network capacity are either not noticeable to customers or cause only minimal disruption.
On Tuesday, 29 October 2024, two fiber failures had occurred near this region. These failures reduced the available inter-region network capacity, but did not impact the availability of GCP services in the region.
Then later that same day, the PoP where 50% of the network capacity for the region is hosted experienced a power voltage swell event, causing the networking devices in the PoP to reboot. While the networking devices in this datacenter rebooted, this networking capacity was temporarily unavailable.
This rare triple failure resulted in reduced connectivity across zones in the region and caused multiple Google Cloud services to lose connectivity for 21 minutes. Additionally, customers using zones A and C experienced up to 15% increased latency and error rates for 16 minutes due to degraded inter-zone network connectivity while the networking devices recovered from the reboot.
Google engineers were already working on remediating the two fiber failures when they were alerted to the network disruption caused by the voltage swell via an internal monitoring alert on Tuesday, 29 October 2024 at 16:29 US/Pacific, and immediately started an investigation.
After the devices had completed rebooting, the impacted network paths were recovered and all affected Cloud Zones regained full connectivity to the network at 16:43 US/Pacific.
The majority of GCP services impacted by the issue recovered shortly thereafter. A few Cloud services experienced longer restoration times as manual actions were required in some cases to complete full recovery.
To our Google Cloud customers whose businesses were impacted during this outage, we sincerely apologize. This is not the level of quality and reliability we strive to offer, and we are taking immediate steps to improve the platform’s performance and resilience.
Root cause of device reboots
Upon review of the power management systems, Google engineers determined that there is a mismatch in voltage tolerances between the affected network equipment on the site and the UPS that supplies it. The affected network racks are designed to tolerate up to ~110% of designed voltage, while the UPS which supplies power to the network equipment is designed to tolerate up to ~120% of designed voltage.
(Updated on 04 December 2024)
We have determined that a voltage regulator for the datacenter-level UPS was enabled, which is a standard setting. This caused an additional boost during power fluctuations, pushing the voltage into the problematic 110%-120% range. The voltage regulator is necessary to ensure sufficient voltage when loads are high, but because the equipment was operating well below its load limit, the boost instead caused a deviation above expected levels.
The voltage swell event therefore produced a deviation between 110% and 120% of designed voltage, which the networking equipment’s rectifiers detected as outside their allowable range; they powered down in order to protect the equipment.
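To make the tolerance gap concrete, the sketch below models the behaviour described above with purely illustrative numbers: a swell that stays under the UPS’s ~120% ceiling but above the rectifiers’ ~110% limit passes through the UPS yet trips the rectifiers’ protection. The nominal voltage, function names, and thresholds are assumptions for illustration, not details of Google’s actual power design.
    # Illustrative sketch only: hypothetical values, not Google's actual power design.
    NOMINAL_V = 230.0     # assumed nominal supply voltage, for illustration
    RECTIFIER_MAX = 1.10  # network rack rectifiers tolerate up to ~110% of nominal
    UPS_MAX = 1.20        # upstream UPS tolerates up to ~120% of nominal

    def rectifier_response(measured_v: float) -> str:
        """Model the protective behaviour described in the report."""
        ratio = measured_v / NOMINAL_V
        if ratio <= RECTIFIER_MAX:
            return "within tolerance: rectifiers keep powering the network devices"
        if ratio <= UPS_MAX:
            # The gap between the two tolerances is the problematic 110%-120% band:
            # the UPS passes the swell through, but the rectifiers shut down to
            # protect the equipment, so the networking devices lose power and reboot.
            return "swell passed by UPS but rejected by rectifiers: devices power down"
        return "above UPS tolerance: the UPS itself would intervene"

    # A swell boosted to ~115% of nominal falls inside the mismatch window.
    print(rectifier_response(NOMINAL_V * 1.15))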
To summarize, the root cause investigation for the network device reboots impacting the australia-southeast2 region has concluded. Google teams are implementing additional preventative measures to minimize the risk of recurrence.
Remediation and Prevention
We are taking the following actions to prevent a recurrence and improve reliability in the future:
- Review our datacenter power distribution design in the region and implement any recommended additional protections for critical networking devices against voltage swells and sags.
- As an initial risk reduction measure, reconfigure the datacenter-level UPS voltage regulator; we have also instituted monthly reviews to ensure it remains configured to correctly match future site load (a simplified sketch of such a check follows this list).
- Deploy a double-conversion UPS in the affected datacenter’s power distribution design for the equipment that failed during this event.
- Implement changes to network device configuration that reduce the time to recover full network connectivity after a failure.
- Review and verify that this risk does not exist in other locations, and if so, proactively perform the above remediations.
- Root cause and determine corrective actions for GCP services that did not recover quickly from the incident after network connectivity was restored.
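As a rough illustration of the regulator review mentioned above, the following is a minimal, hypothetical sketch of an automated check such a monthly review could rely on: it flags a regulator boost that is enabled while measured site load sits well below the level at which the boost is needed. The field names, thresholds, and data sources are assumptions for illustration only, not a description of Google’s tooling.
    # Hypothetical sketch of a periodic review check; names and thresholds are
    # illustrative assumptions, not Google's actual configuration system.
    from dataclasses import dataclass

    @dataclass
    class UpsReviewInput:
        regulator_boost_enabled: bool  # is the datacenter-level UPS voltage regulator enabled?
        peak_load_pct: float           # peak site load over the review window, as % of UPS capacity
        boost_needed_above_pct: float  # load level above which the voltage boost is actually required

    def review_ups_regulator(cfg: UpsReviewInput) -> list[str]:
        """Return findings when the regulator setting does not match observed site load."""
        findings = []
        if cfg.regulator_boost_enabled and cfg.peak_load_pct < cfg.boost_needed_above_pct:
            findings.append(
                f"Boost enabled but peak load is only {cfg.peak_load_pct:.0f}% of capacity "
                f"(boost needed above {cfg.boost_needed_above_pct:.0f}%); swells may be pushed "
                "above downstream equipment tolerance."
            )
        if not cfg.regulator_boost_enabled and cfg.peak_load_pct >= cfg.boost_needed_above_pct:
            findings.append("Site load is high enough that the regulator boost may be required.")
        return findings

    # Example reflecting the condition behind this incident (illustrative numbers).
    print(review_ups_regulator(UpsReviewInput(True, peak_load_pct=35.0, boost_needed_above_pct=80.0)))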
Affected products: Apigee, Apigee Edge Public Cloud, Batch, Cloud Filestore, Cloud Firestore, Cloud Key Management Service, Cloud NAT, Cloud Run, Google BigQuery, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud Deploy, Google Cloud Networking, Google Cloud Pub/Sub, Google Cloud SQL, Google Cloud Storage, Google Compute Engine, Google Kubernetes Engine, Identity and Access Management, Persistent Disk, Resource Manager API, Virtual Private Cloud (VPC)
This is the final version of the Incident Report.