SendGrid – Mail send failing with the error ‘We cannot allow you to access SendGrid because you appear to be located in an embargoed region’

Dec 4, 21:58 PST
Resolved – Our engineers have monitored the fix and confirmed the issue has been resolved. All services are now operating normally.

Dec 4, 21:44 PST
Update – We are continuing to monitor for any further issues.

Dec 4, 21:35 PST
Monitoring – Starting around 01:23 PM PST on 12-04-2024, our engineers began investigating an issue with mail send. Some customers may have experienced the error ‘We cannot allow you to access SendGrid because you appear to be located in an embargoed region’ when trying to send emails. Our engineers are implementing a fix and are monitoring system performance.

Rovo – AI features degraded

Dec 5, 02:38 UTC
Resolved – Between 12:13 AM UTC and 01:30 AM UTC, we experienced degraded performance for AI features in Confluence, Jira Service Management, Jira, and Rovo. The issue has been resolved and the service is operating normally.

Dec 5, 01:32 UTC
Investigating – We are investigating an issue with AI features that is impacting all Rovo customers as well as Enterprise and Premium customers of Confluence, Jira Service Management, and Jira. We will provide more details within the next hour.

Jira Service Management – AI features degraded

Dec 5, 02:38 UTC
Resolved – Between 12:13 AM UTC and 01:30 AM UTC, we experienced degraded performance for AI features in Confluence, Jira Service Management, Jira, and Rovo. The issue has been resolved and the service is operating normally.

Dec 5, 01:32 UTC
Investigating – We are investigating an issue with AI features that is impacting all Rovo customers as well as Enterprise and Premium customers of Confluence, Jira Service Management, and Jira. We will provide more details within the next hour.

Confluence – AI features degraded

Dec 5, 02:38 UTC
Resolved – Between 12:13 AM UTC and 01:30 AM UTC, we experienced degraded performance for AI features in Confluence, Jira Service Management, Jira, and Rovo. The issue has been resolved and the service is operating normally.

Dec 5, 01:32 UTC
Investigating – We are investigating an issue with AI features that is impacting all Rovo customers as well as Enterprise and Premium customers of Confluence, Jira Service Management, and Jira. We will provide more details within the next hour.

Jira – AI features degraded

Dec 5, 02:38 UTC
Resolved – Between 12:13 AM UTC and 01:30 AM UTC, we experienced degraded performance for AI features in Confluence, Jira Service Management, Jira, and Rovo. The issue has been resolved and the service is operating normally.

Dec 5, 01:32 UTC
Investigating – We are investigating an issue with AI features that is impacting all Rovo customers as well as Enterprise and Premium customers of Confluence, Jira Service Management, and Jira. We will provide more details within the next hour.

OpenAI – API & ChatGPT Performance Degradation

Dec 4, 18:21 PST
Resolved – This incident has been resolved.

Dec 4, 18:00 PST
Update – We are continuing to monitor for any further issues. Please contact support via help.openai.com if any issues persist with ChatGPT or the API.

Dec 4, 17:49 PST
Update – The issue has been mitigated and we are continuing to monitor and verify that the entire system is returning to full operation.

Dec 4, 16:29 PST
Update – The issue has reappeared and may be affecting both the API and ChatGPT. We are investigating.

Dec 4, 16:10 PST
Update – We are continuing to monitor for any further issues.

Dec 4, 16:08 PST
Monitoring – We experienced a brief period of degraded API performance from approximately 3:45 PM to 3:50 PM PT. Performance has stabilized and we are currently monitoring. We will post an update once we have confirmed the issue has been fully resolved.

GitLab – GitLab-hosted runners with the gitlab-org-docker tag are offline

December 4, 2024 19:38 UTC
Investigating – Jobs tagged with the “gitlab-org-docker” tag are stuck in a “Pending” status as the runners are currently offline. Please see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18945 for further details.

December 4, 2024 19:53 UTC
Investigating – The “gitlab-org-docker” tag is meant for gitlab-org projects only and not for customer workloads. As a preliminary potential fix, please remove the tag from your affected jobs and retry them. See: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18945
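
For reference, job tags are configured in the job's tags: list in .gitlab-ci.yml. The following is a minimal, hypothetical Python sketch (placeholder project ID and token; endpoint and fields per the GitLab Jobs REST API) for listing pending jobs in your project that still carry the affected tag, so you know which job definitions to update before re-running them:

    # Hypothetical helper: list pending jobs that still use the affected tag.
    # PROJECT_ID and TOKEN are placeholders for your own values.
    import requests

    GITLAB_URL = "https://gitlab.com/api/v4"
    PROJECT_ID = "12345"      # replace with your project ID
    TOKEN = "glpat-..."       # replace with a token that can read CI jobs

    resp = requests.get(
        f"{GITLAB_URL}/projects/{PROJECT_ID}/jobs",
        headers={"PRIVATE-TOKEN": TOKEN},
        params={"scope[]": "pending", "per_page": 100},
        timeout=30,
    )
    resp.raise_for_status()

    for job in resp.json():
        if "gitlab-org-docker" in job.get("tag_list", []):
            print(f"Job {job['id']} ({job['name']}) is pending and uses the affected tag")

Depending on how the tag was applied, a fresh pipeline run may be needed for edits to .gitlab-ci.yml to take effect.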

December 4, 2024 20:10 UTC
Investigating – We continue to investigate our Runner infrastructure to determine the cause of the issue. Please review https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18945 for full details.

December 4, 2024 20:46 UTC
Investigating – We have found traces of connectivity issues in our Runner network infrastructure. We continue our investigation. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18945 for details.

December 4, 2024 21:10 UTC
Investigating – We have potentially identified the commit that caused this disruption in the Runner network. Investigation continues. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18945 for details.

December 4, 2024 21:29 UTC
Monitoring – We have pushed a potential fix and see signs of recovery from the affected Runners. We will continue to monitor this to ensure jobs are properly picked up. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18945.

December 4, 2024 22:19 UTC
Monitoring – Runner performance metrics are back to normal levels and jobs are being properly picked up. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18945 for the full details.

December 4, 2024 22:20 UTC
Resolved – This incident is now resolved. Please make sure not to use the “gitlab-org-docker” tag for your workloads, as it is intended for gitlab-org projects only. See https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18945.

Google Cloud – RESOLVED: All the impacted GCP products in australia-southeast2 have recovered.

Incident Report

Summary

On Tuesday, 29 October 2024 at 16:21 US/Pacific, a power voltage swell event impacted the Network Point of Presence (POP) infrastructure in one of the campuses supporting the australia-southeast2 region, causing networking devices to unexpectedly reboot. The impacted network paths were recovered post reboot and connectivity was restored at 16:43 US/Pacific.

Incident began at 2024-10-29 16:21 and ended at 2024-10-29 16:43 (all times are US/Pacific).

Impact

GCP operates multiple PoPs in the australia-southeast2 region to connect Cloud zones to each other and to Google’s Global Network. Google maintains sufficient capacity to ensure that occasional losses of network capacity are not noticeable to customers or cause only minimal disruption.

On Tuesday, 29 October 2024, two fiber failures had occurred near this region. These failures reduced the available inter-region network capacity, but did not impact the availability of GCP services in the region.

Later that same day, the PoP hosting 50% of the region’s network capacity experienced a power voltage swell event that caused the networking devices in the PoP to reboot. While those devices rebooted, that networking capacity was temporarily unavailable.

This rare triple failure resulted in reduced connectivity across zones in the region and caused multiple Google Cloud services to lose connectivity for 21 minutes. Additionally, customers using zones A and C experienced up to 15% increased latency and error rates for 16 minutes due to degraded inter-zone network connectivity while the networking devices recovered from the reboot.

Google engineers were already working on remediating the two fiber failures when they were alerted to the network disruption caused by the voltage swell via an internal monitoring alert at 16:29 US/Pacific on Tuesday, 29 October 2024, and immediately started an investigation.

After the devices had completed rebooting, the impacted network paths were recovered and all affected Cloud zones regained full connectivity to the network at 16:43 US/Pacific.

The majority of GCP services impacted by the issue recovered shortly thereafter. A few Cloud services experienced longer restoration times, as manual actions were required in some cases to complete full recovery.

To our Google Cloud customers whose businesses were impacted during this outage, we sincerely apologize. This is not the level of quality and reliability we strive to offer, and we are taking immediate steps to improve the platform’s performance and resilience.

Root cause of device reboots

Upon review of the power management systems, Google engineers have determined there is a mismatch in voltage tolerances in the affected network equipment on the site. The affected network racks are currently designed to tolerate up to ~110% of designed voltage, but the UPS which supplies the power to the network equipment is designed to tolerate up to ~120% of designed voltage.

(Updated on 04 December 2024)

We have determined that a voltage regulator for the datacenter-level UPS was enabled, which is a standard setting. This caused an additional boost during power fluctuations, pushing the voltage into the problematic 110%-120% range. The voltage regulator is necessary to ensure sufficient voltage when loads are high, but because the equipment is well below its load limit, the boost pushed the voltage above expected levels.

The voltage swell event caused a deviation between 110% and 120% of designed voltage, which the networking equipment’s rectifiers detected as outside their allowable range; they powered down in order to protect the equipment.
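
As a purely illustrative calculation (the report does not state the nominal supply voltage; 230 V, the Australian nominal, is assumed here only to make the mismatch concrete), a swell in the gap between the two tolerances passes through the UPS but exceeds what the rack equipment accepts:

    \[
    1.10 \times 230\,\mathrm{V} \approx 253\,\mathrm{V} \quad\text{(rack equipment limit)}, \qquad
    1.20 \times 230\,\mathrm{V} = 276\,\mathrm{V} \quad\text{(UPS limit)}
    \]

Under that assumption, any swell between roughly 253 V and 276 V is within the UPS’s tolerance and is passed through to the racks, yet it exceeds the rectifiers’ allowable input range, which is why they shut down.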

To summarize, the root cause investigation for the network device reboots impacting the australia-southeast2 region has been concluded. Google teams are implementing additional preventative measures to minimize the risk of recurrence.

Remediation and Prevention

We are taking the following actions to prevent a recurrence and improve reliability in the future:

  • Review our datacenter power distribution design in the region and implement any recommended additional protections for critical networking devices against voltage swells and sags.
  • As an initial risk reduction measure, reconfigure the datacenter-level UPS voltage regulator; monthly reviews have been instituted to ensure it is configured to correctly match future site load.
  • Deploy a double conversion UPS in the affected datacenter’s power distribution design for the equipment that failed during this event.
  • Implement changes to network device configuration that reduce the time to recover full network connectivity after a failure.
  • Review and verify that this risk does not exist in other locations, and if it does, proactively perform the above remediations.
  • Root-cause and determine corrective actions for GCP services that did not recover quickly from the incident after network connectivity was restored.

This is the final version of the Incident Report.

Affected products: Apigee, Apigee Edge Public Cloud, Batch, Cloud Filestore, Cloud Firestore, Cloud Key Management Service, Cloud NAT, Cloud Run, Google BigQuery, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud Deploy, Google Cloud Networking, Google Cloud Pub/Sub, Google Cloud SQL, Google Cloud Storage, Google Compute Engine, Google Kubernetes Engine, Identity and Access Management, Persistent Disk, Resource Manager API, Virtual Private Cloud (VPC)

Affected locations: Melbourne (australia-southeast2)

GitHub – Disruption with some GitHub services

Dec 4, 19:27 UTC
Resolved – On December 4th, 2024 between 18:52 UTC and 19:11 UTC, several GitHub services were degraded with an average error rate of 8%.

The incident was caused by a change to a centralized authorization service that contained an unoptimized database query. This led to an increase in overall load on a shared database cluster, resulting in a cascading effect on multiple services and specifically affecting repository access authorization checks. We mitigated the incident after rolling back the change at 19:07 UTC, fully recovering within 4 minutes.

While this incident was caught and remedied quickly, we are implementing process improvements around recognizing and reducing the risk of changes involving high-volume authorization checks. We are also investing in broad improvements to our safe rollout process, such as improved early detection mechanisms.
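
The failure mode described above (an unoptimized query in a high-volume authorization path putting excess load on a shared database cluster) is often a per-item, or "N+1", lookup pattern. The following Python sketch is purely illustrative, with a hypothetical schema and function names rather than GitHub's actual code; it only shows why batching such checks keeps per-request database load flat:

    # Illustrative only: a per-item ("N+1") authorization check vs. a batched one.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE repo_access (user_id INTEGER, repo_id INTEGER)")
    conn.executemany("INSERT INTO repo_access VALUES (?, ?)",
                     [(1, r) for r in range(0, 1000, 2)])  # user 1 may read even-numbered repos

    def authorized_naive(user_id, repo_ids):
        # One query per repository: database load grows with the number of items checked.
        return {
            r: conn.execute(
                "SELECT 1 FROM repo_access WHERE user_id = ? AND repo_id = ?",
                (user_id, r),
            ).fetchone() is not None
            for r in repo_ids
        }

    def authorized_batched(user_id, repo_ids):
        # A single query for the whole batch keeps per-request database load roughly constant.
        placeholders = ",".join("?" * len(repo_ids))
        rows = conn.execute(
            f"SELECT repo_id FROM repo_access WHERE user_id = ? AND repo_id IN ({placeholders})",
            (user_id, *repo_ids),
        ).fetchall()
        allowed = {row[0] for row in rows}
        return {r: r in allowed for r in repo_ids}

    print(authorized_naive(1, [1, 2, 3]))    # {1: False, 2: True, 3: False}
    print(authorized_batched(1, [1, 2, 3]))  # same answer from a single query

The point is only the shape of the fix: the batched variant issues one query regardless of how many repositories are checked.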

Dec 4, 19:26 UTC
Update – Pull Requests is operating normally.

Dec 4, 19:21 UTC
Update – Pull Requests is experiencing degraded performance. We are continuing to investigate.

Dec 4, 19:20 UTC
Update – Issues is operating normally.

Dec 4, 19:18 UTC
Update – API Requests is operating normally.

Dec 4, 19:17 UTC
Update – Webhooks is operating normally.

Dec 4, 19:11 UTC
Update – We have identified the cause of timeouts impacting users across multiple services. This change was rolled back and we are seeing recovery. We will continue to monitor for complete recovery.

Dec 4, 19:07 UTC
Update – Issues is experiencing degraded performance. We are continuing to investigate.

Dec 4, 19:05 UTC
Update – API Requests is experiencing degraded performance. We are continuing to investigate.

Dec 4, 19:05 UTC
Update – Webhooks is experiencing degraded performance. We are continuing to investigate.

Dec 4, 18:58 UTC
Investigating – We are currently investigating this issue.