Google Cloud – RESOLVED: We are investigating elevated error rates and latency for streaming ingestion into BigQuery

Mini Incident Report

We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support or to Google Workspace Support using help article https://support.google.com/a/answer/1047213.

Incident began at 2024-12-09 09:33 and ended at 2024-12-09 11:50 (all times are US/Pacific).

(All Times US/Pacific)

Incident Start: 9 December 2024 09:24

Incident End: 9 December 2024 11:40

Duration: 2 hours, 24 minutes

Affected Services and Features:

Google BigQuery
Cloud Dataflow

Regions/Zones:

  • Google BigQuery – US multi-region
  • Cloud Dataflow – us-west1, us-east1, us-east4, us-west2 and us-central1 were the most impacted, but all Dataflow pipelines writing to the BigQuery US multi-region were likely impacted as well.

Description:

Google BigQuery experienced increased latency and elevated error rates in US multi-region for a duration of 2 hours, 24 minutes. Cloud Dataflow customers also observed elevated latency in their streaming jobs to the BigQuery US multi-region. Preliminary analysis indicates that the root cause of the issue was a sudden burst of traffic, which overloaded and slowed the backend in an availability zone. This led to aggressive retries, which in turn overloaded the frontend service. The incident was mitigated by rate-limiting requests and by evacuating the slow availability zone.

Google will complete a full IR in the following days that will provide a full root cause.

Customer Impact:

Google BigQuery

  • During the incident, customers calling the google.cloud.bigquery.v2.TableDataService.InsertAll API method may have experienced transient failures with a 5XX status code; these requests should have succeeded after retries.
  • Customers using google.cloud.bigquery.storage.v1.AppendRows may have experienced increased latency during this incident.

Cloud Dataflow

  • Customers would have experienced increased latency for streaming jobs in the us-east1, us-east4, us-west1, us-west2, and us-central1 regions.

Affected products: Google BigQuery, Google Cloud Dataflow

Affected locations: Multi-region: us, Iowa (us-central1), South Carolina (us-east1), Northern Virginia (us-east4), Oregon (us-west1), Los Angeles (us-west2)
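
The report above notes that transient 5XX failures from InsertAll should have succeeded after retries, and also that aggressive retries contributed to overloading the frontend. One common way to reconcile the two on the client side is capped exponential backoff with jitter. The following is only a minimal sketch under stated assumptions: `send` is a hypothetical callable standing in for whatever issues the request and returns an HTTP status code, the names `backoff_delays` and `call_with_retries` are illustrative, and the default values are arbitrary. It is not the Google client library API and not the mitigation Google applied.

```python
import random
import time
from typing import Callable, Iterator, Optional


def backoff_delays(max_attempts: int, base: float = 0.5, cap: float = 32.0,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one 'full jitter' delay per retry (max_attempts - 1 in total)."""
    if rng is None:
        rng = random.Random()
    for attempt in range(max_attempts - 1):
        # The upper bound doubles on each attempt and is clamped to `cap`;
        # the actual delay is drawn uniformly from [0, upper bound].
        yield rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(send: Callable[[], int], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep,
                      **backoff_kwargs) -> int:
    """Call `send` (hypothetical) until it returns a non-5XX status
    or the attempt budget is exhausted; return the last status seen."""
    delays = backoff_delays(max_attempts, **backoff_kwargs)
    while True:
        status = send()
        if not 500 <= status <= 599:
            return status  # success or a non-retryable client error
        delay = next(delays, None)
        if delay is None:
            return status  # out of attempts; surface the last 5XX
        sleep(delay)
```

Bounding the number of attempts and randomizing the delay spreads retries out over time, so that many clients failing at once are less likely to retry in lockstep against a service that is already slow.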
