GitHub – Disruption with some GitHub services

Dec 20, 16:44 UTC
Resolved – This incident has been resolved.

Dec 20, 16:42 UTC
Update – This issue is related to a partner, who is working the problem and is in partial recovery.

Dec 20, 16:20 UTC
Update – We’re seeing issues related to some of our marketing pages. We are investigating.

Dec 20, 16:18 UTC
Investigating – We are currently investigating this issue.

GitHub – Live updates on pages not loading reliably

Dec 17, 16:00 UTC
Resolved – On December 17th, 2024, between 14:33 UTC and 14:50 UTC, many users experienced intermittent errors and timeouts when accessing github.com. The error rate was 8.5% on average and peaked at 44.3% of requests. The increased error rate caused a broad impact across our services, such as the inability to log in, view a repository, open a pull request, and comment on issues. The errors were caused by our web servers being overloaded as a result of planned maintenance that unintentionally caused our live updates service to fail to start. As a result of the live updates service being down, clients reconnected aggressively and overloaded our servers.

We only marked Issues as affected during this incident despite the broad impact. This oversight was due to a gap in our alerting while our web servers were overloaded. Because the engineering team was focused on restoring functionality, we did not identify the broad scope of the customer impact until the incident had already been mitigated.

We mitigated the incident by rolling back the changes from the planned maintenance to the live updates service and scaling up the service to handle the influx of traffic from WebSocket clients.

We are working to reduce the impact of the live updates service’s availability on github.com to prevent issues like this one in the future. We are also working to improve our alerting to better detect the scope of impact from incidents like this.
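
For illustration only: the reconnect storm described above is a common failure mode for WebSocket clients. A minimal sketch of client-side exponential backoff with jitter (this is not GitHub's actual live updates client; the URL and function names are placeholders) might look like this:

```typescript
// Hypothetical reconnecting WebSocket client -- a sketch only, not GitHub's
// implementation. Backoff with full jitter spreads reconnect attempts out so
// that thousands of clients do not hammer the servers at the same instant.
function connectWithBackoff(url: string, attempt = 0): void {
  const socket = new WebSocket(url);

  socket.onopen = () => {
    attempt = 0; // reset the backoff once a connection succeeds
  };

  socket.onclose = () => {
    // Exponential backoff capped at 30 seconds, with full jitter.
    const cap = Math.min(30_000, 1_000 * 2 ** attempt);
    const delay = Math.random() * cap;
    setTimeout(() => connectWithBackoff(url, attempt + 1), delay);
  };
}

// Placeholder endpoint; any real client would use its service's URL.
connectWithBackoff("wss://live-updates.example/socket");
```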

Dec 17, 15:32 UTC
Update – Issues is operating normally.

Dec 17, 15:29 UTC
Update – We have taken some mitigation steps and are continuing to investigate the issue. There was a period of wider impact on many GitHub services, such as user logins and page loads, which should now be mitigated.

Dec 17, 15:05 UTC
Update – Issues is experiencing degraded availability. We are continuing to investigate.

Dec 17, 14:53 UTC
Update – We are currently seeing live updates on some pages not working. This can impact features such as status checks and the merge button for PRs.

The current mitigation is to refresh pages manually to see the latest details.

We are working to mitigate this and will continue to provide updates as the team makes progress.

Dec 17, 14:51 UTC
Investigating – We are investigating reports of degraded performance for Issues

GitHub – Disruption with some GitHub services

Dec 6, 17:17 UTC
Resolved – Upon further investigation, the degradation in migrations in the EU was caused by an internal configuration issue, which was promptly identified and resolved. No customer migrations were impacted during this time; the issue only affected GitHub Enterprise Cloud – EU and had no impact on GitHub.com. The service is now fully operational. We are following up by improving our processes for these internal configuration changes to prevent a recurrence, and by ensuring that incidents affecting GitHub Enterprise Cloud – EU are reported on https://eu.githubstatus.com/.

Dec 6, 17:17 UTC
Update – Migrations are failing for a subset of users in the EU region with data residency. We believe we have resolved the issue and are monitoring to confirm recovery.

Dec 6, 16:58 UTC
Investigating – We are currently investigating this issue.

GitHub – Disruption with some GitHub services

Dec 4, 19:27 UTC
Resolved – On December 4th, 2024 between 18:52 UTC and 19:11 UTC, several GitHub services were degraded with an average error rate of 8%.

The incident was caused by a change to a centralized authorization service that contained an unoptimized database query. This led to an increase in overall load on a shared database cluster, resulting in a cascading effect on multiple services and specifically affecting repository access authorization checks. We mitigated the incident after rolling back the change at 19:07 UTC, fully recovering within 4 minutes.

While this incident was caught and remedied quickly, we are implementing process improvements around recognizing and reducing the risk of changes that involve high-volume authorization checks. We are also investing in broad improvements to our safe rollout process, such as improving early detection mechanisms.
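
As a rough sketch of the kind of early detection mentioned above (this is not GitHub's rollout tooling; fetchErrorRate and rollback are hypothetical hooks supplied by the caller), a canary gate that aborts a rollout when the error rate spikes could look like:

```typescript
// Hypothetical canary gate -- a sketch of "early detection" during a rollout,
// not an actual deployment system. The caller supplies how to read the error
// rate and how to roll back.
async function canaryGate(
  fetchErrorRate: () => Promise<number>, // fraction of failed requests, 0..1
  rollback: () => Promise<void>,
  threshold = 0.02,    // abort if more than 2% of requests fail
  checks = 10,         // number of observation windows
  intervalMs = 30_000, // length of each window
): Promise<boolean> {
  for (let i = 0; i < checks; i++) {
    if ((await fetchErrorRate()) > threshold) {
      await rollback(); // bail out before the change reaches full traffic
      return false;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return true; // change passed the observation window
}
```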

Dec 4, 19:26 UTC
Update – Pull Requests is operating normally.

Dec 4, 19:21 UTC
Update – Pull Requests is experiencing degraded performance. We are continuing to investigate.

Dec 4, 19:20 UTC
Update – Issues is operating normally.

Dec 4, 19:18 UTC
Update – API Requests is operating normally.

Dec 4, 19:17 UTC
Update – Webhooks is operating normally.

Dec 4, 19:11 UTC
Update – We have identified the cause of timeouts impacting users across multiple services. This change was rolled back and we are seeing recovery. We will continue to monitor for complete recovery.

Dec 4, 19:07 UTC
Update – Issues is experiencing degraded performance. We are continuing to investigate.

Dec 4, 19:05 UTC
Update – API Requests is experiencing degraded performance. We are continuing to investigate.

Dec 4, 19:05 UTC
Update – Webhooks is experiencing degraded performance. We are continuing to investigate.

Dec 4, 18:58 UTC
Investigating – We are currently investigating this issue.

GitHub – [Retroactive] Incident with Pull Requests

Dec 3, 23:30 UTC
Resolved – On December 3rd, between 23:29 and 23:43 UTC, Pull Requests experienced a brief outage, and teams have confirmed that the issue is resolved. Due to the brevity of the incident, it was not publicly statused at the time; however, an RCA will be conducted and shared in due course.

GitHub – Incident with Pull Requests and API Requests

Dec 3, 20:05 UTC
Resolved – This incident has been resolved.

Dec 3, 20:05 UTC
Update – Pull Requests is operating normally.

Dec 3, 20:04 UTC
Update – Actions is operating normally.

Dec 3, 20:02 UTC
Update – API Requests is operating normally.

Dec 3, 19:59 UTC
Update – We have taken mitigating actions and are starting to see recovery but are continuing to monitor and ensure full recovery. Some users may still see errors.

Dec 3, 19:54 UTC
Update – Some users will experience problems with certain features of pull requests, actions, issues and other areas. We are aware of the issue, know the cause, and are working on a mitigation.

Dec 3, 19:48 UTC
Investigating – We are investigating reports of degraded performance for API Requests, Actions and Pull Requests

GitHub – Disruption with some GitHub services

Dec 3, 04:39 UTC
Resolved – This incident has been resolved.

Dec 3, 04:38 UTC
Update – We saw a recurrence of the large hosted runner incident (https://www.githubstatus.com/incidents/qq1m7mqcl6zk) from 12/1/2024. We’ve applied the same mitigation and are seeing improvements. We will continue to work on a long-term solution.

Dec 3, 04:16 UTC
Update – We are investigating reports of degraded performance for Hosted Runners

Dec 3, 04:11 UTC
Investigating – We are currently investigating this issue.

GitHub – Disruption with some GitHub services

Dec 2, 01:05 UTC
Resolved – Between Dec 1 12:20 UTC and Dec 2 1:05 UTC, availability of large hosted runners for Actions was degraded due to failures in background VM provisioning jobs. Users would see workflows queued while waiting for a runner. On average, 8% of all workflows requiring large runners during the incident were affected, peaking at 37.5% of requests. There were also lower levels of intermittent queuing on Dec 1 beginning around 3:00 UTC. Standard and Mac runners were not affected.

The job failures were caused by timeouts to a dependent service in the VM provisioning flow and gaps in the jobs’ resilience to those timeouts. The incident was mitigated by circumventing the dependency as it was not in the critical path of VM provisioning.

We are making a few immediate improvements in response to this incident. We are addressing the causes of the failed calls to improve the availability of that backend service. Even with those failures, the critical path of large VM provisioning should not have been affected, so we are improving the client behavior to fail fast and to circuit-break non-critical calls. Finally, the alerting for this service was not adequate to ensure a fast response by our team in this scenario, so we are improving our automated detection to reduce our time to detection and mitigation of issues like this one in the future.
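
As a sketch of the fail-fast and circuit-breaking behavior described above (not the actual provisioning code; the class, thresholds, and fallback value are illustrative assumptions):

```typescript
// Hypothetical circuit breaker for a non-critical dependency call -- a sketch,
// not GitHub's provisioning code. After `maxFailures` consecutive failures the
// dependency is skipped entirely until `resetMs` has elapsed, so the critical
// provisioning path is never blocked on it.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private maxFailures = 3, private resetMs = 60_000) {}

  async call<T>(fn: () => Promise<T>, fallback: T): Promise<T> {
    const open =
      this.failures >= this.maxFailures &&
      Date.now() - this.openedAt < this.resetMs;
    if (open) {
      return fallback; // circuit open: fail fast, skip the dependency
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch {
      this.failures += 1;
      this.openedAt = Date.now();
      return fallback; // fail fast instead of blocking VM provisioning
    }
  }
}

// Usage sketch with a hypothetical non-critical lookup:
// const breaker = new CircuitBreaker();
// const metadata = await breaker.call(() => fetchOptionalMetadata(vmId), {});
```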

Dec 2, 00:57 UTC
Update – We’ve applied a mitigation to fix the issues with large runner jobs processing. We are seeing improvements in telemetry and are monitoring for full recovery.

Dec 2, 00:14 UTC
Update – We continue to investigate large hosted runners not picking up jobs.

Dec 1, 23:43 UTC
Update – We continue to investigate issues with large runners.

Dec 1, 23:24 UTC
Update – We’re seeing issues related to large runners not picking up jobs and are investigating.

Dec 1, 23:18 UTC
Investigating – We are currently investigating this issue.

GitHub – Incident with Codespaces

Nov 28, 07:01 UTC
Resolved – This incident has been resolved.

Nov 28, 07:00 UTC
Update – We identified the issue and applied a mitigation, and timeouts have stopped. While we are considering this incident resolved for now, we are continuing to investigate the root cause and plan to implement a permanent fix. Updates will follow as we progress.

Nov 28, 06:36 UTC
Update – We are investigating issues with timeouts in some requests in Codespaces. We will update you on mitigation progress.

Nov 28, 06:27 UTC
Investigating – We are investigating reports of degraded performance for Codespaces

GitHub – Incident with Sporadic Timeouts in Codespaces

Nov 28, 05:11 UTC
Resolved – This incident has been resolved.

Nov 28, 05:10 UTC
Update – We identified failures in two proxy servers and applied a mitigation. Since then, timeouts have ceased, and we consider the incident resolved. We will continue to monitor the situation closely and provide updates if any changes occur.

Nov 28, 04:34 UTC
Update – We are investigating some network proxy issues that may be contributing to the timeouts in a small percentage of requests in Codespaces. We will continue to investigate.

Nov 28, 04:03 UTC
Update – We are continuing to investigate issues with timeouts in a small percentage of requests in Codespaces. We will update you on mitigation progress.

Nov 28, 03:32 UTC
Update – We are investigating issues with timeouts in some requests in Codespaces. Some users may not be able to connect to their Codespaces at this time. We will update you on mitigation progress.

Nov 28, 03:29 UTC
Investigating – We are investigating reports of degraded performance for Codespaces