Google reveals how Service Control foul-up caused mega-outage that brought down the internet

Digital civilisation fell to its knees last week as services like ChatGPT, Spotify and Discord were hobbled by a Google Cloud error.

(Photo by Michael Dziedzic on Unsplash)

Much of the internet collapsed last week following a giga-outage that wiped out ChatGPT, Cloudflare and a wide range of Google Cloud services.

Now Google has revealed the reason for the gigantic (but thankfully short-lived) snafu which brought down ChatGPT, Cloudflare, Discord, Spotify and a number of apps ranging from Gmail to Calendar.

Problems with Google Cloud started at about 2pm ET on Thursday June 12 and caused mayhem across the world.

Google said the issue was caused by problems with Service Control, a core binary in its API management and control planes which checks that every API request is properly authorised.

On May 29, 2025, a new feature was added to Service Control to enable additional quota policy checks. The code change lacked appropriate error handling, so when the new code path was eventually exercised, a null pointer caused the binary to crash.

The crash was triggered when a policy change was inserted into the regional Spanner tables that Service Control uses for policies.

"Given the global nature of quota management, this metadata was replicated globally within seconds," Google explained. "This policy data contained unintended blank fields. Service Control, then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop."
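To make the failure mode concrete, here is a minimal Python sketch of how a blank policy field can crash an unguarded check. It is an illustration only, not Google's code: the `QuotaPolicy` class, its `limit` field and both functions are hypothetical, and Python's `None` stands in for the null pointer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaPolicy:
    """Hypothetical policy record; `limit` is None when the field is blank."""
    name: str
    limit: Optional[int] = None


def check_quota_unguarded(policy: QuotaPolicy, used: int) -> bool:
    # Comparing an int with None raises TypeError, the Python analogue
    # of dereferencing a null pointer.
    return used < policy.limit


def check_quota_guarded(policy: QuotaPolicy, used: int) -> bool:
    # Reject malformed policy data with a clear, catchable error instead of
    # crashing somewhere deeper in the code path.
    if policy.limit is None:
        raise ValueError(f"policy {policy.name!r} has a blank limit field")
    return used < policy.limit
```

The guarded version still refuses the bad record, but it does so with an error the caller can anticipate and handle rather than one that takes the whole process down.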

Within two minutes, Google's Site Reliability Engineering team was triaging the incident, identifying the root cause within 10 minutes and deploying a "red button" to fix the issue within 25.

"Within 40 minutes of the incident, the red-button rollout was completed, and we started seeing recovery across regions, starting with the smaller ones first," Google wrote.

Roughly three hours later, the problem was largely resolved.

How will Google prevent outages in future?

To stop a similar incident from happening again, Google will enhance the resilience and reliability of its systems by modularising Service Control’s architecture to isolate functionality and allow API requests to proceed even if certain checks fail.
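The idea of letting requests proceed when a check fails is usually called failing open. A rough Python sketch of the pattern follows; the `allow_request` wrapper and its `fail_open` parameter are made-up names for illustration, and nothing here reflects how Service Control is actually built.

```python
import logging
from typing import Callable

logger = logging.getLogger("fail_open_sketch")


def allow_request(check: Callable[[], bool], fail_open: bool = True) -> bool:
    """Run a policy check and decide what happens if the check itself breaks."""
    try:
        return check()
    except Exception:
        # The check crashed (bad data, a bug, a dependency that is down).
        # Fail open: let the request through. Fail closed: reject it.
        logger.exception("policy check failed; fail_open=%s", fail_open)
        return fail_open
```

Failing open trades strictness for availability, so it tends to suit checks such as quota enforcement better than checks where wrongly allowing a request would be a security problem.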

The company will audit all systems that rely on globally replicated data to ensure replication is incremental and verifiable, regardless of the need for near-instant consistency. All changes to critical binaries will be gated behind feature flags and disabled by default.
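Gating a change behind a flag that is off by default can be sketched in a few lines of Python. The `FLAGS` store, the `extra_quota_checks` flag name and the `handle_request` function are all assumptions made for this example rather than anything Google has described.

```python
from typing import Dict

# Hypothetical flag store. A real system would load this from configuration
# that can be changed gradually without redeploying the binary.
FLAGS: Dict[str, bool] = {}


def is_enabled(flag: str) -> bool:
    # Unknown flags are treated as off, so new code stays dormant by default.
    return FLAGS.get(flag, False)


def handle_request(used: int, limit: int) -> str:
    if is_enabled("extra_quota_checks"):
        # New code path, reachable only once the flag is switched on.
        return "allowed" if used < limit else "rejected"
    # Existing behaviour, untouched by the new feature.
    return "allowed"
```

With this arrangement, shipping the new code and turning it on become two separate steps, and the second can be rolled out slowly or reversed quickly if something goes wrong.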

Google will also improve static analysis and testing to better handle errors and default to fail-open behaviour when necessary, while enforcing randomised exponential backoff across systems.
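Randomised exponential backoff means a client waits longer after each failed attempt and adds randomness to the wait, so that many clients recovering at once do not all retry in lockstep. The Python sketch below uses the common "full jitter" variant; the function names and default values are illustrative assumptions, not Google's settings.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Pick a random delay between 0 and min(cap, base * 2**attempt) seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_backoff(fn: Callable[[], T], max_attempts: int = 5,
                      base: float = 0.5, cap: float = 30.0) -> T:
    """Call fn, retrying on any exception with randomised exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries, surface the last error
            time.sleep(backoff_delay(attempt, base, cap))
    raise RuntimeError("max_attempts must be at least 1")
```

The cap keeps the wait from growing without limit, while the random component spreads retries out over time instead of concentrating them into bursts.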

Additionally, it will strengthen both automated and human external communications to provide customers with timely, actionable information during incidents. To support business continuity, Google will ensure that its monitoring and communication infrastructure remains operational even during outages affecting Google Cloud or its primary monitoring tools.

In a statement, Google said: "We deeply apologise for the impact this outage has had. Google Cloud customers and their users trust their businesses to Google, and we will do better. We apologise for the impact this has had not only on our customers’ businesses and their users but also on the trust of our systems.

"We are committed to making improvements to help avoid outages like this moving forward."

What services and applications went down in the global mega outage?

ChatGPT was perhaps the most prominent victim of the outage - the second it suffered in the same week.

On its status page, OpenAI wrote: "We are aware of issues affecting multiple external internet providers that are impacting the availability of our services such as single sign-on (SSO) and other log-in methods. Thank you for your continued patience."

On its own status website, Cloudflare wrote: "Cloudflare’s critical Workers KV service went offline due to an outage of a 3rd party service that is a key dependency. As a result, certain Cloudflare products that rely on KV service to store and disseminate information are unavailable including:

"Cloudflare engineers are working to restore services immediately. We are aware of the deep impact this outage has caused and are working with all hands on deck to restore all services as quickly as possible."

Downdetector also reported that people were having difficulties accessing a range of other services, including Spotify and Snapchat.
