Coinbase hit by cascading systems failure after "thermal event" in AWS data centre

Critical financial infrastructure went down for hours due to an overheating incident on a dreary spring evening in Virginia

Outages at centralized crypto exchanges could have catastrophic global effects

The crypto exchange Coinbase was forced to halt trading after an outage at an AWS data centre triggered cascading failures across its critical financial systems.

Last Thursday, a cooling system glitch at an AWS facility in Virginia disrupted spot trading, derivatives, and international exchange services.

Coinbase said the “thermal event” impacted hardware supporting its exchange matching engine and Kafka messaging systems, triggering simultaneous failures across infrastructure designed to maintain continuous availability during data centre incidents.

The company’s architecture relies on a primary exchange replica operating within a single AWS availability zone to minimise latency for traders.

Although Coinbase maintains distributed standby infrastructure for resilience, safeguards intended to isolate failures within the affected zone failed to contain the disruption.

As hardware failures spread through the environment, Coinbase’s distributed matching engine lost quorum (the minimum number of healthy servers or nodes that must agree with each other for a distributed system to keep operating safely), preventing the platform from safely processing orders and maintaining order books.
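As a rough illustration of the quorum concept (a generic sketch, not Coinbase's actual implementation; the node counts are hypothetical), a majority-quorum system can only keep accepting work while strictly more than half of its nodes remain healthy:

```python
# Minimal illustration of a majority-quorum check in a replicated system.
# Generic sketch only: node counts and the notion of "healthy" are hypothetical,
# not details of Coinbase's matching engine.

def has_quorum(healthy_nodes: int, total_nodes: int) -> bool:
    """A majority quorum requires strictly more than half of all nodes."""
    return healthy_nodes > total_nodes // 2

# Example: a hypothetical 5-node cluster.
total = 5
for healthy in range(total + 1):
    status = "can keep processing orders" if has_quorum(healthy, total) else "must halt"
    print(f"{healthy}/{total} nodes healthy -> {status}")
```

Once enough nodes in the affected zone dropped out to break that majority, halting was the safe behaviour: continuing without quorum risks divergent order books.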

Trading activity across retail, advanced and institutional products was temporarily halted as a result.

At the same time, Kafka clusters responsible for coordinating messaging between Coinbase systems also failed, despite being designed to remain operational during data centre outages.
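Kafka-style messaging systems typically survive broker loss by replicating each partition across several brokers, ideally spread across availability zones, and refusing writes once too few in-sync copies remain. The toy model below (broker names, zones and thresholds are illustrative assumptions, not Coinbase's topology) shows why losing several brokers at once can stop writes even with redundancy in place:

```python
# Toy model of Kafka-style partition replication. All names and numbers are
# illustrative; this is not Coinbase's configuration.

REPLICATION_FACTOR = 3      # copies of each partition
MIN_INSYNC_REPLICAS = 2     # writes need at least this many live, in-sync copies

# Hypothetical placement of one partition's replicas across brokers and zones.
replicas = {
    "broker-1": "us-east-1a",
    "broker-2": "us-east-1b",
    "broker-3": "us-east-1c",
}

def partition_writable(failed_brokers: set[str]) -> bool:
    """A partition accepts writes only while enough replicas remain in sync."""
    in_sync = [b for b in replicas if b not in failed_brokers]
    return len(in_sync) >= MIN_INSYNC_REPLICAS

print(partition_writable({"broker-1"}))               # True: two replicas survive
print(partition_writable({"broker-1", "broker-2"}))   # False: below the in-sync minimum
```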

The company said the recovery process required extensive manual intervention, including failovers to new hardware brokers and disaster recovery procedures involving many terabytes of replicated data.

Engineers used automated tooling to drain workloads from about 10 Kubernetes clusters to stabilise internal systems, restoring most services within roughly 30 minutes of diagnosing the issue.
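Draining in Kubernetes terms generally means marking nodes unschedulable (cordoning) and then evicting their pods so workloads reschedule onto healthy capacity. A minimal sketch using the official Kubernetes Python client is shown below; the zone label value and node selection are placeholder assumptions, and Coinbase's internal tooling is not public:

```python
# Minimal sketch of cordoning nodes with the official Kubernetes Python client
# (pip install kubernetes). Zone label value and selection criteria are
# placeholders; this is not Coinbase's tooling.
from kubernetes import client, config

def cordon_node(v1: client.CoreV1Api, node_name: str) -> None:
    """Mark a node unschedulable so no new pods are placed on it."""
    v1.patch_node(node_name, {"spec": {"unschedulable": True}})

def main() -> None:
    config.load_kube_config()  # or load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    # Hypothetical selector for nodes in the affected availability zone.
    nodes = v1.list_node(label_selector="topology.kubernetes.io/zone=us-east-1a")
    for node in nodes.items:
        cordon_node(v1, node.metadata.name)
        # A full drain would then evict each pod on the node so it can
        # reschedule elsewhere before the hardware is taken out of service.

if __name__ == "__main__":
    main()
```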

However, restoring the exchange infrastructure itself and recovering Kafka partitions required more extensive remediation under constrained infrastructure conditions.

Coinbase said no customer data was lost, though users experienced delayed balance updates and temporary loss of access to trading services while systems recovered.

Finance and fragility

The incident is a reminder of how fragile tightly interconnected modern financial systems have become: a cooling failure affecting a subset of racks in a single building cascaded into disruptions across major crypto trading systems, despite multiple layers of redundancy and disaster-recovery planning.

Brian Armstrong, chief executive of Coinbase, acknowledged that the outage revealed weaknesses in the company’s resilience strategy.

"We design our services to be redundant to downtime in any one AWS Availability Zone (AZ), and most of our systems worked this way last night, but not all," Armstrong said.

READ MORE: Jeffrey Epstein, Satoshi Nakamoto and the strange death of the Bitcoin dream

"Our centralized exchange did not. Exchanges have unique architectures that optimize for latency and co-location of clients.

"It is possible to make exchanges resistant to AZ failures, but this can introduce latency delays that are not desirable along with breaking customer co-location."

Armstrong said Coinbase would reassess those architectural trade-offs following the incident and seek to reduce the time required to move exchange infrastructure between availability zones during future failures.

The Coinbase post-mortem

Rob Witoff, Coinbase’s head of platform, revealed that internal monitoring systems first detected cascading quote failures shortly before midnight UTC, rapidly escalating into multiple Sev1 (severity 1) incidents affecting customer-facing services.

He said simultaneous failures in exchange hardware and Kafka infrastructure forced engineers to perform manual disaster-recovery procedures after systems intended to isolate the outage failed to behave as expected.

"The team built, tested, deployed, and validated the fix while continuing to manage the broader incident," Witoff said.

"What went right: the team. Incident response across the company came together within minutes, followed well-rehearsed playbooks and used secure automation tooling to recover all services.

"We have a strong, senior team at Coinbase that worked through rare failure modes to recover all services."

The agentic threat to systemic stability

The incident is concerning because it demonstrates how seemingly minor physical failures can trigger disproportionately large systemic consequences.

Modern infrastructure has become so tightly coupled that localised faults increasingly propagate outward into broader operational failures. The most worrying point is that things are about to get even more fragile.

For decades, software failures were at least partially constrained by human bottlenecks. Humans reviewed transactions, checked alerts, approved deployments and noticed when systems behaved strangely.

Agentic AI is beginning to change that equation.

Autonomous systems are increasingly being granted the ability to make decisions, invoke tools, modify configurations, execute workflows and interact with other systems with minimal human oversight.

READ MORE: Bank of England warns of risks lurking in “opaque and hidden corners” of the financial system

Today, a cooling failure inside a hyperscale data centre can disrupt global trading infrastructure. Tomorrow, similar incidents may intersect with AI-driven cloud orchestration systems, autonomous cybersecurity tooling, automated operational workflows and machine-speed financial decision engines in ways humans struggle to model or fully understand.

In traditional software systems, failures were often contained. In highly autonomous systems, failures may amplify themselves, with small faults propagating across agents, tools and workflows until local instability becomes systemic disruption.

A corrupted input, hallucinated assumption or faulty automated action could potentially spread across thousands of interconnected processes before human operators fully understand what is happening.

The result is a world where seemingly mundane failures - a cooling malfunction, a software update, a corrupted API response - can trigger consequences wildly disproportionate to their original cause.

Meanwhile, humans will be left struggling to understand what happened, because the systems involved have become too complex for any individual to fully comprehend.

Which means, as W.B. Yeats might have said, that things will almost certainly fall apart because the (data) centres cannot hold.
