Coinbase hit by cascading systems failure after "thermal event" in AWS data centre

Critical financial infrastructure went down for hours due to an overheating incident on a dreary spring evening in Virginia

Outages at centralized crypto exchanges could have catastrophic global effects

The crypto exchange Coinbase was forced to halt trading after an outage at an AWS data centre triggered cascading failures across its critical financial systems.

Last Thursday, a cooling system glitch at an AWS facility in Virginia disrupted spot trading, derivatives, and international exchange services.

Coinbase said the “thermal event” impacted hardware supporting its exchange matching engine and Kafka messaging systems, triggering simultaneous failures across infrastructure designed to maintain continuous availability during data centre incidents.

The company’s architecture relies on a primary exchange replica operating within a single AWS availability zone to minimise latency for traders.

Although Coinbase maintains distributed standby infrastructure for resilience, safeguards intended to isolate failures within the affected zone failed to contain the disruption.

As hardware failures spread through the environment, Coinbase’s distributed matching engine lost quorum (the minimum number of healthy servers or nodes that must agree with each other for a distributed system to keep operating safely), preventing the platform from safely processing orders and maintaining order books.
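As a rough illustration of the quorum concept (a generic sketch, not Coinbase's actual implementation; the node counts are hypothetical), a majority-quorum system can only keep accepting work while strictly more than half of its nodes remain healthy:

```python
# Minimal illustration of a majority-quorum check in a replicated system.
# Generic sketch only: node counts and the notion of "healthy" are hypothetical,
# not details of Coinbase's matching engine.

def has_quorum(healthy_nodes: int, total_nodes: int) -> bool:
    """A majority quorum requires strictly more than half of all nodes."""
    return healthy_nodes > total_nodes // 2

# Example: a hypothetical 5-node cluster.
total = 5
for healthy in range(total + 1):
    status = "can keep processing orders" if has_quorum(healthy, total) else "must halt"
    print(f"{healthy}/{total} nodes healthy -> {status}")
```

Once enough nodes in the affected zone dropped out to break that majority, halting was the safe behaviour: continuing without quorum risks divergent order books.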

Trading activity across retail, advanced and institutional products was temporarily halted as a result.

At the same time, Kafka clusters responsible for coordinating messaging between Coinbase systems also failed, despite being designed to remain operational during data centre outages.
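Kafka-style messaging systems typically survive broker loss by replicating each partition across several brokers, ideally spread across availability zones, and refusing writes once too few in-sync copies remain. The toy model below (broker names, zones and thresholds are illustrative assumptions, not Coinbase's topology) shows why losing several brokers at once can stop writes even with redundancy in place:

```python
# Toy model of Kafka-style partition replication. All names and numbers are
# illustrative; this is not Coinbase's configuration.

REPLICATION_FACTOR = 3      # copies of each partition
MIN_INSYNC_REPLICAS = 2     # writes need at least this many live, in-sync copies

# Hypothetical placement of one partition's replicas across brokers and zones.
replicas = {
    "broker-1": "us-east-1a",
    "broker-2": "us-east-1b",
    "broker-3": "us-east-1c",
}

def partition_writable(failed_brokers: set[str]) -> bool:
    """A partition accepts writes only while enough replicas remain in sync."""
    in_sync = [b for b in replicas if b not in failed_brokers]
    return len(in_sync) >= MIN_INSYNC_REPLICAS

print(partition_writable({"broker-1"}))               # True: two replicas survive
print(partition_writable({"broker-1", "broker-2"}))   # False: below the in-sync minimum
```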

The company said the recovery process required extensive manual intervention, including failovers to new hardware brokers and disaster recovery procedures involving many terabytes of replicated data.

Engineers used automated tooling to drain workloads from about 10 Kubernetes clusters to stabilise internal systems, restoring most services within roughly 30 minutes of diagnosing the issue.
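Draining in Kubernetes terms generally means marking nodes unschedulable (cordoning) and then evicting their pods so workloads reschedule onto healthy capacity. A minimal sketch using the official Kubernetes Python client is shown below; the zone label value and node selection are placeholder assumptions, and Coinbase's internal tooling is not public:

```python
# Minimal sketch of cordoning nodes with the official Kubernetes Python client
# (pip install kubernetes). Zone label value and selection criteria are
# placeholders; this is not Coinbase's tooling.
from kubernetes import client, config

def cordon_node(v1: client.CoreV1Api, node_name: str) -> None:
    """Mark a node unschedulable so no new pods are placed on it."""
    v1.patch_node(node_name, {"spec": {"unschedulable": True}})

def main() -> None:
    config.load_kube_config()  # or load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    # Hypothetical selector for nodes in the affected availability zone.
    nodes = v1.list_node(label_selector="topology.kubernetes.io/zone=us-east-1a")
    for node in nodes.items:
        cordon_node(v1, node.metadata.name)
        # A full drain would then evict each pod on the node so it can
        # reschedule elsewhere before the hardware is taken out of service.

if __name__ == "__main__":
    main()
```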

However, restoring the exchange infrastructure itself and recovering Kafka partitions required more extensive remediation under constrained infrastructure conditions.

Coinbase said no customer data was lost, though users experienced delayed balance updates and temporary loss of access to trading services while systems recovered.

Finance and fragility

The incident is a reminder of how fragile tightly interconnected modern financial systems have become: a cooling failure affecting a subset of racks in a single building cascaded into disruptions across major crypto trading systems, despite multiple layers of redundancy and disaster-recovery planning.

Brian Armstrong, chief executive of Coinbase, acknowledged that the outage revealed weaknesses in the company’s resilience strategy.

"We design our services to be redundant to downtime in any one AWS Availability Zone (AZ), and most of our systems worked this way last night, but not all," Armstrong said.

READ MORE: Jeffrey Epstein, Satoshi Nakamoto and the strange death of the Bitcoin dream

"Our centralized exchange did not. Exchanges have unique architectures that optimize for latency and co-location of clients.

"It is possible to make exchanges resistant to AZ failures, but this can introduce latency delays that are not desirable along with breaking customer co-location."

Armstrong said Coinbase would reassess those architectural trade-offs following the incident and seek to reduce the time required to move exchange infrastructure between availability zones during future failures.

The Coinbase post-mortem

Rob Witoff, Coinbase’s head of platform, revealed that internal monitoring systems first detected cascading quote failures shortly before midnight UTC, rapidly escalating into multiple Sev1 (severity 1) incidents affecting customer-facing services.

He said simultaneous failures in exchange hardware and Kafka infrastructure forced engineers to perform manual disaster-recovery procedures after systems intended to isolate the outage failed to behave as expected.

"The team built, tested, deployed, and validated the fix while continuing to manage the broader incident," Witoff said.

"What went right: the team. Incident response across the company came together within minutes, followed well-rehearsed playbooks and used secure automation tooling to recover all services.

"We have a strong, senior team at Coinbase that worked through rare failure modes to recover all services."

The agentic threat to systemic stability

The incident is concerning because it demonstrates how seemingly minor physical failures can trigger disproportionately large systemic consequences.

Modern infrastructure has become so tightly coupled that localised faults increasingly propagate outward into broader operational failures. The most worrying point is that things are about to get even more fragile.

For decades, software failures were at least partially constrained by human bottlenecks. Humans reviewed transactions, checked alerts, approved deployments and noticed when systems behaved strangely.

Agentic AI is beginning to change that equation.

Autonomous systems are increasingly being granted the ability to make decisions, invoke tools, modify configurations, execute workflows and interact with other systems with minimal human oversight.

READ MORE: Bank of England warns of risks lurking in “opaque and hidden corners” of the financial system

Today, a cooling failure inside a hyperscale data centre can disrupt global trading infrastructure. Tomorrow, similar incidents may intersect with AI-driven cloud orchestration systems, autonomous cybersecurity tooling, automated operational workflows and machine-speed financial decision engines in ways humans struggle to model or fully understand.

In traditional software systems, failures were often contained. In highly autonomous systems, failures may amplify themselves, with small faults propagating across agents, tools and workflows until local instability becomes systemic disruption.

A corrupted input, hallucinated assumption or faulty automated action could potentially spread across thousands of interconnected processes before human operators fully understand what is happening.

The result is a world where seemingly mundane failures - a cooling malfunction, a software update, a corrupted API response - can trigger consequences wildly disproportionate to their original cause.

Meanwhile, humans will be left struggling to understand what happened, because the systems involved have become too complex for any individual to fully comprehend.

Which means, as W.B. Yeats might have said, that things will almost certainly fall apart because the (data) centres cannot hold.
