
CrowdStrike outage: Y2K in 2024?
Thanks for reading this Saturday morning. Today we’re examining what was essentially Y2K +24. The CrowdStrike outage of July 19, 2024, grounded thousands of flights, delayed public transit, and impacted medical procedures at hospitals.
What happened, and more importantly, what does it say about our reliance on computer systems?
***
CrowdStrike is a prominent cybersecurity firm whose software is used by 298 of the world’s Fortune 500 companies. Because the company’s services are so widely used, its activities – and regular security updates – are deeply embedded in the Microsoft Windows operating system, and to a lesser extent in Apple and Linux systems, the technology outlet TechTarget reported.
CrowdStrike regularly delivers “sensor configuration” updates – several per day, sometimes – to keep users protected from cybersecurity threats. The updates usually happen in the background automatically to systems that are operating and connected to the internet.
The update in question, delivered at 0409 UTC on July 19, contained a flaw in its logic. It applied only to Microsoft Windows systems, and the flaw caused systems that applied the update (i.e., systems that were running and online at the time) to crash, TechTarget reported.
CrowdStrike recognized the flaw almost immediately. The company fixed it and delivered a corrected update just 78 minutes later, at 0527 UTC. By that time, though, 8.5 million Windows devices had already applied the flawed update. Critical systems – at airlines, hospitals, and other essential infrastructure – went down, causing a global catastrophe. Affected computers displayed the “blue screen of death,” leaving IT personnel unable to quickly restore their systems.
American Airlines, United, and Delta requested a global ground stop for all flights. Health care systems delayed or canceled procedures. Some 911 systems were impacted. The crisis – and news stories about it – spread quickly.
CrowdStrike’s crisis response was criticized, perhaps for good reason – hours after the flawed update went out, CrowdStrike CEO George Kurtz released a statement that did not contain any apology at all, just details about the error. The headline in Fortune read: “CEO at cybersecurity firm that caused a global outage forgot to apologize” – ouch. Kurtz was then swiftly whisked away by his public relations team and quickly appeared on national news to talk about the issue and answer questions. His immediate response, while technical and direct, was likely a symptom of prioritizing speed over perfection.
Meanwhile, CrowdStrike put together instructions in a matter of hours for customers to enact manual workarounds to rid their systems of the flawed update. It’s this process – the manual fix – that looks to have caused most of the fallout, simply because of how long the process takes.
Unwinding the damage required extensive work, sometimes on each individual machine. As one can imagine, even for sophisticated companies, that type of undertaking can take a very long time. All told, the damage may be upwards of $5 billion.
The proximate issue is largely behind us now, but it raises important questions about single points of failure that can ripple across critical systems.
Time Magazine published a smart piece harkening back to the Y2K crisis at the turn of the century. Some view that period as one of unnecessary hysteria, but the problem was real. That computer code transitioned from “1999” to “2000” without issue indicates government, business, and infrastructure leaders took appropriate action, not that no problem ever existed.
That period forced an examination of society’s reliance on computer technology. Industry and government leaders took to heart the true risks associated with widespread failures, and recognized the need for fail-safes and plan Bs.
Those lessons appear to have atrophied in the decades since. CrowdStrike is a single point of failure, something that should have been obvious to the company, its regulators, and its clients for many years.
David Brumley, a computer engineering professor at Carnegie Mellon University, told Time that other technology companies implement procedures to mitigate against the risk of flawed updates causing mass disruption. “Companies like Google will roll out updates incrementally so if the update is bad, at least it will have limited damage,” he said.
Instead, it looks like CrowdStrike pushed the update to its entire user base at the same time, making it impossible to recognize the problem before it proliferated.
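To make the idea of an incremental rollout concrete, here is a minimal sketch of how one might work. It is an illustration only, not a description of how CrowdStrike, Google, or any other company actually ships updates. The function name, the wave sizes (1, 10, 50, and 100 percent of devices), and the 2 percent failure threshold are all assumptions chosen for the example.

```python
def staged_rollout(devices, is_healthy, stage_percents=(1, 10, 50, 100),
                   max_failure_rate=0.02):
    """Push an update in growing waves and halt if a wave fails too often.

    devices          -- list of device identifiers (hypothetical)
    is_healthy       -- callable returning True if a device survives the update
    stage_percents   -- cumulative share of devices reached after each wave
    max_failure_rate -- assumed threshold above which the rollout stops
    """
    updated = 0   # devices that have received the update so far
    crashed = 0   # devices that failed after receiving it

    for percent in stage_percents:
        # Cumulative number of devices that should be updated after this wave.
        target = len(devices) * percent // 100
        wave = devices[updated:target]
        if not wave:
            continue

        failures = sum(1 for device in wave if not is_healthy(device))
        updated = target
        crashed += failures

        # Stop before the next, larger wave if this one went badly.
        if failures / len(wave) > max_failure_rate:
            return {"halted": True, "updated": updated, "crashed": crashed}

    return {"halted": False, "updated": updated, "crashed": crashed}


if __name__ == "__main__":
    fleet = list(range(10_000))
    # A flawed update that crashes every device it touches.
    print(staged_rollout(fleet, is_healthy=lambda device: False))
    # A sound update that harms nothing.
    print(staged_rollout(fleet, is_healthy=lambda device: True))
```

In this toy model, an update that crashes every machine is stopped after the first wave, so only 100 of the 10,000 devices are affected rather than the whole fleet. The tradeoff is speed: a sound update takes several waves to reach everyone.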
Congress has already called on Kurtz to appear before the House Homeland Security Committee. And FTC Chair Lina Khan said that “these incidents reveal how concentration can create fragile systems.”
Whether regulatory action results from the failure remains to be seen. But companies – especially those responsible for critical infrastructure like transportation and health care – also have a responsibility to examine their own systems to identify fail points and employ redundancies.