A bus stop with a Windows blue screen of death

How a disastrous software update took down 8.5 million devices

July 19th, 2024 is a day many of us will never forget. Over 5,000 flights cancelled, major American TV stations unable to broadcast, hospitals cancelling non-emergency operations… In total, 8.5 million computers were affected, causing chaos in almost every industry. It felt like a large-scale cyberattack, but the root cause was very different: a faulty software update that should never have been released.

Blue Screens of Death

The first signs of the global breakdown appeared at around 04:09 UTC on July 19th, 2024. People noticed that their computers suddenly started showing the so-called “Blue Screen of Death” (BSoD). What is it? As the name suggests, the screen displays a blue background with white text on top, showing the error message and some technical information about the crash.

The “Blue Screen of Death” is how Windows signals that the system has encountered a serious error. It can also be described as a safety mechanism that stops the computer to protect it from further damage. A Blue Screen can be caused by hardware or driver-related issues, but also by third-party software that operates at the kernel level of the operating system.

This explains why many initially believed that the world was witnessing a huge cyberattack. After all, those Blue Screens were almost everywhere: at airports, in banking systems, in hospitals. Even self-checkout machines were affected. However, two details, taken together, made a cyberattack scenario less likely:

  1. It only happened to computers running the Windows operating system;
  2. The affected machines belonged entirely to enterprises, while individuals’ computers were left untouched.

The former pointed to Microsoft, the company behind Windows, as the potential source of the breakdown. The latter argued against that, as only the devices of certain companies showed the Blue Screen of Death. You might be wondering what firms like Starbucks, Wizz Air and PayPal have in common. Not much, really, except that they (and many others) were using the same kernel-level software: CrowdStrike Falcon.

What is CrowdStrike Falcon?

CrowdStrike Falcon is the flagship product of a cybersecurity company named CrowdStrike. Released in June 2013, it serves as antivirus software equipped with advanced AI algorithms. It is a very popular choice among large enterprises, where security is a critical factor. As of now, a little more than a year after the incident, over 6,000 companies are using CrowdStrike Falcon as their cybersecurity tooling. This is confirmed by the enlyft website:

We have data on 6,159 companies that use Crowdstrike Falcon Platform. The companies using Crowdstrike Falcon Platform are most often found in United States and in the Information Technology and Services industry. Crowdstrike Falcon Platform is most often used by companies with 1000-5000 employees and >1000M dollars in revenue.

Falcon’s job is to detect potential security threats on the machine it is installed on. The software constantly scans the device in search of suspicious activity such as malware. Once such a threat is found, Falcon tries to remove it, keeping the device safe.

Kernel access

As CrowdStrike Falcon scans the whole device in real time, it requires complete access to the machine’s memory. That is why such software operates in a privileged layer called kernel mode. As opposed to regular “user mode” applications, programs operating in kernel mode have unrestricted access to all system resources.

Not much software has the privilege of operating at the kernel level. In fact, for Windows there is the Windows Hardware Quality Labs (WHQL) certification, a special Microsoft program that ensures the integrity of drivers across different Windows versions. To be certified, a piece of kernel software (a driver) has to pass a number of strict stability and safety tests to show that it can be used without issues. A driver that successfully completes this demanding process receives a WHQL certificate, and Windows will not show warnings before its installation. Certification means that the driver is trusted by Microsoft and is not expected to cause harm to the computer or its user.

On one hand, kernel access gives a lot of control and power to any software running there; on the other hand, it carries unavoidable risk. What if a program crashes due to an error or a bug? In “user mode”, only the application crashes and becomes unavailable, while the system remains untouched. However, when the same happens in kernel mode, the entire system crashes, and any unsaved data is lost. In this scenario the operating system halts to protect the device from further damage, and this is why the Blue Screen of Death appears.
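The user-mode half of this contrast is easy to try at home. The following is a minimal sketch, not anything taken from Windows or Falcon: a parent Python process starts a child that deliberately crashes (the helper name `run_and_survive` and the crashing one-liner are made up for illustration), and the parent simply carries on.

```python
import subprocess
import sys

# Child program that deliberately crashes: os.abort() kills the process abruptly.
CRASHING_PROGRAM = "import os; os.abort()"


def run_and_survive() -> int:
    """Run the crashing program as a separate process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", CRASHING_PROGRAM],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


if __name__ == "__main__":
    exit_code = run_and_survive()
    # The parent (and the rest of the system) keeps running after the crash.
    print(f"child crashed with exit code {exit_code}, parent still running")
```

The operating system cleans up after the crashed child and reports a non-zero exit code to the parent. A fault in kernel mode has no such outer layer to absorb it: the operating system itself goes down.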

That is exactly what happened to computers with CrowdStrike Falcon installed. As it turned out later, the Blue Screen was caused by an error introduced with the newest Falcon update.

Over-the-Air Update

You are probably familiar with Windows system updates. Every time Microsoft releases a new patch for Windows, you are notified in the “Start Tab” that more recent updates are available to install. However, Microsoft does not perform the update on your machine on its own. You are free to install it whenever you want.

In contrast, CrowdStrike Falcon does not wait for user permission to perform an update. Every update is done automatically, and no prior consent is required. Such updates are called Over-the-Air (OTA), as they seem to appear “out of thin air”. This approach ensures that security updates and regular program improvements are delivered quickly, at (almost) the same time for all Falcon users. It also means that updates are “silent”: the user is not notified about anything, which is very convenient for both users and CrowdStrike.

How does a kernel-mode software update relate to the validity of a WHQL certificate? Any software update could make the driver no longer compliant with the certification rules and tests, for example when a bug is introduced by mistake. Hence every update would need a certificate re-evaluation, repeating the same tedious process all over again. Following that procedure every time would delay important updates from going live, and CrowdStrike knew that. They did not want to wait; they played a risky game instead. No recertification, just instant updates.

One of these post-WHQL updates came on the fateful 19th of July 2024…

A tedious recovery

The global outage locked affected computers in the Blue Screen state, and in most cases the working solution was to reboot the computer in Safe Mode. Easier said than done, as the recovery instructions officially provided by CrowdStrike made clear.

Remember that many of these computers were used for work by non-tech-savvy people, who most likely did not have enough knowledge to perform that reboot. Even if some did, internal company restrictions would likely prohibit employees from doing it on their own. The entire fix had to be carried out manually by IT staff, a very time-consuming and annoying process, especially since most of the affected companies were medium and large businesses with hundreds or often thousands of employees.
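For context, the workaround CrowdStrike published at the time came down to booting into Safe Mode or the Windows Recovery Environment, deleting the file matching `C-00000291*.sys` from the `CrowdStrike` folder under `C:\Windows\System32\drivers`, and rebooting normally. The sketch below only makes the deletion step concrete. The function name `remove_faulty_channel_files` is hypothetical, and in practice IT staff did this by hand or from a recovery command prompt, not with Python.

```python
from pathlib import Path

# File name pattern of the faulty channel file from CrowdStrike's public guidance.
FAULTY_PATTERN = "C-00000291*.sys"
# Default Falcon driver folder on Windows; pass another directory to experiment safely.
DEFAULT_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")


def remove_faulty_channel_files(directory: Path = DEFAULT_DIR) -> list[str]:
    """Delete files matching FAULTY_PATTERN in `directory` and return their names."""
    removed = []
    for path in sorted(directory.glob(FAULTY_PATTERN)):
        path.unlink()
        removed.append(path.name)
    return removed
```

The step itself is trivial. The cost came from having to repeat it, together with a Safe Mode boot, on every single affected machine.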

Final patch and apology

To prevent more computers from being taken down by the faulty update, CrowdStrike had to ship yet another update to patch the bug. They quickly located the logic error in their update and fixed it with a new one, deployed at 05:27 UTC. Thanks to this, computers that were turned on after that time were not affected at all. “Only” the 8.5 million machines active between 04:09 and 05:27 UTC displayed the Blue Screen of Death and needed to be fixed manually, as described above.
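The length of that exposure window follows directly from the two timestamps. As a quick check, using only the times quoted above (the variable names are illustrative):

```python
from datetime import datetime, timezone

# Timestamps quoted in the article, both in UTC.
faulty_update = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
fix_deployed = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)

# Length of the window during which the faulty update was being delivered.
window = fix_deployed - faulty_update
window_minutes = int(window.total_seconds() // 60)
print(window_minutes)  # 78
```

So a machine had to be online and receive the update within a window of 78 minutes to be hit.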

The patch contained a proper fix for Channel File 291, which was the direct cause of the CrowdStrike Falcon error. This file was not used by the Linux and macOS versions of CrowdStrike Falcon, so these operating systems were not affected. It also turned out that Channel File 291 and the other channel files were not kernel drivers, meaning they were not part of the prior WHQL certification.
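Since channel files sat outside the certified driver, the only safeguard between a malformed file and the kernel was whatever checking the software reading it performed. The sketch below is purely illustrative: the `Rule` structure, the field-index check and the fallback behaviour are hypothetical and do not represent CrowdStrike's actual file format or code. It only shows the general idea of rejecting bad content and keeping the last known-good version instead of crashing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One hypothetical detection rule: match an input field against a pattern."""
    field_index: int
    pattern: str


class InvalidContentError(ValueError):
    """Raised when a content file fails validation and must not be loaded."""


def validate_rules(rules: list[Rule], available_fields: int) -> list[Rule]:
    """Return the rules unchanged if every field index is in range, else raise."""
    for position, rule in enumerate(rules):
        if not 0 <= rule.field_index < available_fields:
            raise InvalidContentError(
                f"rule {position} reads field {rule.field_index}, "
                f"but only {available_fields} fields are available"
            )
    return rules


def load_content(rules: list[Rule], available_fields: int, fallback: list[Rule]) -> list[Rule]:
    """Use the new rules only if they validate; otherwise keep the previous set."""
    try:
        return validate_rules(rules, available_fields)
    except InvalidContentError:
        return fallback
```

With three available fields, a rule set that refers to field 5 would be rejected and the previous rules kept, so a bad content file degrades protection for a while but does not bring the machine down.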

CrowdStrike clearly admitted its fault and confirmed that the outage had nothing to do with a cyberattack. George Kurtz, CrowdStrike’s CEO, apologized for the catastrophic outage the same day in a post on the company blog:

I want to sincerely apologize directly to all of you for today’s outage. All of CrowdStrike understands the gravity and impact of the situation. We quickly identified the issue and deployed a fix, allowing us to focus diligently on restoring customer systems as our highest priority.

The full apology post is available on the CrowdStrike blog.

Outage in numbers

The total time of the global breakdown, from when it started to the release of the bugfix, was approximately 78 minutes. That was enough to cause enormous damage to the global economy. To sum up this story, here is the estimated scale of the outage:

  • More than 10,000 flights cancelled or delayed;
  • 8,500,000 computers displaying the Blue Screen of Death;
  • Manually fixing the affected computers cost at least $701 million in man-hours;
  • Around 60% of Fortune 500 companies were affected, losing $5.4 billion in revenue and gross profit;
  • Worldwide financial damage of around $10 billion.

A breakdown of little more than an hour that will never be forgotten.
