CrowdStrike has determined that the global IT outage was the result of buggy testing software that didn’t catch flawed data.
The security firm said a Content Validator bug let a template instance (a definition of how the Falcon Sensor fights threats) pass muster in spite of “problematic content data” in a file. The bad data led to an out-of-bounds memory read when delivered as Rapid Response Content, producing an exception that Windows PCs couldn’t handle without crashing.
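To illustrate the kind of failure described here, the following is a minimal sketch of a validation step and a bounds-checked read. It is not CrowdStrike's code: the names `ContentValidationError`, `validate_fields`, and `read_field` are hypothetical, and the idea that content arrives as a list of fields is an assumption made purely for illustration.

```python
from typing import Optional, Sequence


class ContentValidationError(ValueError):
    """Raised when content data does not match what the reader expects."""


def validate_fields(fields: Sequence[str], expected_count: int) -> None:
    """Reject content that supplies fewer fields than the reader will access.

    Hypothetical validator: a check like this runs before deployment, so
    malformed content is stopped instead of being shipped to clients.
    """
    if len(fields) < expected_count:
        raise ContentValidationError(
            f"expected at least {expected_count} fields, got {len(fields)}"
        )


def read_field(fields: Sequence[str], index: int) -> Optional[str]:
    """Return the field at `index`, or None if the index is out of range.

    Hypothetical defensive reader: checking the bounds at the point of use
    means bad content that slips past validation is ignored rather than
    triggering an out-of-bounds access.
    """
    if not 0 <= index < len(fields):
        return None
    return fields[index]
```

In this sketch the two checks are deliberately redundant: the validator catches bad content before release, and the reader degrades gracefully if the validator itself has a bug.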
CrowdStrike said it would prevent a repeat outage through multiple measures. It promised more sophisticated Rapid Response Content testing, more checks in the Content Validator, and improved error handling. It also planned a “staggered” deployment approach to minimize risk, along with better performance monitoring, more update details, and more control over how and when Rapid Response items arrive.
The faulty update affected about 8.5 million Windows systems, according to Microsoft. The crashes knocked out IT infrastructure at businesses worldwide, including at airports, broadcasters, and payment platforms. A manual fix was available soon afterward, but it took a while for some companies to recover. Delta Air Lines is even facing an investigation over its extended downtime.
CrowdStrike is already dealing with repercussions from the botched update. Its share price dropped 15% soon after the outage began, while CEO George Kurtz has been called to testify before Congress and explain both what happened and how the company will avoid a repeat.
While the new safeguards should help, it’s not clear why some of them weren’t present before. It’s commonplace to stagger software updates, particularly critical ones. Google often pushes updates to Android and key services gradually, for instance. The procedures wouldn’t necessarily have stopped the outage, but they might have limited the scope.
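As a rough illustration of how a staggered rollout limits scope, the sketch below assigns each machine to a stable bucket and releases an update only to buckets below the current rollout percentage. This is one common approach and not a description of how CrowdStrike or Google actually stage updates; `rollout_bucket`, `should_receive_update`, and the host ID format are hypothetical.

```python
import hashlib


def rollout_bucket(host_id: str) -> int:
    """Map a host ID to a stable bucket from 0 to 99.

    Hashing keeps the assignment deterministic, so a host stays in the
    same bucket as the rollout percentage increases.
    """
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def should_receive_update(host_id: str, rollout_percent: int) -> bool:
    """Return True if the host falls inside the current rollout stage."""
    return rollout_bucket(host_id) < rollout_percent


# Example: start with roughly 1% of hosts, then widen if no crashes are reported.
hosts = [f"host-{i}" for i in range(1000)]
first_wave = [h for h in hosts if should_receive_update(h, 1)]
```

If a flawed update reached only the first wave, the damage would be confined to that small share of machines while the rollout was paused.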