The Schneier on Security blog is a great source of security news from an expert in the field. But I have to disagree with his take in "The CrowdStrike Outage and Market-Driven Brittleness".
Thanks to @Car for linking to it.
Frankly, this passage from the article is partially true but mostly wrong:
This brittleness is a result of market incentives. In enterprise computing—as opposed to personal computing—a company that provides computing infrastructure to enterprise networks is incentivized to be as integral as possible, to have as deep access into their customers’ networks as possible, and to run as leanly as possible.
Redundancies are unprofitable. Being slow and careful is unprofitable.
Redundancies are simply prudent. What we see with CrowdStrike isn't a problem of the free market. It is a problem with the fiat, high-time-preference economy. Because of cheap credit, companies like CrowdStrike are pushed to move faster than they would in a functioning market with free-floating interest rates.
Even with the market we have today, there are competitors to CrowdStrike that operate differently. This is the market working: it exposed a mistake in this company's processes, and a glaring one. It remains to be seen if the leadership at these companies will make changes in response to this failure.
My guess is that CrowdStrike is going to get sued in a class action suit. Curious what @siggy47 thinks. The company seems to be in damage control mode now, not providing much, if any, new info about what happened. The news industry is ill-equipped to report on the episode, so it is left to people like Schneier and other industry experts to speculate.
Schneier is right about one thing, though: many industries are optimized for efficiency, not resilience. We saw this when the global lockdowns were imposed. Industry had optimized for throughput, not resilience.
Honestly, the reliance on Microsoft Windows is the most glaring problem with private and government US infrastructure. This will not be the last time something like this happens, and it will probably never be known if this was just a mistake or a bad actor within CrowdStrike. Regardless of which it was, this event shows just how vulnerable these companies and, even more importantly, their industries are.
People will try to say government needs to take action, but I would argue it is precisely government interventions that have led to where we are today. The government micromanages both the banking and airline industries, giving them bailouts when poorly run companies should be allowed to fail. If they were allowed to fail, they would be bought by new people who would have the chance to improve the operations. This natural market healing process, not that different from a grass fire, would lead to more resilient industries.
Americans need to realize how much like the U.S.S.R. we have become. Central planners, or as @Car likes to call them, communists, are all over the place and are slowly destroying this country. At its core, this event was caused by centralization. A single provider is being used by many companies. A single operating system (Windows) is being used when superior systems are available for Internet-facing operations.
Notice which industry was, for the most part, NOT affected: the tech sector. Bruce mentions Netflix's "Chaos Monkey" tool. So apparently this isn't the fault of the free market. An industry that is more driven by market forces doesn't tend to operate like those older industries. Which industries were most affected? Banking, health care, and air travel. These three are some of the most regulated and least free-market industries. Banking and air travel have been bailed out multiple times. That's not a coincidence.
Many have said this event could have had much less of an impact had CrowdStrike just rolled out updates using a canary deployment approach. That approach works like this:
- Incremental rollout: A small portion of the fleet (typically 5%-10%) is updated to the new version, while the remaining users continue to use the existing version.
- Monitoring and analysis: The performance of the new version is closely monitored, and metrics are collected to assess its behavior, such as user feedback, error rates, and system resource utilization.
- Gradual expansion: If the new version performs well, it is gradually rolled out to a larger percentage of users, until it reaches 100%.
- Rollback capability: If issues arise, the deployment can be rolled back to the previous version, minimizing downtime and impact on users.
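
The four steps above can be sketched in a few lines of Python. This is only an illustration of the general technique, not a description of CrowdStrike's (or anyone's) actual pipeline. The function name `canary_rollout`, the `update` and `rollback` callbacks, the stage fractions, and the 1% failure threshold are all assumptions made up for the example.

```python
from typing import Callable, Sequence


def canary_rollout(
    hosts: Sequence[str],
    update: Callable[[str], bool],    # hypothetical: returns True if host is healthy after update
    rollback: Callable[[str], None],  # hypothetical: restores the previous version on a host
    stages: Sequence[float] = (0.05, 0.25, 1.0),  # assumed cumulative fleet fractions
    max_failure_rate: float = 0.01,               # assumed health threshold per stage
) -> bool:
    """Update the fleet stage by stage; roll everything back if a stage looks bad.

    Returns True if the whole fleet was updated, False if the rollout was rolled back.
    """
    updated = []  # every host touched so far, in order
    done = 0      # number of hosts already covered by earlier stages
    for fraction in stages:
        # Incremental rollout / gradual expansion: take the next slice of the fleet.
        target = max(1, int(len(hosts) * fraction))
        batch = hosts[done:target]
        if not batch:
            continue
        failures = 0
        for host in batch:
            updated.append(host)
            if not update(host):
                failures += 1
        done = target
        # Monitoring and analysis: check this stage's failure rate before expanding.
        if failures / len(batch) > max_failure_rate:
            # Rollback capability: restore every host touched so far and stop.
            for host in updated:
                rollback(host)
            return False
    return True
```

With a 100-host fleet and an update that breaks every machine it touches, this sketch stops after the first 5 hosts and rolls them back, leaving the other 95 untouched. That is the whole point of the approach: a bad update hurts a small slice of the fleet instead of all of it at once.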
This approach is pretty common in the industry today. Had they used this process, it is likely no one would be talking about this event at all. Blaming the free market for this is frankly absurd when you consider the whole event. If anything, the free market makes these events less widespread. If you want to point blame away from CrowdStrike, consider fiat time preference forces and central planning's effects on industry.