On July 19, 2024, a significant outage at CrowdStrike, a leading cybersecurity firm, sent ripples through the business world, highlighting the critical need for robust continuity and contingency plans. Elon Musk has described this outage as “The biggest IT fail ever,” and indeed, this incident makes Robert Morris Jr.’s accidental release of the first Internet worm look like child’s play. The incident not only disrupted services for numerous organizations but also underscored several key lessons for the modern digital age.
The Importance of Continuity and Contingency Plans
The CrowdStrike outage serves as a stark reminder of the importance of having well-developed and thoroughly tested business continuity, contingency, and incident response plans. Businesses that had these plans in place were better prepared to handle the disruption, minimizing downtime and operational impact. Continuity plans ensure that critical business functions can continue during and after a disruption, while contingency and incident response plans provide a roadmap for responding to unforeseen events.
In the face of an outage like this, organizations with comprehensive plans can pivot quickly, using backup systems and alternative workflows to maintain essential operations. Without these plans, businesses risk prolonged downtimes, data loss, and severe financial consequences.
The Dangers of Reliance on a Single Security Provider
CrowdStrike’s outage also highlights the dangers of over-reliance on a single security firm. While CrowdStrike is a leader in the cybersecurity industry, the outage illustrates that even the most reputable providers can experience failures. Organizations that depend solely on one provider may find themselves vulnerable when that provider encounters issues.
To mitigate this risk, businesses should consider diversifying their cybersecurity solutions. Using multiple vendors and layered security approaches can provide redundancy and reduce the impact of a single point of failure. This strategy ensures that if one provider goes down, others can continue to offer protection and support.
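As a rough illustration of the redundancy idea above, the following Python sketch walks an ordered list of providers and returns the first one that reports healthy. The `Provider` class, the `health_check` callables, and the `first_healthy` function are hypothetical names invented for this example; real security products rarely fail over this simply, so treat it as a sketch of the fallback logic rather than an implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Provider:
    """A hypothetical security or monitoring provider (illustrative only)."""
    name: str
    health_check: Callable[[], bool]  # returns True when the provider is usable


def first_healthy(providers: Sequence[Provider]) -> Optional[Provider]:
    """Return the first provider, in priority order, whose health check passes.

    A health check that raises is treated the same as one that returns False,
    so a broken primary cannot block the fallback to a secondary.
    """
    for provider in providers:
        try:
            if provider.health_check():
                return provider
        except Exception:
            # Assumption: any error from a check means "not available".
            continue
    return None  # No provider available: time for the manual contingency plan.
```

Listing the primary vendor first and a secondary after it means an outage at the primary causes the next healthy provider to be selected. A return value of `None` is the signal to fall back on manual procedures.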
What Do You Do When the Firm You’ve Hired to Provide You with Security and Resiliency Services Is the Cause of the Incident?
When the firm on which you rely for resiliency services is the cause of the outage, the impact on trust can be devastating. While the CrowdStrike outage does not appear to have affected the confidentiality or integrity of client information and systems, it severely impacted the often overlooked, yet equally important, “A” of the CIA triad: availability. For most organizations, availability is as critical as the confidentiality and integrity of their systems and data. In such scenarios, organizations must have backup, response, and recovery plans that include alternate providers and internal response capabilities (often manual processes that do not rely on computer resources) so they can maintain operations, meet service level agreement requirements, and minimize disruption as much as possible. This underscores the critical necessity for a multilayered approach to resiliency and contingency planning.
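To make the link between availability and service level agreements concrete, here is a small Python sketch of the arithmetic. The function names and the 30-day measurement period are assumptions chosen for illustration; actual SLAs define their own measurement windows and exclusions.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


def sla_breached(outage_minutes: float, sla_percent: float, period_days: int = 30) -> bool:
    """True when a single outage alone exceeds the period's downtime budget."""
    return outage_minutes > allowed_downtime_minutes(sla_percent, period_days)
```

Under these assumptions, a 99.9% target over 30 days allows roughly 43 minutes of downtime, so an outage lasting a full working day would exhaust that budget many times over.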
The Critical Need for Stress Testing
Another lesson from the CrowdStrike outage is the importance of rigorous stress testing before deploying updates or patches to millions of users. Stress testing involves simulating extreme conditions to evaluate how systems perform under pressure. This process helps identify potential weaknesses and areas for improvement. I do not have all of the information needed to identify exactly what happened, but I do know that no patch or update should be deployed to millions of production endpoints without first being validated through accurate simulations in testing environments.
In the case of CrowdStrike, thorough stress testing might have revealed issues that could be addressed before causing widespread disruption. Organizations must adopt a culture of continuous testing and validation to ensure their systems can withstand high loads and unexpected challenges.
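One common way to put "test before you deploy to everyone" into practice is a staged (ring-based) rollout, which complements stress testing rather than replacing it. The Python sketch below is a minimal illustration under stated assumptions: the ring contents, the `apply_update` callback, and the 1% failure threshold are all hypothetical, and nothing here describes CrowdStrike's actual release process.

```python
from typing import Callable, Dict, List, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.01,
) -> Dict[str, object]:
    """Apply an update ring by ring, halting when a ring fails too often.

    `rings` is ordered from smallest and safest (for example, a test lab)
    to largest (the full production fleet). `apply_update` is a hypothetical
    callback that returns True when a host updates and stays healthy.
    """
    updated: List[str] = []
    for index, ring in enumerate(rings):
        failures = [host for host in ring if not apply_update(host)]
        updated.extend(host for host in ring if host not in failures)
        failure_rate = len(failures) / len(ring) if ring else 0.0
        if failure_rate > max_failure_rate:
            # Stop here: later (larger) rings never receive the bad update.
            return {
                "completed": False,
                "halted_at_ring": index,
                "updated": updated,
                "failed": failures,
            }
    return {"completed": True, "halted_at_ring": None, "updated": updated, "failed": []}
```

The design choice is that a faulty update is caught while it affects only a small ring, so the largest groups of endpoints are never touched.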
Testing and Revising Continuity Plans
Developing continuity plans is only the first step; these plans must also be tested and revised regularly. The CrowdStrike incident underscores the necessity of conducting realistic drills and simulations to evaluate the effectiveness of continuity plans. By doing so, organizations can identify gaps and areas for improvement.
Lessons learned from these tests should inform updates to continuity plans, ensuring they remain relevant and effective. Regular testing and revision build organizational resilience, enabling businesses to respond swiftly and effectively to disruptions.
The Role of Technology in Enhancing, Not Replacing, Processes
While technology, including generative AI, offers tremendous benefits, it should enhance our capabilities, not replace all processes. The CrowdStrike outage serves as a cautionary tale about over-reliance on technology. Computers and AI systems are powerful tools, but they are not infallible. Human oversight, judgment, and intervention remain crucial.
Organizations must strike a balance between leveraging technology and maintaining essential human processes. This approach ensures that when technology fails, humans can step in to manage and mitigate the impact.
The Future of AI and Technology Reliance
As we stand on the cusp of widespread generative AI adoption, the CrowdStrike outage raises questions about our increasing reliance on AI and computer systems. If this is what happens now, what will occur as we become more dependent on these technologies?
The future demands a proactive approach to resilience. Organizations must invest in robust continuity plans, diversify their technology providers, and maintain a healthy balance between human and machine roles. Stress testing and continuous improvement should become integral parts of business strategy, ensuring that as our reliance on AI grows, our resilience to disruptions does as well.
Conclusion
The CrowdStrike outage of July 19, 2024, is a wake-up call for businesses worldwide. It highlights the essential need for continuity and contingency plans, the risks of reliance on a single provider, and the critical importance of stress testing. As we navigate the dawn of generative AI and increasing technological reliance, these lessons become even more pertinent. By adopting a proactive, balanced approach to technology and resilience, organizations can better prepare for the challenges of the future, ensuring that disruptions are managed effectively and continuity is maintained.
By: Ryan Meglathery, CISSP, MBA
Executive Principal, Excellens Consulting