Microsoft and CrowdStrike Global Outage: How could Hexnode have helped?

Lizzie Warren

Jul 24, 2024

7 min read

Our newsfeeds are buzzing with the headline: “Microsoft CrowdStrike Global Outage.” This incident sent shockwaves across industries, leaving businesses and individuals grappling with the consequences. But what exactly happened? How did it come to this? And most importantly, why did it occur? If you’re asking these questions, you’re in the right place, because we have the answers you need!😉

Before we jump in…

To fully grasp the magnitude of this event, it’s essential to understand the key players involved. Microsoft, a household name in software and cloud computing, and CrowdStrike, a leader in cybersecurity, are pivotal in safeguarding digital infrastructure worldwide. Their tools and services are integral to countless businesses, providing essential protection against cyber threats. The outage involving these titans of technology is a stark reminder of the fragility of our interconnected digital ecosystem.

Microsoft CrowdStrike Global Outage: What really happened?

The global outage began late on July 18th in US time zones (04:09 UTC on July 19th), when Windows machines running CrowdStrike’s Falcon sensor started crashing. The trigger was a content update to the Falcon sensor, a security software that operates at the core (or kernel) level of the Windows operating system. The update was meant to improve the sensor’s ability to detect cyber threats. However, a logic error in how the sensor handled the new content caused it to crash Windows, producing the infamous “blue screen of death” on affected computers: the system halts and can’t operate normally. Because the CrowdStrike software runs at kernel level within the operating system, the faulty update had a widespread and severe impact, and many affected devices could not even finish booting.

What was the effect?

The impact was severe and widespread, affecting various sectors globally. Critical services like air travel faced massive disruptions, with thousands of flights canceled and significant delays. The healthcare sector also suffered, with some surgeries postponed and emergency services experiencing outages. This incident highlighted the crucial role of cybersecurity software in our modern digital infrastructure.

Technical Analysis: What went wrong?

Since we have a picture of the scene, let’s dive into the technical details to understand what caused this massive outage.

The root cause:

The outage was triggered by the faulty Falcon sensor update described above, not by a cyberattack. Some initial reports also pointed to a configuration problem in Microsoft’s cloud setup: Azure did have an incident in one of its US regions at around the same time, which added to the confusion, but that was a separate issue. The CrowdStrike update is what caused the Windows crashes. This incident highlights the need for staged testing of updates, constant monitoring, and quick responses to prevent such outages.

The Impact of the outage

  • Operational disruptions: Many organizations faced downtime, impacting their daily operations and productivity.
  • Security vulnerabilities: With security services down, systems were left exposed to potential cyber threats.
  • Data access issues: Users were unable to access essential data, leading to delays in decision-making and potential data loss.
  • Financial losses: The downtime and vulnerabilities resulted in substantial financial losses for many businesses.

How did they fix it?

Normal reboots couldn’t fix the outage caused by the faulty CrowdStrike update, because the issue was rooted deep in the system’s core (kernel). When the computers tried to restart normally, they would just crash again, leaving them stuck in a loop of “blue screens of death.”

The solutions involved two main approaches: performing a “safe reboot” (booting into Windows Safe Mode) and “Recover from WinPE.”

  • Safe Reboot: This method starts Windows in Safe Mode, with only the essential programs and services running. It allowed IT teams to bypass the faulty CrowdStrike sensor update. In Safe Mode, they could disable, remove, or roll back the problematic update to an earlier version that didn’t cause crashes, thereby restoring normal operation and avoiding further blue screens.
  • Recover from WinPE (Recommended by Microsoft): This approach involves creating a bootable USB drive with the Windows Preinstallation Environment (WinPE) to repair the device. It provides a quick recovery without needing local admin rights. If BitLocker is enabled, you might need to enter the BitLocker recovery key. For devices with other encryption software, follow the vendor’s instructions to unlock the drive so the fix can be applied. If USB connections aren’t available, the Preboot Execution Environment (PXE) can be used, or reimaging the device might be necessary. It’s important to test these recovery methods on multiple devices before using them broadly.

By employing these methods, IT teams addressed the issues caused by the faulty update and restored affected systems to normal operation.
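To make the manual fix a bit more concrete: CrowdStrike’s public remediation guidance came down to deleting the faulty channel file (any file matching `C-00000291*.sys`) from the `C:\Windows\System32\drivers\CrowdStrike` folder and then rebooting. The snippet below is only a minimal sketch of that single step in Python. The function names and the `dry_run` flag are our own illustration, not part of any CrowdStrike, Microsoft, or Hexnode tool, and in practice most teams did this by hand or with the vendors’ own recovery scripts.

```python
from pathlib import Path

# File pattern and folder as published in CrowdStrike's remediation guidance.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"
DEFAULT_DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")


def find_faulty_channel_files(driver_dir: Path = DEFAULT_DRIVER_DIR) -> list[Path]:
    """Return the channel files matching the faulty pattern, sorted by name."""
    if not driver_dir.is_dir():
        return []
    return sorted(p for p in driver_dir.glob(CHANNEL_FILE_PATTERN) if p.is_file())


def remove_faulty_channel_files(
    driver_dir: Path = DEFAULT_DRIVER_DIR, dry_run: bool = True
) -> list[Path]:
    """Delete the matching files and return them.

    With dry_run=True (the default) nothing is deleted; the matches are
    only reported so they can be reviewed first.
    """
    matches = find_faulty_channel_files(driver_dir)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling `remove_faulty_channel_files()` with the defaults only lists what would be removed; passing `dry_run=False` performs the deletion. Defaulting to a dry run is a deliberate choice here, since the folder also holds channel files that the sensor still needs.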

Main challenges in the recovery

  • Technical hurdles in recovery: The recovery process was difficult because many devices needed manual fixes. One major problem was the lack of a phased update rollout, which could have lessened the impact. Companies had to deploy hundreds of engineers to work directly with affected systems and use specific tools to restore PCs. 
  • Cloud vs. On-Premises recovery: Fixing issues in cloud environments like AWS, Azure, and GCP presented unique challenges compared to traditional on-premises systems. Cloud platforms don’t support conventional recovery methods like “safe mode,” so administrators had to use more complex procedures to resolve problems. 
  • The role of BitLocker: Microsoft’s disk encryption technology, BitLocker, played a dual role. While it provided essential security, it also made recovery efforts more complicated. Accessing the BitLocker Recovery Key was necessary to manage disks securely, adding another layer of complexity to the recovery process. 
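Because recovery keys often had to be read out and typed in by hand, a quick format check can catch typos before a technician is standing at the device. The sketch below relies on our understanding of the BitLocker numerical recovery password format: 48 digits in eight groups of six, where each group is a multiple of 11 and no larger than 65535 × 11. The function name is hypothetical, and the check only validates the format; it cannot tell whether a key actually belongs to a given drive.

```python
import re

GROUP_COUNT = 8
# Each 6-digit group encodes a 16-bit value multiplied by 11.
MAX_GROUP_VALUE = 65535 * 11


def invalid_groups(recovery_password: str) -> list[int]:
    """Return the 1-based positions of groups that fail the format check.

    Raises ValueError if the input is not eight groups of six digits
    separated by dashes or whitespace.
    """
    groups = re.split(r"[-\s]+", recovery_password.strip())
    if len(groups) != GROUP_COUNT or not all(re.fullmatch(r"\d{6}", g) for g in groups):
        raise ValueError("expected 8 groups of 6 digits")

    bad_positions = []
    for position, group in enumerate(groups, start=1):
        value = int(group)
        if value % 11 != 0 or value > MAX_GROUP_VALUE:
            bad_positions.append(position)
    return bad_positions
```

An empty list means every group passed. Returning the positions of the failing groups, rather than a plain yes or no, mirrors how the recovery screen is filled in group by group, so the person typing knows which block to re-check.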

How could UEMs have helped?

This outage serves as a powerful reminder of the importance of having robust Unified Endpoint Management (UEM) solutions in place. UEMs like Hexnode can play a crucial role in preventing and mitigating the effects of such incidents. Here’s how:

  • Centralized management:
    Hexnode allows IT administrators to manage and monitor all endpoints from a single console. This centralized approach enables rapid detection and response to potential issues, minimizing downtime and disruption.
  • Automated patch management:
    Regular updates and patches are critical to maintaining system security. Hexnode’s patch management ensures that all devices are up to date with the latest security patches, reducing vulnerabilities that could be exploited during an outage. Additionally, Hexnode allows administrators to block updates if necessary, ensuring control over the update process. A phased rollout for updates can also help avoid similar incidents in the future.
  • Real-time monitoring and alerts:
    With real-time monitoring and alert systems, Hexnode can quickly identify unusual activities or potential threats. Immediate alerts enable swift action, preventing minor issues from escalating into major outages.
  • Remote troubleshooting:
    During an outage, remote troubleshooting capabilities allow IT teams to diagnose and resolve issues without being physically present. Hexnode’s remote access features facilitate quick fixes, reducing the impact of the disruption.
  • Compliance and security policies:
    Hexnode enforces compliance and security policies across all devices, ensuring that security protocols are consistently applied. This reduces the risk of security breaches and ensures that all endpoints adhere to best practices.
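The phased rollout mentioned under patch management is easy to picture in code. What follows is a generic sketch of the idea, not Hexnode’s implementation or API: devices are assigned to rollout rings by hashing their ID, and the update only moves to the next ring if the crash rate among already-updated devices stays under a threshold. The ring names, percentages, and the 0.1% threshold are all assumptions made up for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Ring:
    name: str
    percent: int  # cumulative share of the fleet covered once this ring is enabled


# Hypothetical ring plan: 1% canary, then 10%, 50%, and finally everyone.
RINGS = [Ring("canary", 1), Ring("early", 10), Ring("broad", 50), Ring("all", 100)]


def bucket(device_id: str) -> int:
    """Map a device ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def ring_for(device_id: str) -> Ring:
    """Return the first ring whose cumulative percentage covers the device."""
    device_bucket = bucket(device_id)
    for ring in RINGS:
        if device_bucket < ring.percent:
            return ring
    return RINGS[-1]  # not reached while the last ring covers 100%


def next_ring_allowed(crashed: int, updated: int, max_crash_rate: float = 0.001) -> bool:
    """Allow widening the rollout only if updated devices have reported in
    and their crash rate is at or below the threshold."""
    if updated == 0:
        return False
    return crashed / updated <= max_crash_rate
```

Hashing the device ID keeps the assignment stable across runs without storing any state, and the gate in `next_ring_allowed` is where a crash loop like this one would stop the rollout at the canary ring instead of reaching the whole fleet.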

How to get back on track?

For businesses already affected by the blue screen of death, Hexnode UEM can be instrumental in getting operations back on track:

  • BitLocker recovery password management: Hexnode offers the option to escrow the BitLocker recovery password to the Hexnode UEM portal, simplifying secure access and management during recovery efforts.
  • Enhanced remote support: With Hexnode’s robust remote support capabilities, IT teams can efficiently troubleshoot and resolve issues on affected endpoints, reducing downtime and restoring normalcy quickly.

Conclusion

The Microsoft-CrowdStrike global outage has been a sobering reminder of the vulnerabilities inherent in our digital world. By examining what happened and learning from the response efforts, we can better prepare for future incidents. With Hexnode UEM, you can regain control and ensure your systems are secure and operational after an outage. From centralized management and automated patch updates to remote troubleshooting and comprehensive reporting, Hexnode provides all the tools you need to get back on track quickly and efficiently. Don’t let an outage disrupt your business—let Hexnode help you stay resilient and prepared for any future challenges.
