How SIRKit Handled the CrowdStrike Incident: A Night of Teamwork (and Why We Still Trust CrowdStrike)
In the fast-paced world of IT, things go sideways at the most inconvenient times. A couple of months ago, we experienced this firsthand when a routine update from CrowdStrike caused an unexpected server outage for a large portion of our clients. Servers were blue-screening and refusing to boot back up. Fortunately, our response was swift, coordinated, and highly effective. This incident tested our Incident Response (IR) and Disaster Recovery (DR) plans in real time, and we’re proud to say our team was able to minimize the impact and get our clients back online before their businesses were disrupted. Here’s how the night unfolded, why having a solid MSP partner is incredibly valuable when disaster strikes, and why we still stand firmly behind CrowdStrike as the best in the business.
The Start of a Long Night: A Midnight Call to Action
It was around 11:30 PM Mountain Time when our 24/7 monitoring systems went into overdrive, reporting widespread outages across multiple client servers. Normally, we see the occasional alert at night, but this time it was different. The scale of the outage was massive, and all signs pointed to something unusual. We immediately called in our Incident Response (IR) team, waking up additional staff to jump in and help.
Our initial investigation showed that something had triggered a blue screen of death (BSOD) on many servers and some PCs. Within 30-40 minutes, news about a problematic update from CrowdStrike started trickling in from overseas. The update had been released hours earlier, and IT professionals in Europe were scrambling to resolve the issue.
CrowdStrike’s Fast Response and Manual Fixes
CrowdStrike’s support team quickly jumped into action, posting a fix in their portal and rolling back the changes to prevent further impact. While we were frustrated by the situation, we had to give CrowdStrike credit for their rapid diagnosis and response. Having a fix already available was a lifesaver.
However, the fix wasn’t simple. It involved booting each affected server into its recovery console and manually applying changes there during the start-up process. This meant hands-on work for each server, which is tedious and time-consuming when you account for potentially thousands of systems around the country. But we weren’t going to let that stop us.
Full-Scale Response: Prioritizing Critical Systems
Using our Remote Monitoring and Management (RMM) tools, we quickly compiled a list of all impacted servers.
By 12:30 AM, our team was in full swing, systematically applying the fix manually using remote tools. We started with 24/7 operations, ensuring that businesses relying on continuous uptime were restored first. Next, we tackled servers for businesses set to open in a few hours, aiming to restore operations well before the start of the business day.
We proactively dispatched on-site technicians for the few systems we couldn’t fix remotely. They arrived at client sites as the businesses opened, ensuring the fix was applied quickly and that each business could carry on as usual.
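For readers who want to see the triage order spelled out, here is a minimal sketch in Python. It is an illustration only, not our actual RMM tooling: the `Server` fields, the `hours_until_open` value, and the `build_queues` helper are all hypothetical names invented for this example. It assumes the impacted-server list has already been exported and that each entry is tagged as 24/7 or not.

```python
from dataclasses import dataclass


@dataclass
class Server:
    # All fields are hypothetical; a real RMM export will look different.
    name: str
    runs_24x7: bool
    hours_until_open: float = 0.0  # ignored for 24/7 sites
    remote_fixable: bool = True


def triage_key(server: Server) -> tuple:
    """24/7 operations sort first, then sites by how soon they open."""
    return (0 if server.runs_24x7 else 1, server.hours_until_open)


def build_queues(servers: list) -> tuple:
    """Split impacted servers into a remote-fix queue and an on-site dispatch list."""
    ordered = sorted(servers, key=triage_key)
    remote = [s for s in ordered if s.remote_fixable]
    onsite = [s for s in ordered if not s.remote_fixable]
    return remote, onsite
```

Calling `build_queues` on the impacted list returns two ordered lists: one to work through with remote tools, and one to hand to dispatch for the systems that need a technician on site.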
The Power of a Well-Oiled Team
By 4:00 AM, we had restored services to the vast majority of our clients around the country, with only a handful of cases left. Communication was key throughout the night. We sent regular updates to all clients, keeping them informed of the situation, progress, and expectations. This level of transparency reassured our clients that we were in control, working nonstop to protect their businesses.
Our priority was to ensure most clients were fully operational before they opened for business. By the time employees started arriving at work, servers were already online and functioning, minimizing any noticeable disruptions. In fact, many businesses had no idea anything had even happened unless they checked their email.
Starting around 7:00 AM, a few users called our helpdesk with issues related to their PCs. These were the 5% of PCs that the bug impacted, so the overall impact was minimal.
By 11:00 AM, nearly all support tickets had been resolved. It was pretty much over.
Our quick, organized response meant that while other businesses across the world were facing extended outages, our clients were back up and running before they even opened their doors.
We Still Stand Behind CrowdStrike
It’s important to address the elephant in the room: CrowdStrike’s update caused this issue, but we still stand firmly behind their product. Yes, bugs happen—and they happen with all software, no matter how carefully it’s developed. But here’s the key: CrowdStrike responded quickly, their support team was on it, they acknowledged their mistake, and they provided a solution in record time. That’s the mark of a company dedicated to its clients and partners.
Even with this hiccup, we’re confident that CrowdStrike remains the world’s #1 Endpoint Protection solution.
The facts back it up:
- According to independent tests by AV-Comparatives, CrowdStrike has a 99.7% malware detection rate. That’s nearly flawless detection across millions of threats.
- In the 2023 Gartner Magic Quadrant for Endpoint Protection Platforms, CrowdStrike was named a Leader for the fourth year in a row, consistently ranking at the top for both completeness of vision and execution.
- CrowdStrike’s 1-10-60 benchmark sets the bar at detecting a threat within 1 minute, investigating it within 10 minutes, and containing it within 60 minutes, targets that few competitors can consistently match.
- According to Forrester’s Total Economic Impact Report, companies using CrowdStrike save an average of $1.4 million annually in security costs due to reduced breaches and downtime.
Bugs Happen, Security is the Priority
As much as we wish every software update could roll out flawlessly, the reality is that technology is complicated, and issues will occasionally arise. What matters most is how quickly a solution is found and how secure your business remains before and after the event.
CrowdStrike’s next-gen security capabilities, including machine learning and behavioural AI, are still among the best in the world, and they’ve played a huge role in keeping our clients safe from ransomware, malware, and other threats. In fact, over 14,000 global customers, including 65 of the Fortune 100, trust CrowdStrike to protect their most critical assets. That’s a strong endorsement from some of the world’s largest and most security-conscious organizations.
Key Takeaways: Why IR & DR Plans Matter
We often stress the importance of Incident Response (IR) and Disaster Recovery (DR) plans to our clients, and this incident was a prime example of why these plans are critical. While many businesses, including major airlines, were affected for extended periods, we got our clients up and running within hours. It wasn’t magic—it was careful planning, thorough testing, and a highly skilled team that knew exactly what to do when things went wrong.
Even though only 5% of PCs were affected, the downtime could have been disastrous if we hadn’t reacted so quickly. The fact that the vast majority of servers were restored before businesses opened shows just how effective our planning and execution were.
Why Your MSP Should Be Ready for Anything: Final Thoughts
When disaster strikes, your Managed Service Provider (MSP) must be ready to jump into action immediately. Whether it’s a failed update, a cyberattack, or another unforeseen event, you need a partner who goes above and beyond to protect your business. The CrowdStrike incident was an opportunity for us to prove exactly that.
While many organizations around the world were offline for extended periods, SIRKit’s clients were back online within hours. Most were fully operational before their businesses even opened, and the handful of remaining issues were quickly resolved. This is the level of service you should expect from your MSP—a team that reacts fast, communicates clearly, and solves problems with precision.
This incident also highlights the importance of regularly testing your Disaster Recovery (DR) and Incident Response (IR) plans. Having them in place is not enough; they need to be tested under pressure to ensure they work in real scenarios. At SIRKit, our response was a success because we’ve tested and refined our processes over time, and we’re proud that we delivered on our promise to protect our clients’ businesses.
In today’s world, IT issues are inevitable, but unreasonable downtime doesn’t have to be. Whether it’s a bad update or a critical security threat, your MSP should be ready for anything.
Not loving your MSP? Let’s chat.