What’s Old is New: The CrowdStrike Outage – “Sometimes patches break things…”

Posted July 25, 2024 by Kevin Finch 

It was all over the news, and it probably affected your life in some way too. Security vendor CrowdStrike pushed out a faulty software update that made Windows PCs crash on reboot, just after midnight Eastern Time on the morning of July 19th, 2024. Whether you were stranded at an airport, had problems crossing the border, had problems with your health care network, couldn’t get a new driver’s license, or simply couldn’t order your Starbucks, you were one of the millions of people whose lives were impacted. This is the fourth blog in my “What’s Old Is New” blog series.

The CrowdStrike Outage: What Happened?

As a slightly oversimplified recap: there was a problem with a specific file included in the latest software update to CrowdStrike’s Falcon product. Once installed, this update triggered a logic error that would freeze the computer, and that error would thereafter prevent the system from loading until it was fixed. The fix was as easy as manually going into the system and deleting the errant file so that the computer wouldn’t try to read it during startup. A skilled operator with the appropriate tools and credentials could fix the problem in a matter of minutes on any affected system. (Technical details here.) Since this was a problem specific to CrowdStrike customers, Microsoft estimated that less than 1% of all Windows machines worldwide were impacted.
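For those curious what that manual fix looked like in practice, the widely reported remediation was to boot the affected machine into Safe Mode or the Windows Recovery Environment and delete the faulty channel file from CrowdStrike’s driver directory. The sketch below is purely illustrative: it assumes the publicly reported path and file-name pattern, and it is not an official remediation script.

    # Illustrative sketch only: assumes the machine has been booted into Safe Mode
    # and that the publicly reported location and pattern of the faulty channel
    # file (C-00000291*.sys) apply. Always follow the vendor's official guidance.
    from pathlib import Path

    CROWDSTRIKE_DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

    def remove_faulty_channel_files(directory: Path = CROWDSTRIKE_DRIVER_DIR) -> int:
        """Delete channel files matching the reported faulty pattern; return how many were removed."""
        removed = 0
        for channel_file in directory.glob("C-00000291*.sys"):
            print(f"Removing {channel_file}")
            channel_file.unlink()
            removed += 1
        return removed

    if __name__ == "__main__":
        print(f"Removed {remove_faulty_channel_files()} file(s)")

Simple as that sounds, it still had to be done by hand, with administrator access, on every affected machine.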

 “80 percent of my problems are simple logic errors. 80 percent of the remaining problems are pointer errors. The remaining problems are hard.”

Mark Donner, Google

I’m not here to talk about CrowdStrike or Microsoft. What I will say is that I don’t think a problem like this would have happened 15 or 20 years ago, when I was working as a System Administrator. The reason is that the problem CrowdStrike customers experienced that Friday is one we dealt with all the time as system administrators, and we took preventative measures. It’s obvious to me (based on how widespread this issue was) that the kind of preventative measures we used to take are still valid ones for companies to consider today.

Over a decade ago, when I was a System Administrator for a financial services firm, we received patches for software all the time, just like companies do today. Patches and updates are a commonplace occurrence, and I think they always have been. The difference for us then, however, was that we had strict service level agreements (SLAs) with our customers, with harsh penalties for system downtime. When patches came down as recommendations from any vendor, we loaded them onto test systems and let them run for a few days to see if they caused any problems. Most of the time periodic patching didn’t cause any issues, but we were always careful to make sure that only stable patches got pushed out to the environment. There were also some vendors that were notorious for pushing out a patch, and then pushing out another patch to fix the patch a few days later. Waiting to go to production avoided that nonsense too. By having an established business practice of testing and waiting before pushing patches into production, we avoided all sorts of problems. A company following this old-school practice might have completely missed any impact from this latest CrowdStrike issue.
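To make that “test first, then promote” practice concrete, here is a minimal sketch of a ring-based rollout policy. The ring names, soak periods, and data structures are all hypothetical and shown only for illustration; real values would depend on your SLAs and risk tolerance.

    # Hypothetical sketch of ring-based patch promotion: a patch "soaks" on test
    # systems for a few days before it becomes eligible for production.
    from dataclasses import dataclass
    from datetime import date, timedelta

    @dataclass
    class Patch:
        name: str
        received: date  # the day the vendor released the patch to us

    # Illustrative rings and soak times, not a recommendation.
    ROLLOUT_RINGS = [
        ("test-lab", timedelta(days=0)),     # install immediately on test systems
        ("pilot-group", timedelta(days=2)),  # a small set of real users
        ("production", timedelta(days=5)),   # everything else, once it has proven stable
    ]

    def rings_eligible(patch: Patch, today: date) -> list[str]:
        """Return the rings this patch may be deployed to as of `today`."""
        age = today - patch.received
        return [ring for ring, soak in ROLLOUT_RINGS if age >= soak]

    if __name__ == "__main__":
        patch = Patch("vendor-update-2024-07", received=date(2024, 7, 15))
        print(rings_eligible(patch, today=date(2024, 7, 19)))  # ['test-lab', 'pilot-group']

The exact mechanics matter less than the principle: nothing reaches production until it has quietly survived somewhere less important first.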

A Proactive Approach

 “Vladimir: “Well? What do we do?”

Estragon: “Don’t let’s do anything. It’s safer.”

Samuel Beckett, Waiting for Godot

I realize that taking this approach does take some extra effort, and switching to it would likely put a strain on many companies’ already-taxed IT administrator resources. However, if it avoids a situation that requires an administrator to log into every single computer in your company in order to rectify an issue, then I think it’s entirely possible that this old-school “trust but verify” approach to patching could pay for itself in the long run. Also, avoiding a widespread business interruption while your competitors are suffering from one could yield big dividends in terms of market share or increased revenue in some industries.

Let’s be clear here: I am not advocating that you avoid patching systems, especially ones dealing with information security. There’s a delicate balance to be struck between the desire to push software updates out as quickly as possible and the risk mitigation that comes from testing everything before it goes into production. There’s also extra work involved in pre-testing software updates, and many companies don’t want to commit the resources. Also worth considering is the urgency of updates: if a patch is issued to deal with a specific vulnerability or exploit, most companies will want to fast-track it.

However, a lot of “routine” system patching goes on every day at companies, and I think the benefits of delaying some of those patches for a day or two far outweigh the potential risks. A lot of companies have streamlined this concept: they wait until a particular day of the week to push out all of their patches, rather than sending them out immediately as they arrive from various vendors. This approach, again, might have avoided system impacts from a scenario comparable to what happened that Friday.

 “There is a charm, even for homely things, in perfect maintenance.”

Louis Auchincloss, American Lawyer and Novelist

I think this all comes down to understanding the way that various software products (and their patches) interact with your computing environment. That knowledge, combined with intelligent analysis of the risks presented by some applications and their patching processes, can give you insight into which precautions might be necessary to minimize downtime due to maintenance.

Need some help breaking down and understanding the structure of some of the critical applications in your environment? Unsure whether you should take a more proactive approach to application maintenance management? Sayers is here to help. Our team has decades of experience helping companies like yours minimize the risks of making updates to their environment and maximize uptime.
