How to Mitigate the Risk of Systemic Software Failure

February 27, 2025

The CrowdStrike outage was a vivid reminder of how interconnected the world’s systems are – and how dramatically every organization can be affected. So in the aftermath of the outage, every C-suite needs to be asking: how do we mitigate the risk of systemic software failure in our organization?

I want to explore what I think are some of the answers in this blog. But first, it’s helpful to understand how and why we became so reliant on third-party software in the first place.

The Original Software Development Lifecycle

Not too many years ago, the software development lifecycle (SDLC) took months, if not years.

Software was installed on-premises. We would only deploy it after extensive and exhaustive testing.

Each time we wanted to upgrade the software to a newer, stable release, we’d go through the same process again. Some organizations did this each year. Many more waited several years because the investment in time and money was too much to consider on an annual basis.

The process was incredibly costly and incredibly inefficient. On the other hand, we were in control of our own destinies when something went wrong.

What changed?

Every Company is a Tech Company

Well, everything changed.

In the past, an organization might have had a single ERP system.

Today, multiple software tools underpin every business. We have the core IT we depend on to ‘keep the lights on’. We also have the IT that each individual department or business unit uses – manufacturing apps, product design tools, customer support portals and so on, each one likely talking to the others.

As the saying goes, every company is a tech company now.

The CIO is still expected to have oversight of all the tools in use, but the size of that task has changed beyond all recognition. At the same time, the CIO is asked each year to deliver all three of better, faster, and cheaper.

In today’s world, time- and resource-intensive ways of working are no longer viable. It simply isn’t feasible to have the same monolithic update processes we used when we only relied on a handful of systems.

SaaS Helps Us Achieve Better, Faster, and Cheaper

The software as a service (SaaS) model provided the answer that was needed. Essentially, the model allows us to outsource maintenance and updates of software to third parties.

It has helped CIOs achieve the seemingly impossible and deliver the holy grail of better, faster, and cheaper.

Consequently, it’s been enthusiastically embraced by business. Research suggests that in 2022, a typical organization used 130 SaaS applications.

Because updates are rolled out monthly, weekly, or even daily, organizations are always harnessing the best the technology has to offer.

On the other hand, when something goes wrong, we can no longer fix it by going down to the basement and ‘turning it off and on again’.

When there’s a bug, the IT team is dependent on the third party to find it, fix it, and roll out a revised update.

When an upgrade causes conflicts with other systems, IT teams are forced to be reactive, developing a solution that will resolve the clash.

When a tool used by a large percentage of the world goes wrong, chaos ensues, as we saw recently.

The SaaS model undoubtedly brings huge benefits. The modern world of business couldn’t and wouldn’t exist without it. But organizations are no longer as proactive or in control as they would like to be.

So what’s the answer?

Resilience is the Answer

In my view, we need to focus on resilience in IT and put in place the processes that will allow us to take back control. Here are four ways to think about it.

Fail Forward

Firstly, it’s a fact of life that in a world where we’re all dependent on software there will always be bugs. The only variable is how serious they are.

The task, therefore, is to understand what our options are if something fails.

If you’re a company that releases software, what’s your testing strategy? What’s your rollout strategy? How do you revert to a previous stable release if something goes wrong?
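To make the rollout and revert questions concrete, here is a minimal Python sketch of a staged rollout that falls back to the previous stable release when a health check fails. Everything in it is an assumption made for illustration: the function names, the version strings, the stage fractions, and the 1% error threshold are hypothetical and do not describe any particular vendor's process.

```python
from typing import Callable, Sequence


def staged_rollout(
    new_version: str,
    stable_version: str,
    stages: Sequence[float],
    error_rate_at: Callable[[str, float], float],
    max_error_rate: float = 0.01,
) -> str:
    """Return the version left in service once the rollout finishes or aborts.

    stages lists the fraction of users receiving new_version at each step.
    error_rate_at is a hypothetical health probe that reports the observed
    error rate for a version at a given fraction of users.
    """
    for fraction in stages:
        observed = error_rate_at(new_version, fraction)
        if observed > max_error_rate:
            # Revert: stop widening the rollout and keep the stable release.
            return stable_version
    return new_version


def fake_probe(version: str, fraction: float) -> float:
    # Illustrative data only: the hypothetical "2.0.0" release starts
    # failing once it reaches 10% of users.
    if version == "2.0.0" and fraction >= 0.10:
        return 0.05
    return 0.001


print(staged_rollout("2.0.0", "1.9.3", (0.01, 0.10, 0.50, 1.0), fake_probe))
```

The point of the sketch is the shape of the decision, not the numbers: a release only reaches everyone after it has survived smaller audiences, and the revert path is defined before the rollout starts.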

If you’re a company that relies on software, what are your options when something fails? What’s your fallback position?
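For a company on the consuming side, one possible fallback position can be sketched in a few lines of Python: try the third-party service a limited number of times, then switch to a degraded but working alternative. The names `call_with_fallback`, `vendor_lookup`, and `cached_lookup` are hypothetical, and the cached answer stands in for whatever fallback an organization actually has.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    max_attempts: int = 2,
) -> T:
    """Try the primary service up to max_attempts times, then use the fallback."""
    for _ in range(max_attempts):
        try:
            return primary()
        except Exception:
            # Any failure of the primary counts as an attempt used up.
            continue
    return fallback()


def vendor_lookup() -> dict:
    # Hypothetical third-party call that is currently failing.
    raise ConnectionError("vendor service unavailable")


def cached_lookup() -> dict:
    # Hypothetical degraded answer served from a local copy.
    return {"status": "stale", "source": "local cache"}


print(call_with_fallback(vendor_lookup, cached_lookup))
```

A stale answer from a local copy is rarely ideal, but deciding in advance what "degraded but working" looks like is the essence of having a fallback position.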

Have Someone Play Devil’s Advocate

The dangers of group-think are well-documented. In any decision-making process, including software investment decisions, make sure there’s someone playing devil’s advocate. Why do we need this software? What are the alternatives? What due diligence have we done on the provider?

Prioritize Transparency

The CrowdStrike outage showed very clearly that organizations rely on systems that rely on other systems that rely on other systems. Entire infrastructures depend on modems in data centers that people rarely – if ever – visit.

We need to demand a new level of openness and transparency that allows us to look ‘under the hood’ rather than trusting our providers to look after it.

As part of this, we should remember that cheaper rarely means better. We must be confident that a lower price doesn’t mean lower standards.
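One way to start looking under the hood is to write down which systems rely on which, and then compute everything a given system depends on, directly or indirectly. The Python sketch below does that for a small, entirely hypothetical dependency map; the system names are invented for illustration.

```python
from typing import Dict, List, Set


def transitive_dependencies(graph: Dict[str, List[str]], system: str) -> Set[str]:
    """Return every system that `system` relies on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(graph.get(system, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against circular dependencies
        seen.add(current)
        stack.extend(graph.get(current, []))
    return seen


# Hypothetical map: each key relies on the systems in its list.
dependencies = {
    "checkout": ["payments", "inventory"],
    "payments": ["identity"],
    "identity": ["endpoint-agent"],
    "inventory": [],
}

print(sorted(transitive_dependencies(dependencies, "checkout")))
```

In this made-up example, the checkout system never mentions the endpoint agent directly, yet it still depends on it through two intermediate systems. That hidden reach is what a transparency exercise is meant to expose.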

Introduce Quality Resilience Engineering

Finally, I think there’s scope for an entirely new role, one that’s tasked with engineering quality into our systems and developing the back-up plan for when things go wrong.

On a day-to-day basis, people in this role use tools such as Eggplant Monitoring and Eggplant Test to stay on top of their testing.

At a strategic level, they’re looking beyond SDLC and IT operations management to the bigger picture of an organization and its systems.

Their role is to put organizations back in control.

Not Old Days or New Days, Just Different Days

Our modern tech-enabled world depends on huge numbers of software systems. We can’t go back to the ‘old days’ where we had complete control – and nor would we want to. But we do have to think about how we can build resilience into our systems and minimize the risks at both a systemic level and an organizational level. I hope my four suggestions provide food for thought as we pursue this quest.
