Saturday, October 25, 2025

How Can CIOs Keep Operations Going During an Outage?


For hours on Monday, millions of users and more than 1,000 companies found themselves unable to connect to the internet. Social media platforms Reddit and Snapchat were hit, as were banks Lloyds Bank and Halifax. Even kids were affected, with popular games Fortnite and Roblox knocked offline. Sen. Elizabeth Warren (D-Mass.) took to X, describing the event as one that broke “the entire internet” and calling for a breakup of Big Tech.

“Networking is certainly a foundational component of AWS services,” said Corey Beck, director of cloud technologies at DataStrike and a former senior solutions architect at AWS. “When it stumbles in a region like US-East-1, the effects go way beyond; it ripples through EC2, S3, DynamoDB, RDS, and pretty much every service that depends on them.” 

Yet for many others, it was business as usual. This is because the outage affected only AWS customers — and specific ones at that. The source of the outage was a DNS failure at the AWS data center cluster known as US-EAST-1. It’s the largest of the provider’s clusters, and one that powers much of AWS’s internet access — but not all of it. And any business or individual who runs Microsoft or Google products was not affected at all. 

The outage launched mass conversations, ranging from the standard narrative on overdependency on single providers to the need for better testing protocols before rollout. In an ideal world, this scale of disruption would never happen again. But CIOs can’t rely on crossed fingers and dream scenarios. They need to determine what responsibility is on their shoulders when it comes to weathering a future outage, and decide whether the speed and efficiency gains of using a single provider outweigh the concentration risk of relying on that major cloud vendor.


Redundancy vs. Risk

While politicians discussed monopolies and users complained about website inaccessibility, IT leaders saw the outage as a call for better redundancy. The argument is quite clear: By building in backups and failover capacity, companies can spread out their reliance on any one point in their infrastructure. To not do so, some experts argued, would be operating at the edge. 

“Gamblers might choose to risk a core business capability by running it in a risky manner,” said Jon Brown, senior analyst for data protection, IT operations and sustainability at Omdia. “Personally, I’d advise on safety, as the failure of a poorly protected, high-profile, mission-critical application can lead to a resume-generating event, which most of us try to avoid. There is nothing more important than your customer and transaction data.”

This may seem obvious, but a thousand companies still lost digital functionality on Monday. Why weren’t they better prepared? One answer is that while redundancy isn’t new, it also isn’t very sexy. In a field full of innovation and growth, redundancy is about slowing down, checking your work, and taking the safest route. It’s not surprising if some companies are more excited about investing in new AI capabilities than implementing failsafe protocols. Nor is it necessarily wrong. 

“Sometimes, the smarter play is to accept limited disruption risk and redirect resources toward innovation, like AI or data modernization,” argued Chris Hutchins, founder and CEO of Hutchins Data Strategy Consulting. “But it must be an informed risk, not an assumed one.”

According to Hutchins, if there are areas of the business that CIOs can afford to pause in the event of a rare outage, the rewards from single-sourcing — cost savings, tighter integration and specialized expertise — may outweigh the operational risk. Tiago Azevedo, CIO at OutSystems, agreed on the need to see this as a financial calculation, made on an individual basis. Rather than being a default requirement, he said he sees redundancy as a targeted resilience investment. CIOs don’t need to protect every inch of their business to the same degree, as long as the key areas are substantially bolstered.

“The extent should reflect system criticality: production or customer-facing systems merit multi-region or multi-provider coverage, while development and test environments can tolerate brief downtime,” he said. “The objective isn’t to eliminate all risk but to align resilience spending with the potential cost of disruption.”
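The calculation Azevedo describes can be sketched as a simple comparison between the loss that redundancy is expected to avoid and what the redundancy costs. This is only an illustration: the `System` fields, the function names and every figure below are hypothetical assumptions, not numbers from any of the experts quoted here.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    outage_hours_per_year: float     # assumed expected downtime without redundancy
    cost_per_hour: float             # assumed business cost of one hour offline
    redundancy_cost_per_year: float  # assumed yearly cost of the failover setup
    downtime_reduction: float        # assumed share of downtime removed (0 to 1)


def avoided_loss(system: System) -> float:
    """Yearly outage cost that redundancy is expected to prevent."""
    return (system.outage_hours_per_year
            * system.cost_per_hour
            * system.downtime_reduction)


def redundancy_pays_off(system: System) -> bool:
    """True when the avoided loss exceeds the yearly redundancy cost."""
    return avoided_loss(system) > system.redundancy_cost_per_year


# Hypothetical figures for a customer-facing system and a test environment.
checkout = System("checkout", 4.0, 50_000.0, 120_000.0, 0.9)
dev_env = System("dev-environment", 4.0, 500.0, 20_000.0, 0.9)

for s in (checkout, dev_env):
    print(s.name, round(avoided_loss(s)), redundancy_pays_off(s))
```

With these made-up inputs the customer-facing system clears the bar and the test environment does not, which mirrors the tiering Azevedo recommends. A real assessment would also need to weigh harder-to-price factors such as trust and regulatory exposure.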

Mapping out the Mission-Critical

To determine where CIOs should direct redundancy efforts, IT leaders argued that there needs to be honesty and understanding around what aspects of infrastructure are actually fundamental to business operations. An outage can happen at any time, both within internal systems and at any third-party provider, meaning that CIOs can’t delay taking strategic action.

Over time, a company may be able to introduce redundancy at a more comprehensive level across all infrastructure, but this might not make the most financial sense. As Hutchins described it, “redundancy that isn’t tied to a clear recovery objective quickly becomes technical debt.” So, it’s imperative that CIOs audit their business dependencies, identify single points of failure, and rank systems based on their impact on operations and trust.

“It is important to invest where failure creates real risk, not just minor inconvenience, or noise,” he added. 
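A dependency audit of this kind can be approximated in a few lines. The sketch below assumes a hand-maintained inventory in which each system carries an impact score and a list of locations for every dependency. The system names, scores and regions are hypothetical, and a real audit would draw on actual asset and contract data.

```python
# Hypothetical inventory: impact is scored 1 (minor) to 5 (mission-critical),
# and each dependency lists the regions or providers it runs in.
inventory = {
    "payments-api": {
        "impact": 5,
        "deps": {"database": ["us-east-1"], "dns": ["us-east-1", "us-west-2"]},
    },
    "customer-portal": {
        "impact": 4,
        "deps": {"auth": ["us-east-1"], "storage": ["us-east-1", "eu-west-1"]},
    },
    "internal-wiki": {
        "impact": 1,
        "deps": {"database": ["us-east-1"]},
    },
}


def single_points_of_failure(inventory):
    """Return (impact, system, dependency) for every dependency that
    exists in fewer than two distinct locations, highest impact first."""
    findings = []
    for system, info in inventory.items():
        for dep, locations in info["deps"].items():
            if len(set(locations)) < 2:
                findings.append((info["impact"], system, dep))
    findings.sort(key=lambda f: (-f[0], f[1], f[2]))
    return findings


for impact, system, dep in single_points_of_failure(inventory):
    print(f"impact {impact}: {system} depends on a single-location {dep}")
```

Sorting by impact puts the payments database at the top of the list and the internal wiki at the bottom, which is the ordering that lets spending go to real risk rather than to noise.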

This will look different for companies of different sizes, but particularly for companies within different sectors. Some industries, such as healthcare or finance, require a higher level of redundancy across the board simply because the stakes are greater; lack of access to patient records or financial information could have severe repercussions in terms of safety and public trust, which are far beyond inconvenience or frustration.

Brown called out organizations that are “born in the cloud” as being particularly vulnerable, while Azevedo said he saw more pressure put on “always-on” industries such as e-commerce. Industries that are more highly regulated, such as finance, may also need to contend with greater expectations when it comes to resilience and redundancy. The EU recently passed DORA (the Digital Operational Resilience Act) to ensure that financial entities can “withstand, respond to, and recover” from technology disruptions.

One Provider, but Diversified Dependencies

In the wake of the AWS outage, critics were quick to call for a diversification of internet partners, preaching the need for stronger and more numerous competitors to AWS. And as part of their redundancy strategies, CIOs will need to investigate how reliant they are on specific providers, so they can determine their risk in the event of an outage. 

But this isn’t as simple as tracing third-party contracts, counting how often one name appears, and shifting some operations away from too-dominant providers. If an organization has partnered predominantly with one provider, it’s probably for good reason. As Hutchins explained, working with a single provider can accelerate innovation and simplify management, offering visibility, native integrations and unified tooling.

“The benefit is efficiency; the risk is dependency,” he said.

He added that he has no issue with CIOs continuing with single-provider strategies — as long as they govern them “with eyes wide open.” In practice, this may involve building portability into data, establishing exit and failover plans, and testing recovery outside the ecosystem.

Brown argued that the outage isn’t really a comment on the issue of the single provider in the first place; if organizations had built redundancy into their single-provider ecosystems, they could have avoided most of this disruption. This is because a single provider doesn’t need to equate to a single dependency. By utilizing different regions and availability zones, CIOs can spread their risk. After all, the AWS outage affected only US-EAST-1. Brown said he believes that this approach delivers 99% of the resilience benefits, while also being significantly more practical and cost-effective than a multi-provider strategy.

“Cross-provider failover sounds great on paper, but introduces substantial complexity,” he said. “The key is architecting for failure within your chosen ecosystem.”
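The arithmetic behind spreading risk across regions can be sketched as follows. It rests on a strong assumption that regional failures are independent, which real outages often violate because regions can share control planes and global services. The 99.9% per-region figure is hypothetical and is not an AWS number, and the sketch is not meant to verify Brown's estimate.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(per_region: float, regions: int) -> float:
    """Availability of a service that stays up while at least one region
    is up, assuming regions fail independently of each other."""
    return 1.0 - (1.0 - per_region) ** regions


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical per-region availability of 99.9%.
for n in (1, 2, 3):
    a = combined_availability(0.999, n)
    print(n, f"{a:.9f}", f"{expected_downtime_hours(a):.4f} h/year")
```

Under these assumptions a second region cuts expected downtime from several hours a year to well under a minute, while a third region adds comparatively little. Most of the gain comes from the first failover target, provided the application can actually switch to it.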


