Wednesday, March 11, 2026

Amazon is linking site hiccups to AI efforts



Amazon reportedly convened an engineering meeting Tuesday to discuss “a spate of outages” tied to the use of AI tools, according to a report in the Financial Times.

“The online retail giant said there had been a ‘trend of incidents’ in recent months, characterized by a ‘high blast radius’ and ‘gen-AI assisted changes’” according to a briefing note for the mandatory meeting, the FT said. “Under ‘contributing factors,’ the note included ‘novel genAI usage for which best practices and safeguards are not yet fully established.’”

The story quoted Dave Treadwell, a senior vice-president in the Amazon engineering group, as saying in the note that “junior and mid-level engineers will now require more senior engineers to sign off any AI-assisted changes.”

However, Chirag Mehta, principal analyst at Constellation Research, said the senior-engineer sign-off requirement may inadvertently undo the key benefit of the AI strategy: efficiency.

“If every AI-assisted change now needs a senior engineer staring at diffs, the enterprise gives back much of the speed benefit it was chasing in the first place,” Mehta said. “The real fix is to move review upstream and make it machine-enforced: policy checks before deployment, stricter blast-radius controls for high-risk services, mandatory canarying, automatic rollback, and stronger provenance so teams always know which changes were AI-assisted, who approved them, and what production behavior changed afterward.”
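The machine-enforced, upstream checks Mehta describes can be illustrated with a minimal policy-gate sketch. All names and thresholds here are hypothetical (the service tier list, the `Change` fields); it simply shows how provenance and canary requirements can be evaluated automatically before a deployment proceeds, rather than by a senior engineer reading diffs.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Change:
    service: str
    ai_assisted: bool
    author: str
    approver: Optional[str]   # recorded approver, for provenance
    canary_passed: bool

# Hypothetical tier: services treated as high blast radius.
HIGH_BLAST_RADIUS = {"checkout", "identity", "payments"}

def policy_gate(change: Change) -> List[str]:
    """Return policy violations; an empty list means the change may deploy."""
    violations = []
    if change.ai_assisted and change.approver is None:
        violations.append("AI-assisted change lacks a recorded approver")
    if change.service in HIGH_BLAST_RADIUS and not change.canary_passed:
        violations.append("high-blast-radius service requires a passing canary")
    return violations
```

Because the gate runs before deployment and records why a change was blocked, the provenance trail Mehta mentions (which changes were AI-assisted, who approved them) falls out of the same check.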

The requirement for approvals follows several AI-related incidents that took down Amazon and AWS services, including a nearly six-hour Amazon site outage earlier this month, and a 13-hour interruption of an AWS service in December.

Glitches inevitable

Analysts and consultants said it is hardly surprising that enterprises such as Amazon are discovering that non-deterministic systems deployed at scale will create embarrassing problems. A human in the loop is a fine approach, but there have to be enough humans to reasonably handle the massive scope of the deployment. In healthcare, for example, telling one person to approve 20,000 test results during an eight-hour shift is not putting meaningful controls in place. It is instead setting up that person to take the blame for the inevitable test errors.

Acceligence CIO Yuri Goryunov stressed that glitches like these were inevitable.

“To me, these are normal growing pains and natural next steps as we’re introducing a newish technology into our established workflows. The benefits to productivity and quality are immediate and impressive,” Goryunov said. “Yet there are absolutely unknown quirks that need to be researched, understood and remediated. As long as productivity gains exceed the required remediation and validation work within the agreed upon parameters, we’ll be OK. If not, we’ll have to revert to legacy methods for that particular application.”

‘Reckless’ strategy

However, Nader Henein, a Gartner VP analyst, said that he expects the problem to get worse. 

“These kinds of incidents will continue to happen with more frequency. The fact is that most organizations think they can drop in AI-assisted capabilities in the same way that they can drop in a new employee, without changing the surrounding structure,” Henein said. “When we hand an AI system a task and a rulebook, we might think we’ve got things locked down. But the truth is, AI will do whatever it takes to achieve its goal within those rules, even if it means finding creative and sometimes alarming loopholes.

“It’s not that AI is malicious. It’s just that it doesn’t care. It doesn’t have the boundaries, the empathy, or the gut check that most people develop over time.”

In view of this, said Flavio Villanustre, CISO for the LexisNexis Risk Solutions Group, the typical enterprise AI strategy is “reckless.”

“You could consider the AI system as some sort of genius child with little and unpredictable sense for safety, and you give it access to do something that could cause significant harm on the promise of performance increase and/or cost reduction. This is close to the definition of recklessness,” Villanustre said.

“As a minimum, if you did this in a traditional manner, you would try this in a test environment independently, verify the results, and then migrate the actions to the production environment,” he noted. “Even though adding a human in the loop can slow things down and somewhat decrease the benefits of using AI, it is the correct way to apply this technology today.”

Other practical tactics

However, the human in the loop isn’t a complete solution. There are other practical tactics that help minimize AI exposure, said cybersecurity consultant Brian Levine, executive director of FormerGov.

“Traditional QA processes were never designed for systems that can generate novel errors no human has ever seen before. That’s why simply adding more human oversight doesn’t solve the problem. It just slows everything down while the underlying risk remains,” Levine said. “AI introduces a new category of failure: unknown‑unknowns at machine speed. These aren’t bugs in the traditional sense. They are emergent behaviors. You can’t patch your way out of that.”

Even worse, Levine argued, is that these bugs beget far more bugs.

“AI doesn’t just make mistakes. It makes mistakes that propagate instantly. Enterprises need a separate deployment pipeline for AI‑assisted changes, with stricter gating and automated rollback triggers,” he said. “If AI can write code, your systems need the equivalent of financial‑market circuit breakers to stop cascading failures. This means automated anomaly detection that halts deployments before customers feel the impact.”
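Levine’s “circuit breaker” analogy maps onto a simple rollout monitor: watch the error rate during a deployment and trip a rollback before the failure cascades. This is a minimal sketch with hypothetical thresholds, not any particular vendor’s implementation; real systems would feed it live telemetry and wire `should_rollback` to an automated rollback pipeline.

```python
class DeploymentCircuitBreaker:
    """Trip a rollback when a rollout's error rate exceeds a threshold."""

    def __init__(self, error_rate_threshold: float = 0.05, min_requests: int = 100):
        # min_requests avoids tripping on noise from the first few calls
        self.error_rate_threshold = error_rate_threshold
        self.min_requests = min_requests
        self.requests = 0
        self.errors = 0
        self.tripped = False

    def record(self, is_error: bool) -> None:
        """Feed one request outcome observed during the rollout."""
        self.requests += 1
        self.errors += int(is_error)
        if (self.requests >= self.min_requests
                and self.errors / self.requests > self.error_rate_threshold):
            self.tripped = True  # halt the deployment; trigger rollback

    def should_rollback(self) -> bool:
        return self.tripped
```

The design choice mirrors financial-market breakers: the trigger is mechanical and pre-agreed, so no human has to notice the anomaly at machine speed before customers feel the impact.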

He noted that the goal isn’t to watch AI more closely; it’s to give it “fewer ways to break things.” Techniques such as sandboxing, capability throttling, and guardrail‑first design are far more effective than trying to manually review every change.
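Capability throttling, one of the techniques Levine names, can be sketched as an allowlist plus a rate limit wrapped around the actions an AI-assisted tool may take. The action names and limits here are illustrative assumptions; the point is that the constraint is structural, so the tool has fewer ways to break things regardless of what it attempts.

```python
import time
from typing import Iterable, Optional

class CapabilityThrottle:
    """Allowlist plus per-minute rate limit for an AI tool's actions."""

    def __init__(self, allowed_actions: Iterable[str], max_per_minute: int):
        self.allowed = set(allowed_actions)
        self.max_per_minute = max_per_minute
        self.timestamps = []  # times of recently permitted actions

    def permit(self, action: str, now: Optional[float] = None) -> bool:
        """Return True only if the action is allowed and under the rate cap."""
        now = time.time() if now is None else now
        if action not in self.allowed:
            return False  # capability not granted at all
        # keep only actions from the last 60 seconds
        self.timestamps = [t for t in self.timestamps if now - t < 60]
        if len(self.timestamps) >= self.max_per_minute:
            return False  # throttled: too many actions this minute
        self.timestamps.append(now)
        return True
```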

Levine added: “AI can accelerate development, but your core infrastructure should always have a human‑authored fallback. This ensures resilience when AI‑generated changes behave unpredictably.”

Need a separate operating model

Manish Jain, a principal research director at Info-Tech Research Group, agreed. The Amazon situation is not as much evidence that AI makes more mistakes as it is evidence that AI now operates at a scale where even small errors can have “a massive blast radius” and may pose “an existential threat” to the organization.

“The danger isn’t that AI may make mistakes,” he said. “The danger is that it compresses the time humans have to intervene and correct a disastrous trajectory. With the advent of agentic AI, time‑to‑market has dropped exponentially. Governance, however, has not evolved to contain the risks created by this pace of technological acceleration.”

Jain stressed, however, that adding people into the mix is not, on its own, a fix. It has to be done reasonably, which means making an honest estimate of how much one human can meaningfully oversee.

“Putting a human in the loop sounds prudent, but it is not a panacea,” Jain said. “At scale, the loop soon spins faster than the human. Human in the loop cannot be the hammer for every agentic AI nail. It must be complemented by human‑over‑the‑loop controls, informed by factors such as autonomy, impact radius and irreversibility.”
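Jain’s routing criteria — autonomy, impact radius, irreversibility — can be made concrete as a simple tiering function. The 1–5 scores and cutoffs below are hypothetical illustrations, not Info-Tech’s model; the sketch just shows how only the riskiest actions get routed to per-change human approval while the rest fall to monitoring or guardrails.

```python
def review_mode(autonomy: int, impact_radius: int, irreversible: bool) -> str:
    """Route a proposed AI action to a review tier.

    autonomy and impact_radius are hypothetical 1-5 scores;
    irreversibility adds a fixed penalty.
    """
    score = autonomy + impact_radius + (5 if irreversible else 0)
    if score >= 10:
        return "human-in-the-loop"    # a person approves before execution
    if score >= 6:
        return "human-over-the-loop"  # executes; a person monitors, can halt
    return "automated"                # guardrails only
```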

Mehta added, “AI changes the shape of operational risk, not just the amount of it. These systems can produce code or change instructions that look plausible, pass superficial review, and still introduce unsafe assumptions in edge cases.

“That means companies need a separate operating model for AI-assisted production changes, especially in checkout, identity, payments, pricing, and other customer-critical paths. Those are exactly the kinds of workflows where the tolerance for experimentation should be extremely low.”
