On Monday, October 20 (U.S. local time), Amazon's AWS cloud service suffered a major failure that triggered global chaos: popular online services became unavailable, flights were delayed, and banking services were paralyzed. What happened?

The culprit

Amazon Web Services provides the tools and computing resources that keep about one-third of the Internet running. It offers storage space and database management so that enterprises do not need to maintain their own expensive infrastructure, and it also directs user traffic to these platforms.

AWS's sales pitch can be summarized as: "Let us manage your enterprise's computing needs."

But on Monday, something very mundane went terribly wrong: a Domain Name System (DNS) error, a common kind of glitch.

People in the tech industry will hardly be surprised to hear this: even such a common error can cause enormous disruption.

"It's always a DNS problem!" is a common saying in the industry.


Affected services

When someone clicks on an app or link, their device essentially sends a request to connect to the service. DNS is supposed to function as a map, but AWS lost its way on Monday. Platforms like Snapchat, Canva and HMRC were still there, but AWS could not see where they were and could not direct traffic to them.
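
The "map" idea can be illustrated with a toy model. This is only a sketch: the hostname, address and function name below are invented for illustration, and real DNS involves resolvers, caches and name servers rather than a single lookup table.

```python
# Toy model of DNS as a map. Every name and address here is made up.
servers = {"203.0.113.10": "photo-app backend (running)"}   # the service itself
dns_map = {"photo-app.example": "203.0.113.10"}              # name -> address


def connect(hostname, dns):
    """Look the name up in the map, then reach the server at that address."""
    ip = dns.get(hostname)
    if ip is None:
        # The server may be perfectly healthy; we just cannot find it.
        return f"cannot find {hostname}: no map entry"
    return f"connected to {servers[ip]}"


print(connect("photo-app.example", dns_map))  # map intact
print(connect("photo-app.example", {}))       # map lost, server untouched
```

In the second call the `servers` table is unchanged; only the map is missing, which mirrors the situation described above.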

Cause of failure

These failures can occur for a variety of reasons. Usually the cause is a maintenance issue or a server failure. Sometimes it is human error, such as a misconfiguration somewhere, and in extreme cases it could be a cyberattack, although there is currently no evidence that this incident was the result of an attack.

AWS said the failure occurred in US-EAST-1, the company's massive data center region in northern Virginia and its oldest and largest data center cluster.

US-EAST-1 is one of AWS's busiest regions, hosting many global applications and websites. The core of the problem lay in "DynamoDB API DNS resolution", which means that systems could not correctly find the network address of a key database service called DynamoDB. When resolution fails, applications that rely on the database cannot reach their data, triggering a chain reaction that can result in service interruptions or serious errors. This also explains why users were unable to access related services even though the underlying servers may still have been running normally.
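
The difference between "the name cannot be resolved" and "the server is down" can be made concrete with a short diagnostic. The following is a minimal sketch using Python's standard `socket` module; the function name `diagnose` and its parameters are assumptions made for illustration, and it says nothing about how AWS's internal systems actually work.

```python
import socket


def diagnose(host, port=443, resolve=socket.getaddrinfo,
             connect=socket.create_connection, timeout=3.0):
    """Report whether a failure happens at the DNS step or the connection step.

    `resolve` and `connect` are injectable (an assumption of this sketch)
    so the logic can be exercised without touching the network.
    """
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Name lookup failed: the server may still be running normally.
        return f"DNS failure for {host}: {exc}"

    address = infos[0][4][:2]  # (ip, port) taken from the first result
    try:
        conn = connect(address, timeout=timeout)
    except OSError as exc:
        # The name resolved, so the problem lies with the server or the path.
        return f"resolved to {address[0]} but connection failed: {exc}"
    conn.close()
    return f"reachable at {address[0]}"
```

Calling `diagnose("dynamodb.us-east-1.amazonaws.com")` would perform a real lookup against the region's public DynamoDB endpoint. A "DNS failure" result corresponds to the situation described above, in which the database servers themselves may be healthy but cannot be found.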


DNS

When DNS resolution is interrupted, the user's browser cannot locate the desired content, no matter how powerful the website or service's backend infrastructure is. This makes DNS a crucial but extremely sensitive link in network architecture. Any interference with DNS can cause large-scale network outages, affecting anything from a single website to an entire region's internet services. Amazon is currently working hard to fix this fundamental problem, but some services may still have "major errors" after the problem is resolved, and it will take time to return to normal.

This is also at least the third time in the past five years that Amazon's US-EAST-1 region has caused a large-scale Internet outage. Amazon did not explain why the region has had repeated problems.

Dependence on a single company

Many experts agree that Monday's incident is a perfect illustration of the risks of relying entirely on a single service provider for your business. As an industry giant, AWS carries the operational lifeline of millions of companies. The experts' views are certainly correct, but the problem is that there are very few service providers that can reach the same scale as AWS.
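
What avoiding a single point of dependence looks like in practice can be sketched in a few lines. This is a simplified illustration only: the provider names and the `fetch` callback are hypothetical, and real multi-provider setups must also deal with data replication, consistency and cost, which this sketch ignores.

```python
def fetch_with_fallback(providers, fetch):
    """Try each provider in order and return the first successful result.

    `providers` is an ordered list of hypothetical provider names and
    `fetch` is a caller-supplied function that raises OSError on failure.
    """
    errors = []
    for name in providers:
        try:
            return name, fetch(name)
        except OSError as exc:
            # Remember the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

With only one entry in `providers`, a single outage is enough to reach the final `RuntimeError`, which is the risk the experts describe.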

Experts and academics say the issue highlights the highly interconnected nature of everyday digital services and their reliance on a handful of global cloud service providers. A small failure can have a huge impact on business operations and daily life.

"This outage once again highlights our reliance on relatively fragile infrastructure," said Jake Moore, global cybersecurity consultant at European cybersecurity company ESET.

In the UK, Lloyds Bank and Bank of Scotland, as well as telecom service providers Vodafone and BT, were affected, according to the UK website of outage tracker Downdetector. The website of HM Revenue and Customs was not immune either.

"The main reason for the problem is that all these big companies rely on the same service provider," said Nishanth Sastry, director of research at the University of Surrey's computer science department.

Ookla, the company that owns Downdetector, said the incident resulted in more than 4 million users reporting service issues.

"For large enterprises, a few hours of cloud outage can mean millions of dollars in lost productivity and revenue," said Ryan Griffin, head of U.S. cyber practice at insurance brokerage McGill and Partners.

However, Wall Street's reaction was muted: Amazon's stock price rose rather than fell, gaining 1.6% to $216.48.