Very early on Monday morning, October 20, 2025, Amazon Web Services (AWS) started reporting increased error rates and latency across servers based in the N. Virginia (us-east-1) region. AWS quickly discovered and isolated the cause of the issue: a Domain Name System (DNS) resolution problem.
Unfortunately, the problem snowballed into a global outage that disrupted hundreds of critical Amazon services and took down applications across government websites, gaming networks, financial services, and social media platforms. Apps and services like GoDaddy, Outlook, Signal, Reddit, Snapchat, Apple Music, Office 365, and more have seen major disruptions since this morning.
DNS Explained
DNS operates like a directory for the internet: a phonebook for websites that translates human-readable domain names, like runonflux.com, into the IP addresses that computers use to communicate with one another.
When a user searches for a specific URL, the device they’re browsing from (referred to as the client in this interaction) requests the URL’s matching IP address from a DNS server. The DNS server answers the request by sending the client the IP address for the URL the user searched. The client uses that answer to connect to the right server, the URL loads, and the webpage appears.
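For a concrete picture of that exchange, here is a minimal Python sketch (not tied to any particular provider) that asks the operating system’s resolver for the IP addresses behind a hostname; runonflux.com is simply the example name from above.

    import socket

    def resolve(hostname: str) -> list[str]:
        """Ask the system resolver, which performs the DNS lookup, for IPv4 addresses."""
        results = socket.getaddrinfo(hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM)
        # Each result ends with a sockaddr tuple whose first element is the IP address.
        return sorted({sockaddr[0] for *_, sockaddr in results})

    print(resolve("runonflux.com"))  # the addresses a browser would then connect to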
Root Problem
A DNS resolution failure means that the DNS server isn’t returning IP addresses to clients. Requests go unanswered, so URLs fail to load when users try to reach applications or services.
Amazon’s data centers in the N. Virginia (us-east-1) region are some of the company’s most well-established computing facilities and serve as an infrastructure backbone for many global third-party services.
When the region’s DNS servers started experiencing lookup failures for client requests, devices weren’t able to reliably turn URLs into IP addresses, so webpages, services, and apps failed to load before connections could even be established.
Amazon software development kits (SDKs) that power apps and websites are programmed to automatically resend requests when a DNS lookup fails. Because of this, huge numbers of requests were retried simultaneously, piling on latency and errors and creating backlogs in downstream queues and event processes.
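To see why that amplification happens, here is an illustrative Python sketch of the retry pattern; the timing values are assumptions for illustration, not the actual AWS SDK defaults. Without the backoff and jitter shown below, every client retries immediately and in lockstep, multiplying load on a resolver that is already struggling.

    import random
    import socket
    import time

    def resolve_with_backoff(hostname: str, attempts: int = 5) -> list[str]:
        """Retry a failing DNS lookup with exponential backoff and jitter (illustrative)."""
        for attempt in range(attempts):
            try:
                results = socket.getaddrinfo(hostname, 443, type=socket.SOCK_STREAM)
                return sorted({sockaddr[0] for *_, sockaddr in results})
            except socket.gaierror:  # the address lookup failed
                if attempt == attempts - 1:
                    raise
                # Back off exponentially (0.1s, 0.2s, 0.4s, ...) and add random jitter
                # so thousands of clients don't all retry at the same instant.
                time.sleep(0.1 * (2 ** attempt) + random.uniform(0, 0.1))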
Outage Implications
The Virginia-centric DNS issue escalated into a worldwide problem because us-east-1 is AWS’s oldest and largest operating region; it serves as a global hub for launching new cloud deployments and hosts endpoints that major AWS service features depend on.
Additionally, there is a huge concentration of consumer applications hosted on AWS servers in northern Virginia, creating a significant cloud chokepoint.
It’s too early to tell what the total cost impact of today’s outage will be. However, based on rising outage costs from 2024, initial losses from today’s disruption of apps and services could easily aggregate into the eight-figure range.
Furthermore, secondary impacts, like customer churn and service-level agreement (SLA) penalties, and tertiary impacts, such as interrupted transactions or delayed support tickets, will continue to compound over the coming days and weeks. All in all, long-tail costs could plausibly bring the total losses from this outage into the ballpark of $800 million.
Why this Would Never Happen on FluxCloud
AWS suffers from centralized single points of failure: if one part of its computing network goes down, the impact ripples across the entire ecosystem. We witnessed exactly that today, when an isolated DNS lookup failure in the Virginia region spiraled into a global problem.
FluxCloud, a decentralized, user-powered compute marketplace and network, has no single-region anchor. Workloads are scheduled across a distributed pool of independent operators in different regions.
Rather than relying on a centralized regional DNS server as a critical network and database endpoint, FluxCloud utilizes distributed hostname registries, so if a DNS provider degrades within a specific zone or region, clients can still discover services through an alternate lookup or request path.
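FluxCloud’s actual registry machinery isn’t reproduced here; purely as a sketch of the fallback idea, the Python below (using the third-party dnspython library, with well-known public resolvers standing in for independent registry endpoints) tries each resolver in turn and returns the first answer it gets.

    # Requires the third-party dnspython package (pip install dnspython).
    import dns.resolver

    # Placeholder pool: in a distributed setup, each entry would be an independent
    # registry or DNS provider rather than a single regional endpoint.
    RESOLVER_POOL = ["1.1.1.1", "8.8.8.8", "9.9.9.9"]

    def resolve_with_fallback(hostname: str) -> list[str]:
        """Try each resolver in turn and return the first successful answer."""
        last_error = None
        for nameserver in RESOLVER_POOL:
            resolver = dns.resolver.Resolver(configure=False)
            resolver.nameservers = [nameserver]
            resolver.lifetime = 2.0  # fail fast so the next resolver gets a chance
            try:
                return [record.address for record in resolver.resolve(hostname, "A")]
            except Exception as err:  # timeout, SERVFAIL, NXDOMAIN, ...
                last_error = err
        raise RuntimeError(f"all resolvers failed for {hostname}") from last_error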
Additionally, because FluxCloud compute is provided by independent, cross-region operators, failures are compartmentalized. Distributed compute provision enables regionless scaling and failover: one area’s uptime is not reliant on another’s, so workloads can be placed anywhere and quickly replaced if disruptions occur.
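As a rough illustration of that compartmentalization (the operator names, regions, and scheduling logic below are hypothetical, not FluxCloud’s actual scheduler), a placement routine only needs a pool of healthy operators spread across regions; when one region degrades, the workload is simply rescheduled onto an operator elsewhere.

    import random

    # Hypothetical operator pool; IDs and regions are placeholders for illustration.
    OPERATORS = [
        {"id": "op-eu-1", "region": "eu-central", "healthy": True},
        {"id": "op-us-1", "region": "us-east", "healthy": False},  # degraded region
        {"id": "op-ap-1", "region": "ap-southeast", "healthy": True},
        {"id": "op-sa-1", "region": "sa-east", "healthy": True},
    ]

    def place_workload(workload_id: str, exclude_region: str | None = None) -> dict:
        """Pick a healthy operator, optionally steering clear of a degraded region."""
        candidates = [op for op in OPERATORS
                      if op["healthy"] and op["region"] != exclude_region]
        if not candidates:
            raise RuntimeError("no healthy operators available")
        chosen = random.choice(candidates)
        print(f"{workload_id} scheduled on {chosen['id']} ({chosen['region']})")
        return chosen

    # If the region hosting a workload degrades, reschedule it elsewhere so the
    # failure stays compartmentalized.
    place_workload("web-frontend", exclude_region="us-east")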
With FluxCloud, DNS discovery is distributed, compute is decentralized across independent operators, and failures stay local.
Conclusion
Today’s outage was caused by a regional DNS lookup failure in servers based in the N. Virginia (us-east-1) region. Hundreds of resent client requests created backlogs of retry traffic and connectivity disruptions in the Virginia region, which snowballed into a global outage of hundreds of applications, services, and websites across verticals including finance, IT, government, entertainment, and more.
When the “internet address book” stopped reliably translating service hostnames and URLs into IP addresses, apps never reached a server, SDKs hammered retries, queues backed up, and a region that serves as a hub for cloud apps escalated a local hiccup into a global migraine.
We need to avoid single points of failure and cloud chokepoints caused by centralized networks. When network architecture and data transfer protocols are decoupled from one another, disruptions stay local instead of becoming systemic. At the end of the day, we need Crash Fault Tolerant compute networks: systems that keep functioning even after one of their components goes down or fails.