This post is also available in Deutsch, Français, 简体中文, 繁體中文, 日本語, 한국어, Español and ไทย.

Today, June 21, 2022, Cloudflare suffered an outage that affected traffic in 19 of our data centers. Unfortunately, these 19 locations handle a significant proportion of our global traffic. This outage was caused by a change that was part of a long-running project to increase resilience in our busiest locations. A change to the network configuration in those locations caused an outage which started at 06:27 UTC. At 06:58 UTC the first data center was brought back online and by 07:42 UTC all data centers were online and working correctly.

Depending on your location in the world you may have been unable to access websites and services that rely on Cloudflare. In other locations, Cloudflare continued to operate normally. This was our error and not the result of an attack or malicious activity.

Over the last 18 months, Cloudflare has been working to convert all of our busiest locations to a more flexible and resilient architecture. In this time, we’ve converted 19 of our data centers to this architecture, internally called Multi-Colo PoP (MCP): Amsterdam, Atlanta, Ashburn, Chicago, Frankfurt, London, Los Angeles, Madrid, Manchester, Miami, Milan, Mumbai, Newark, Osaka, São Paulo, San Jose, Singapore, Sydney, Tokyo.

A critical part of this new architecture, which is designed as a Clos network, is an added layer of routing that creates a mesh of connections. This mesh allows us to easily disable and enable parts of the internal network in a data center for maintenance or to deal with a problem. This layer is represented by the spines in the following diagram.

[Figure: diagram of the Clos network architecture, with the added routing layer shown as spines]

This new architecture has provided us with significant reliability improvements, as well as allowing us to run maintenance in these locations without disrupting customer traffic. As these locations also carry a significant proportion of the Cloudflare traffic, any problem here can have a very wide impact, and unfortunately, that’s what happened today.

In order to be reachable on the Internet, networks like Cloudflare make use of a protocol called BGP. As part of this protocol, operators define policies which decide which prefixes (a collection of adjacent IP addresses) are advertised to peers (the other networks they connect to), or accepted from peers.

These policies have individual components, which are evaluated sequentially. The end result is that any given prefix will either be advertised or not advertised. A change in policy can mean a previously advertised prefix is no longer advertised, known as being "withdrawn", and those IP addresses will no longer be reachable on the Internet.

While deploying a change to our prefix advertisement policies, a re-ordering of terms caused us to withdraw a critical subset of prefixes.

Due to this withdrawal, Cloudflare engineers experienced added difficulty in reaching the affected locations to revert the problematic change. We have backup procedures for handling such an event and used them to take control of the affected locations.

03:56 UTC: We deploy the change to our first location. None of our locations are impacted by the change, as these are using our older architecture.

06:17: The change is deployed to our busiest locations, but not the locations with the MCP architecture.

06:27: The rollout reached the MCP-enabled locations, and the change is deployed to our spines. This is when the incident started, as this swiftly took these 19 locations offline.

06:32: Internal Cloudflare incident declared.

06:51: First change made on a router to verify the root cause.

06:58: Root cause found and understood. Work begins to revert the problematic change.

07:42: The last of the reverts has been completed. This was delayed as network engineers walked over each other's changes, reverting the previous reverts, causing the problem to re-appear sporadically.

The criticality of these data centers can clearly be seen in the volume of successful HTTP requests we handled globally:

[Figure: volume of successful HTTP requests handled globally]

Even though these locations are only 4% of our total network, the outage impacted 50% of total requests.
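
To illustrate why the order of policy terms matters, here is a minimal Python sketch of sequential, first-match policy evaluation. It is a simplified model and not our actual router configuration: the `Term` class, the term names, the default of not advertising when nothing matches, and the prefixes (taken from the ranges reserved for documentation) are all assumptions made for the example.

```python
from dataclasses import dataclass
from ipaddress import IPv4Network, ip_network
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class Term:
    """One policy component (hypothetical model): a match test and an action."""
    name: str
    matches: Callable[[IPv4Network], bool]
    action: str  # "accept" (advertise) or "reject" (do not advertise)


def evaluate(policy: List[Term], prefix: IPv4Network) -> bool:
    """Walk the terms in order; the first term that matches decides."""
    for term in policy:
        if term.matches(prefix):
            return term.action == "accept"
    return False  # assumption: a prefix matched by no term is not advertised


def advertised(policy: List[Term], prefixes: Iterable[IPv4Network]) -> Set[IPv4Network]:
    """Return the subset of prefixes that the policy advertises."""
    return {p for p in prefixes if evaluate(policy, p)}


# Hypothetical terms for illustration only.
accept_critical = Term(
    "accept-critical", lambda p: p.subnet_of(ip_network("192.0.2.0/24")), "accept"
)
accept_public = Term(
    "accept-public", lambda p: p.subnet_of(ip_network("198.51.100.0/24")), "accept"
)
reject_rest = Term("reject-rest", lambda p: True, "reject")

prefixes = [
    ip_network("192.0.2.0/25"),
    ip_network("198.51.100.0/24"),
    ip_network("203.0.113.0/24"),
]

# Same terms, different order: the catch-all now runs before accept_critical.
original = [accept_critical, accept_public, reject_rest]
reordered = [accept_public, reject_rest, accept_critical]

before = advertised(original, prefixes)
after = advertised(reordered, prefixes)
withdrawn = before - after

print(sorted(str(p) for p in withdrawn))  # prints ['192.0.2.0/25']
```

In this model no term is added or removed. Moving the catch-all reject ahead of a more specific accept is enough to turn a previously advertised prefix into a withdrawn one, which is the general kind of failure a re-ordering of terms can produce.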