What are hyperscalers hiding from you? On the recent wave of internet service outages.

In just the last month we saw two global internet outages, one at AWS (the us-east-1 region) and one at Cloudflare (a global network outage), causing availability issues for many popular services, e.g. X, ChatGPT, Spotify and even MC terminals. While the world moves forward with AI adoption, this is not a coincidence; it's a consequence.

Postmortems were published here for the Cloudflare issue and here for the AWS one.

Infrastructure becomes more complex day by day, and the human brain cannot keep up with this amount of change. On the other hand, increased automation requires highly qualified people to deal with it. AI can help, but it also has limits, especially for complex, unique cases.

The more complex the infrastructure gets, the longer an outage can last. Let's look at the explanations given by the vendors themselves to get more insight into these issues.

Amazon (AWS team’s) explanation

The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service’s regional endpoint (dynamodb.us-east-1.amazonaws.com) that the automation failed to repair. 
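
To make the failure mode concrete, here is a minimal sketch of a check-then-act race between two automation workers – the same shape of bug on a much smaller scale. The names and the in-memory "DNS table" are hypothetical; this is an illustration, not AWS's actual DNS automation.

```python
import threading
import time

# Toy in-memory "DNS table": endpoint name -> list of IPs.
# Hypothetical illustration of a check-then-act race, not AWS's real system.
dns_table = {"service.example.com": ["10.0.0.1"]}

def cleaner():
    # Worker A: wants to retire what it believes is the current (old) record set.
    record = dns_table.get("service.example.com")   # 1. check (read)
    time.sleep(0.1)                                  # latency between check and act
    if record == ["10.0.0.1"]:                       # decision based on a stale read
        dns_table["service.example.com"] = []        # 2. act: wipes the NEW record too

def updater():
    # Worker B: publishes a fresh record set in the meantime.
    time.sleep(0.05)
    dns_table["service.example.com"] = ["10.0.0.2"]

a = threading.Thread(target=cleaner)
b = threading.Thread(target=updater)
a.start(); b.start(); a.join(); b.join()

print(dns_table)  # {'service.example.com': []} – an empty record, the endpoint no longer resolves
```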

Cloudflare's explanation

The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems’ permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.

The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
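
The failure shape here is a hard limit that was sized for yesterday's data. Below is a minimal sketch of that pattern, with hypothetical names and a made-up limit; it is not Cloudflare's actual code (which is Rust), just the same class of bug in miniature.

```python
# Illustration only: a consumer with a hard-coded limit on a propagated
# "feature file". Hypothetical names, not Cloudflare's implementation.
MAX_FEATURES = 200  # assumption: a limit sized for the "normal" file

def load_feature_file(lines: list[str]) -> list[str]:
    features = [line.strip() for line in lines if line.strip()]
    if len(features) > MAX_FEATURES:
        # In production this kind of check often surfaces as a crash in the
        # hot path rather than a graceful fallback to the last good file.
        raise RuntimeError(f"feature file too large: {len(features)} > {MAX_FEATURES}")
    return features

normal_file = [f"feature_{i}" for i in range(180)]
load_feature_file(normal_file)   # works fine, year after year

# An upstream change starts emitting duplicate rows, and the file doubles...
doubled_file = normal_file * 2
try:
    load_feature_file(doubled_file)
except RuntimeError as exc:
    print("every machine that loads this file now fails:", exc)
```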

What lessons can we learn from these statements?

  1. A large amount of automation that is hard to control and reason about makes the system harder to maintain.
  2. Overlooked edge cases can lead to failures. They are almost impossible to avoid entirely and require serious testing effort.
  3. Understaffing (after massive layoffs) can contribute to issues, as there are fewer eyes watching that the systems run properly.
  4. Why did the failover mechanism fail? Isn't it supposed to recover the system in seconds, especially for critical services? That's the thing: the failover itself can cause these problems.
  5. What about the so-called five-nines availability heavily marketed by AWS, where the system should be available 99.999% of the time? It seems that was just a PR exercise.
  6. Complexity is so high that no one knows the system end-to-end, and the knowledge gaps keep widening at an exponential rate.
  7. Security hardening can also introduce errors (the Cloudflare case). Again, it requires thorough testing.
  8. Race conditions. The same race condition that plays out in microseconds on a single device can unfold very slowly due to latency in a distributed system (the AWS case).
  9. Cascading failures can occur when we have multiple interdependent services (the AWS case).
  10. Latency between checking the state and acting on it (the AWS case), as illustrated in the race sketch above.
  11. The tech debt backlog keeps growing (a common problem everywhere in IT). Check your tech debt – that's your attack surface.
  12. Validation concerns. How do you even know that the configuration or change you apply is valid if some bot applies it automatically? The answer is: you won't. See the validation sketch after this list.
  13. Rigid systems are built around one central service and demand constant configuration updates.
  14. And I'm not even talking about vendor lock-in.
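
For the validation point (item 12), here is a minimal sketch of the kind of guardrail that helps: check an automatically generated change against an explicit schema and size budget before a bot is allowed to apply it. The schema, limits and function names are my assumptions, not any vendor's real tooling.

```python
# Hypothetical pre-apply validation for bot-generated config changes:
# reject obviously broken changes before they propagate.
EXPECTED_KEYS = {"endpoint", "records", "ttl"}
MAX_RECORDS = 100  # assumed size budget

def validate_change(change: dict) -> list[str]:
    errors = []
    missing = EXPECTED_KEYS - change.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    records = change.get("records", [])
    if not records:
        errors.append("record set is empty")                  # the AWS-style failure shape
    if len(records) > MAX_RECORDS:
        errors.append(f"too many records: {len(records)}")    # the Cloudflare-style failure shape
    ttl = change.get("ttl")
    if not isinstance(ttl, int) or ttl <= 0:
        errors.append("ttl must be a positive integer")
    return errors

change = {"endpoint": "dynamodb.us-east-1.amazonaws.com", "records": [], "ttl": 60}
problems = validate_change(change)
if problems:
    print("refusing to apply change automatically:", problems)  # hand off to a human / safer path
```

It will not catch every edge case, but it at least moves the failure from every machine in the fleet to the single pipeline that generated the change.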

If you are lucky, you will get all the logs and traces after an issue and can evaluate the whole picture of the failure. But what if it's a hardware failure and some data is completely lost (not rare, I've seen it many times), or the logs are too obscure and misleading – then what? You may never get to the postmortem stage, or worse, draw the wrong conclusions.

Talking about conclusions…

Conclusion

Wasn't the internet supposed to be distributed and fault tolerant? What can be done? P2P communication, blockchains, a git-like internet? Or should we just pray that it never happens again?

The thing is, the internet is monopolised by a few hyperscalers now. The technology arms race has come to a point where it's hard to pass down the accumulated knowledge to the new generation of techies, creating a knowledge gap (cough… tech debt). AI won't help you here because it's not trained for this; every outage is unique.

The smarter you are, the more complex the systems you build, and the harder they are to maintain. There is no way out. The more complex systems become, the more prone they are to errors and malicious attacks. We will see more global issues like this, especially with hyperscalers like AWS, GCP, Cloudflare and Azure.

So far, the only solution seems to be independent, distributed internet services and cloud infrastructure. Not happening any time soon. Amen.

Cheers,

-VR

