Posts tagged as

2 posts

Cloud Platform Monitoring and Auto-Recovery Challenges - Part 2

The Complications and Strategies

In the first post of this two-part series, we introduced the primary topics under the umbrella of cloud platform monitoring and went into some detail about the specific challenges each one presents. In this follow-up post, we’ll explore some state-of-the-art strategies for dealing with these issues, along with the additional complications that arise when using these techniques.

Cloud Platform Monitoring and Auto-Recovery Challenges - Part 1

Introduction to Cloud Monitoring

Most people who work in platform engineering and cloud infrastructure know that you need to design both your applications and your underlying platform for high availability and fault tolerance, but resiliency spans a wide range, from “relatively reliable” to “bulletproof”. The common adage goes something like this: for each “additional 9” of reliability, you’ll need to spend exponentially more effort and money to achieve it.

Why is this? And what goes into these additional levels?
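One way to make the adage concrete is to translate each “9” into a yearly downtime budget. The short Python sketch below does that arithmetic. It assumes a 365-day year and ignores leap years and planned-maintenance exclusions, and the function name `allowed_downtime_minutes` is our own illustration rather than part of any tool.

```python
# Yearly downtime budget for an availability target expressed as a number of nines.
# Assumption: a 365-day year; leap years and maintenance windows are ignored.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(nines: int) -> float:
    """Return the downtime allowed per year, in minutes, for `nines` nines (3 -> 99.9%)."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001 of the year may be down
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        availability = 100 * (1 - 10 ** -n)
        # Show as many decimal places as the target has nines after the "99".
        print(f"{availability:.{max(n - 2, 0)}f}% -> {allowed_downtime_minutes(n):.2f} min/year")
```

Each additional nine shrinks the budget by a factor of ten, from 525.6 minutes per year at 99.9% to 52.56 minutes at 99.99%. At that scale a single slow manual recovery can consume the whole year’s allowance, which is part of why automated detection and recovery become necessary as targets rise.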