Cloud Platform Monitoring and Auto-Recovery Challenges - Part 1
Introduction to Cloud Monitoring
Most people who work in platform engineering and cloud infrastructure know that both applications and the underlying platform need to be designed for high availability and fault tolerance, but resiliency spans a wide range, from “relatively reliable” to “bulletproof”. The common adage goes something like this: each “additional 9” of reliability takes exponentially more effort and cost to achieve.
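To put numbers on the “nines”, here is a minimal sketch of the downtime each level allows per year. It assumes a 365-day year and treats “N nines” as an availability of 1 - 10^-N (so 3 nines is 99.9%); the function name is illustrative only.

```python
# Allowed downtime per year for a given number of nines of availability.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def downtime_minutes_per_year(nines: int) -> float:
    """Return the yearly downtime budget, in minutes, for N nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    availability = 100 * (1 - 10 ** -n)
    budget = downtime_minutes_per_year(n)
    print(f"{n} nines ({availability:.3f}%): {budget:.1f} minutes/year")
```

Under these assumptions, each additional nine cuts the downtime budget by a factor of ten: roughly 3.65 days a year at two nines, but only about five minutes a year at five nines.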
Why is this? And what goes into these additional levels?