
Cloud Platform Monitoring and Auto-Recovery Challenges - Part 2

The Complications and Strategies

In the first post of this two-part series, we introduced the primary topics under the umbrella of cloud platform monitoring and went into some detail on the specific challenges they present. In this follow-up post, we’ll explore some of the state-of-the-art strategies for dealing with these issues and the additional complications that arise when using these techniques.

Cloud Platform Monitoring and Auto-Recovery Challenges - Part 1

Introduction to Cloud Monitoring

Most people who work in platform engineering and cloud infrastructure know that both applications and the underlying platform need to be designed for high availability and fault tolerance, but resiliency spans a wide range, from “relatively reliable” to “bulletproof”. The common adage goes something like this: for each “additional 9” of reliability, you’ll need to spend an exponentially greater amount of effort and cost to achieve it.

Why is this? And what goes into these additional levels?
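
To make the “additional 9” concrete, here is a minimal sketch that converts a number of nines into a yearly downtime budget. The function name `allowed_downtime_minutes` and the 365-day year are assumptions chosen for illustration, not something taken from the series itself.

```python
# Assumption: a 365-day year (leap years ignored) for a rough yearly budget.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(nines: int) -> float:
    """Return the yearly downtime budget, in minutes, for N nines of availability.

    Two nines means 99% availability, three nines means 99.9%, and so on.
    The unavailable fraction is 1 / 10**nines.
    """
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    return MINUTES_PER_YEAR / (10 ** nines)


if __name__ == "__main__":
    for n in range(2, 6):
        availability = 100 - 100 / (10 ** n)
        print(f"{n} nines ({availability:.3f}%): "
              f"{allowed_downtime_minutes(n):.2f} minutes of downtime per year")
```

Each extra nine cuts the downtime budget by a factor of ten, which hints at why the remaining failures become progressively harder and more expensive to eliminate.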

A Step-by-Step Guide to Calculate SLAs, SLIs, and SLOs for new SREs

Service Level Agreements (SLAs), Service Level Indicators (SLIs), and Service Level Objectives (SLOs) are the core tools for defining and measuring the performance and reliability of IT services: an SLI is a measured value, an SLO is the target set for that value, and an SLA is the commitment made to customers. Together they provide valuable insight into the quality of service customers receive and help teams identify areas for improvement. In this blog post, we’ll provide a step-by-step guide to calculating SLAs, SLIs, and SLOs for your IT services, using the example of a microservices-based ecommerce application.
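
As a rough illustration of the kind of calculation involved, the sketch below computes a request-based availability SLI and the share of error budget still remaining against an SLO. The function names, the request counts, and the 99.9% target are hypothetical values picked for the example; the guide’s own numbers may differ.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: the fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent (negative if the SLO is breached).

    The error budget is the failure rate the SLO tolerates: 1 - slo.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    budget = 1 - slo      # tolerated failure rate
    consumed = 1 - sli    # observed failure rate
    return 1 - consumed / budget


if __name__ == "__main__":
    # Hypothetical numbers for a checkout service over one measurement window.
    sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
    remaining = error_budget_remaining(sli, slo=0.999)
    print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.1%}")
```

A negative result from `error_budget_remaining` would indicate that the service has already failed more often than the SLO allows for the window.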