Posts tagged as

8 posts

Multi-Cloud: How to manage Complexity and Opportunity on day-1

There are several advantages to running services across multiple clouds. Many organizations don't consider this initially due to the added complexity, effort, and management cost. Is there a way to have your cake and eat it too?

Day 2 Operations - SaaS Maintenance

In my previous posts, I have covered what it takes to build a SaaS Control Plane: [Capabilities][Billing][Provisioning][Scaling][High Availability][Monitoring]. In this post, we will cover some of the challenges in operating a SaaS service.

I’ve likely performed thousands of operational tasks during my career, but one sticks out clearly for me over a decade after I performed it. It was a major version upgrade of a decently large Cassandra cluster while I was working at Signal, with between 8 and 64 nodes in each of 4 separate geographic regions in AWS - two in the US, one in Europe, and one in Japan. I think in total the cluster had about 160 nodes.

The upgrade included a data format change, which meant each node had to go through a lengthy step-by-step process: drain, clean shutdown, data migration, software upgrade, startup, and re-sync. Only after a node passed its health checks could I move on to the next node that shared the same token space. Nodes which didn’t share the same token space could in theory be upgraded at the same time, but we had to keep a specific percentage of nodes working in each region at all times to maintain operational functionality.
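The two constraints described above (never upgrade two nodes that share a token space at once, and keep a minimum share of each region up) can be sketched as a small batch planner. This is an illustration only, not the procedure actually used: the `Node` type, the `token_group` label, the `plan_upgrade` helper, and the 50% threshold in the test data are all hypothetical, and the per-node steps (drain, migration, health checks) are left out.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    region: str
    token_group: str  # hypothetical label: nodes with the same label share a token space


def next_batch(pending, region_sizes, min_up_fraction):
    """Pick nodes from `pending` that may be upgraded at the same time."""
    busy_groups = set()                  # token groups already in this batch
    down_per_region = defaultdict(int)   # nodes taken down per region in this batch
    batch = []
    for node in pending:
        size = region_sizes[node.region]
        # Largest number of nodes that may be down while keeping the minimum share up.
        max_down = size - math.ceil(size * min_up_fraction)
        if node.token_group in busy_groups:
            continue  # a node sharing this token space is already being upgraded
        if down_per_region[node.region] >= max_down:
            continue  # this region has no spare capacity left in this batch
        batch.append(node)
        busy_groups.add(node.token_group)
        down_per_region[node.region] += 1
    return batch


def plan_upgrade(nodes, min_up_fraction):
    """Return a list of batches; each batch finishes before the next one starts."""
    region_sizes = defaultdict(int)
    for node in nodes:
        region_sizes[node.region] += 1
    pending = list(nodes)
    batches = []
    while pending:
        batch = next_batch(pending, region_sizes, min_up_fraction)
        if not batch:
            raise ValueError("constraints leave no node that can be upgraded")
        batches.append(batch)
        pending = [n for n in pending if n not in batch]
    return batches
```

In a real rollout each batch would only be considered finished once every node in it had passed its health checks; that gate is assumed here rather than modeled.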

SaaS Capabilities - What Does it Really Entail?

What does it really mean for software to be available as a service? Sure, there are plenty of dry definitions out there, but I think a lot of us like to adhere to the ideology of “I know it when I see it” even if we wouldn’t readily admit it. What I’d like to explore in this post is the makeup of what I personally would consider to be the table-stakes features of a modern SaaS, and spend a bit of extra time going over the areas which you’re probably far more likely to gloss over. If you think I’ve missed anything, be sure to leave your thoughts in the comments below!

Challenges in Building SaaS Billing

By far the most consistent area of focus I’ve had in my career is monitoring and observability. Way back in 2005 while at Orbitz, I had the opportunity to learn from and work alongside Chris Davis, the original author of Graphite. It was an enormous success both at Orbitz and within the industry as a whole, and it’s been an honor and privilege to essentially ride that initial wave of monitoring innovation my whole career.

Six years later, an adjacent area emerged as my second-most common focus: the collection of specific metrics used to quantify end-user usage, which are typically used to generate bills for SaaS companies that have any sort of pay-per-use cost dimensions. Including Omnistrate, I’ve now had direct involvement with the design and development of the metrics systems and/or billing integrations at the past 5 companies I’ve worked at over the past decade, and it’s incredible to me how similar the problems and solutions have been, even when the industry or the technical architecture of each business has been so different.

In today’s post I’d like to explore the internal complexities of SaaS billing systems, the challenges that keep showing up, and the design patterns for addressing them. Hopefully by the end of this you’ll have a better understanding of how SaaS billing works and a blueprint for how to go about implementing it yourself if you need to.

Provisioning and Deployments - Your SaaS Foundation

Before the cloud, getting fresh hardware and deploying your new software was always a big-budget event involving procurement, finance, and several IT teams, and the whole process usually included no small amount of arguing.

Now that we have the ability to just press a button or make an API call to have a trove of shiny, powerful VMs added to our fleet at a moment’s notice, surely all the other problems regarding provisioning have been simplified as well, right?

Everything about Scaling

Scaling - it’s the reason we’re all using this cloud thing anyway, right? Surely all of your applications have been tested to effortlessly scale from 0 to 1,000 in milliseconds, and your databases can rebalance after scaling within minutes with zero impact to anything important, correct?

Don’t worry, I don’t think anyone has fully cracked this nut. But why is this? What makes it so difficult to actually get ALL the benefits of the infinitely flexible cloud?

Cloud Platform Monitoring and Auto-Recovery Challenges - Part 2

The Complications and Strategies

In the first post of this two-part series, we introduced the primary topics under the umbrella of cloud platform monitoring and went into a bit of detail about how they present specific challenges. In this follow-up post we’ll explore some of the state-of-the-art strategies for dealing with these issues and the additional complications that arise when using these techniques.

Cloud Platform Monitoring and Auto-Recovery Challenges - Part 1

Introduction to Cloud Monitoring

Most people who work in platform engineering and cloud infrastructure are aware that you need to design both your applications and your underlying platform for high availability and fault tolerance, but there is a wide range of resiliency from “relatively reliable” to “bulletproof”. The common adage goes something like this: for each “additional 9” of reliability, you’ll need to spend an exponentially greater amount of effort and cost to achieve it.

Why is this? And what goes into these additional levels?
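One way to put numbers on the “additional 9” adage is to look at the downtime budget each level leaves per year. The sketch below is plain arithmetic, not a cost model: it assumes a 365-day year, and the function name `allowed_downtime_minutes` is made up for illustration. It shows only that the budget shrinks tenfold with every extra 9; the effort needed to stay inside that budget is a separate question.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(nines):
    """Yearly downtime budget, in minutes, for an availability with `nines` nines.

    Two nines means 99%, three nines means 99.9%, and so on.
    """
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for nines in range(2, 6):
        minutes = allowed_downtime_minutes(nines)
        print(f"{nines} nines: {minutes:.2f} minutes of downtime per year")
```

At two nines the budget is roughly 3.65 days per year; at five nines it is a little over five minutes, which leaves almost no room for manual recovery.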