In my previous posts, I covered what it takes to build a SaaS Control Plane: [Capabilities][Billing][Provisioning][Scaling][High Availability][Monitoring]. In this post, we will cover some of the challenges of operating a SaaS offering.
I’ve likely performed thousands of operational tasks during my career, but one still stands out clearly more than a decade after I performed it. It was a major version upgrade of a fairly large Cassandra cluster while I was working at Signal, with between 8 and 64 nodes in each of 4 separate geographic regions in AWS: two in the US, one in Europe, and one in Japan. In total, I think the cluster had about 160 nodes.
The upgrade included a data format change, which meant each node had to go through a lengthy step-by-step process: drain, clean shutdown, data migration, software upgrade, startup, and re-sync. Only after a node passed its health checks could I move on to the next node that shared the same token space. Nodes that didn’t share the same token space could in theory be upgraded at the same time, but we had to keep a specific percentage of nodes working in each region at all times to maintain operational functionality.
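To make the two constraints concrete, here is a minimal sketch of how upgrade batches could be planned. It is not the tooling I used at the time. The `Node` fields, the `token_group` label (a stand-in for "shares the same token space"), and the `min_up_percent` parameter are all assumptions made for illustration, and the greedy grouping is just one simple way to satisfy both rules.

```python
from collections import Counter
from dataclasses import dataclass

# Per-node steps, in the order described above.
UPGRADE_STEPS = ["drain", "clean shutdown", "data migration",
                 "software upgrade", "startup", "re-sync", "health check"]

@dataclass(frozen=True)
class Node:
    name: str
    region: str
    token_group: int  # hypothetical label: nodes sharing token space get the same value

def plan_batches(nodes, min_up_percent):
    """Greedily group nodes into batches that can be upgraded in parallel.

    Within a batch, no two nodes share a token group, and each region
    keeps at least min_up_percent of its nodes running.
    """
    sizes = Counter(n.region for n in nodes)
    max_down = {}
    for region, size in sizes.items():
        required_up = -(-size * min_up_percent // 100)  # ceiling division
        max_down[region] = size - required_up
        if max_down[region] < 1:
            raise ValueError(f"{region}: no node can be taken down at {min_up_percent}%")

    pending, batches = list(nodes), []
    while pending:
        batch, deferred = [], []
        groups_in_batch, down = set(), Counter()
        for node in pending:
            conflict = node.token_group in groups_in_batch
            region_full = down[node.region] >= max_down[node.region]
            if conflict or region_full:
                deferred.append(node)  # wait for a later batch
            else:
                batch.append(node)
                groups_in_batch.add(node.token_group)
                down[node.region] += 1
        batches.append(batch)
        pending = deferred
    return batches
```

Each batch would then run `UPGRADE_STEPS` on its nodes, and the next batch would start only once every node in the current one has passed its health checks. In practice the fraction of nodes that can be down at once, and what counts as sharing token space, depend on the replication settings, so treat the numbers here as placeholders.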