SaaS Infrastructure: Why Terraform Isn't Enough

We think Terraform is an excellent IaC tool for setting up cloud infrastructure. It gives you a programmatic way to build infrastructure that can be replicated across regions. However, when it comes to SaaS, we feel it falls short on several fronts:

  • SaaS deployments: If you need to deploy infrastructure per tenant, you have to manage many state files, which is a big challenge.
  • Deploying across accounts: There is no native support for managing deployments across different accounts.
  • No Day-2 infra support: Terraform is limited to provisioning infrastructure. It leaves the bulk of operating that infrastructure (patching, monitoring, alerting, capacity planning, failure handling, and evolution) to the SaaS provider.
  • Multi-cloud support: Terraform requires you to create and maintain scripts for every cloud provider by hand. Every time anything changes, you have to update all the Terraform scripts, run them across thousands of state files, and handle and fix any issues, all manually. At scale, this becomes unmanageable. In our previous experience, we gave up on Terraform within a few months once we realized that it doesn't work at scale for the SaaS use case.
  • Not ACID compliant:
    • Not Atomic: If an error occurs during the apply phase, Terraform does not automatically roll back to the previous state. This may leave the infrastructure partially provisioned, and it is left to DevOps to perform manual recovery. For SaaS with thousands or millions of tenants, this can be quite challenging.
    • Not Consistent: Because it lacks basic recovery mechanisms, Terraform can leave the underlying infrastructure in an inconsistent state.
    • Not Isolated: There is no built-in mechanism to run Terraform commands concurrently on the same state file. If two team members try to apply changes at the same time, they may face conflicts or undesirable outcomes. To avoid this, you need to set up state locking (for example, through a backend that supports it) or follow certain operational practices.
    • Not Durable: By default, state files are kept locally, and you need an explicit mechanism to store them durably for each tenant.
  • Drift detection: Terraform struggles with drift detection, that is, working out whether the actual state of resources has changed outside of Terraform since the last terraform apply. Terraform can refresh its state file before making changes to help mitigate this, but unexpected changes can still cause problems.
  • Manual: Operating Terraform requires learning a new language, HCL (HashiCorp Configuration Language), and the learning curve for the modular approach needed to manage large deployments is steep.
  • Performance issues: With large environments, Terraform can become slower to plan and apply changes.
  • Error handling: Terraform can be somewhat vague in the errors it produces, making it hard to debug complex scripts.
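
To make the per-tenant burden concrete, here is a minimal Python sketch (not a recommended implementation) of the kind of wrapper a team ends up writing around Terraform. It runs terraform plan -detailed-exitcode in each tenant's working directory and classifies the result. The -detailed-exitcode flag and its exit codes (0 for no changes, 1 for error, 2 for changes present) come from the Terraform CLI; the function names, the tenant-to-directory mapping, and the status labels are assumptions made up for illustration. Exit code 2 covers both drift and configuration changes that have not been applied yet, so treat it as a rough signal only.

```python
import subprocess
from typing import Callable, Dict, List

# Exit codes of `terraform plan -detailed-exitcode`:
# 0 = no changes, 1 = error, 2 = plan succeeded and changes are present.
# The status labels on the right are hypothetical names for this sketch.
PLAN_STATUS = {0: "in_sync", 1: "error", 2: "changes_present"}

# A runner takes CLI arguments and a working directory, returns an exit code.
Runner = Callable[[List[str], str], int]


def terraform_runner(args: List[str], workdir: str) -> int:
    """Invoke the real terraform binary in workdir and return its exit code."""
    proc = subprocess.run(
        ["terraform", *args], cwd=workdir, capture_output=True, text=True
    )
    return proc.returncode


def check_tenants(
    tenant_dirs: Dict[str, str], run: Runner = terraform_runner
) -> Dict[str, str]:
    """Plan each tenant's configuration and map the exit code to a status."""
    report: Dict[str, str] = {}
    for tenant, workdir in tenant_dirs.items():
        code = run(["plan", "-detailed-exitcode", "-input=false"], workdir)
        # Any unexpected exit code is treated as an error.
        report[tenant] = PLAN_STATUS.get(code, "error")
    return report


def needs_attention(report: Dict[str, str]) -> List[str]:
    """Return the tenants whose plan was not clean, sorted by name."""
    return sorted(t for t, status in report.items() if status != "in_sync")


if __name__ == "__main__":
    # Demo with a stub runner so the sketch runs without Terraform installed.
    # Tenant names and paths are placeholders.
    stub_codes = {"/tenants/acme": 0, "/tenants/globex": 2}
    demo = check_tenants(
        {"acme": "/tenants/acme", "globex": "/tenants/globex"},
        run=lambda args, workdir: stub_codes[workdir],
    )
    print(demo)
    print(needs_attention(demo))
```

Even this small loop says nothing about retries, partial applies, locking, or who fixes the tenants it flags. With thousands of tenants, each of those becomes custom tooling that the SaaS provider has to build and operate.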

Omnistrate attempts to address the above IaC gaps for SaaS by providing a fully automated solution. For more details on what we do, please see this page.