Update 24/05/2023: As Lee James Gilmore pointed out on Twitter, the unit of deployment of CDK is the CDK app. A CDK app can contain multiple CloudFormation stacks and they can be changed together (in a single commit) and deployed together. And that’s absolutely fine and compatible with what I’m advocating for in this post. The point is to keep stateful and stateless resources together so that they’re deployed and rolled back together, like a transaction. In the CDK, that would mean keeping them in the same CDK app, even if you put them in separate stacks. In other frameworks, where the unit of deployment is the CloudFormation stack, then you should keep them in the same stack.
Update 31/08/2023: Big thanks to Samuel Lock for bringing this up. There are some caveats to my preference discussed below.
1) Shared infrastructure that doesn’t belong to any one application (such as VPCs and security groups) should have its own repo and stack. It should also have its own deployment pipeline and is possibly owned by a different team. In large organizations, shared infrastructure is often managed by the platform team. But in smaller organizations, that ownership is usually less well-defined.
2) With stateful resources such as RDS, you often need to share them across ephemeral environments. Otherwise, the cost of these RDS clusters would quickly multiply. I discussed this in more detail in another blog post. Therefore, the RDS cluster becomes a kind of “shared” resource. By the same logic as before, the RDS cluster should be in its own stack so it can be deployed independently of the application stacks that depend on it. In this case, your ephemeral environment would only include the relevant database tables within an existing RDS cluster, and you can reuse the same RDS cluster across many ephemeral environments.
Loose coupling and high cohesion are two of the most essential software engineering principles. Unrelated things should stay apart, while related elements should be kept together.
These principles apply at all levels of our application — from the system-level architecture all the way down to individual modules or functions.
With these principles in mind, let’s talk about the contentious topic of whether you should keep stateful and stateless resources in the same CloudFormation stack.
I’m very much in the monolith stack camp. I prefer to keep stateful (databases, queues, etc.) and stateless (Lambda functions, API Gateway, etc.) resources together.
Arguments for monolith stack
Assuming the CloudFormation stack encapsulates an entire service, which includes both stateful and stateless resources, then it makes sense to define all the resources in a single CloudFormation stack. This makes managing and deploying the service easier in a number of ways:
- Resource reference is easy. You can use !Ref or !GetAtt against any resource defined in the stack, for example, to pass the name of a DynamoDB table to a Lambda function as an environment variable.
- You can update both the stateful and stateless resources in a single commit and deployment. For example, when you need to add a new DynamoDB table and add a new Lambda function to use it.
- CI/CD set-up is simpler. One service, one stack, one repo, one pipeline.
- It’s easy to create ephemeral environments. For example, when you need to work on a new feature, you can create a temporary environment with a single deployment. Using the Serverless Framework, this is as simple as running npx sls deploy --stage <stage-name>. When you’re done with the feature, simply delete the temporary environment.
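To make the points above concrete, here is a minimal serverless.yml sketch of a service that keeps a DynamoDB table and the Lambda function that reads it in one stack. The service, function and table names (orders-service, getOrder, OrdersTable), the handler path and the runtime are assumptions for illustration only, not taken from a real project.

```yaml
# serverless.yml (illustrative sketch; all names are hypothetical)
service: orders-service

provider:
  name: aws
  runtime: nodejs18.x
  iam:
    role:
      statements:
        - Effect: Allow
          Action: dynamodb:GetItem
          # !GetAtt reads an attribute of a resource in the same stack
          Resource: !GetAtt OrdersTable.Arn

functions:
  getOrder:
    handler: functions/get-order.handler
    environment:
      # !Ref on an AWS::DynamoDB::Table returns the table name
      ORDERS_TABLE: !Ref OrdersTable
    events:
      - httpApi: GET /orders/{id}

resources:
  Resources:
    OrdersTable:
      Type: AWS::DynamoDB::Table
      Properties:
        # No TableName, so CloudFormation generates a unique name per stack
        BillingMode: PAY_PER_REQUEST
        AttributeDefinitions:
          - AttributeName: id
            AttributeType: S
        KeySchema:
          - AttributeName: id
            KeyType: HASH
```

Because the table has no hard-coded name, deploying this with a different --stage value should give you an isolated copy of both the function and the table, which is what makes ephemeral environments cheap to create and remove.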
Arguments for separate stacks
The counter-arguments to this approach usually revolve around these three points:
1. It’s less risky if you separate the stateful resources into their own stack. If someone accidentally deletes the stateless stack, at least you won’t lose data.
2. Stateful resources change less often than stateless resources. So deployments will be faster if you only need to deploy the stateless resources most of the time.
3. CloudFormation has a hard limit of 500 resources per stack. Moving the stateful resources out allows you to fit more stateless resources into the stack.
All these counter-arguments sound reasonable, but how much do they matter in practice and do they justify the extra complexity of having separate stacks?
Moving the stateful resources into their own stack doesn’t eliminate the risk of accidental deletion. It just moves the target. Someone fat-fingers the delete button on the stateful stack and it’s game over.
The right way to protect against accidental deletion is to enable Termination Protection on the stack and/or set the DeletionPolicy to Retain on the stateful resources.
There are other ways resources can be deleted accidentally. For example, when you change the name of a DynamoDB table, CloudFormation replaces the table during deployment. The official documentation lists TableName as a property whose update requires replacement.
So you should also consider setting the UpdateReplacePolicy to Retain to protect against data loss from accidental changes. This particular risk is present regardless of which stack the stateful resources reside in.
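As a rough sketch of what these safeguards look like in a template, the snippet below sets both the DeletionPolicy and UpdateReplacePolicy attributes on a DynamoDB table. The logical ID (OrdersTable) and the key schema are hypothetical and only there to make the example complete.

```yaml
# CloudFormation template fragment (illustrative; names are hypothetical)
Resources:
  OrdersTable:
    Type: AWS::DynamoDB::Table
    # Keep the table and its data if the resource or the stack is deleted
    DeletionPolicy: Retain
    # Keep the old table if an update forces CloudFormation to replace it
    UpdateReplacePolicy: Retain
    Properties:
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: id
          AttributeType: S
      KeySchema:
        - AttributeName: id
          KeyType: HASH
# Termination protection is a stack-level setting, so it is enabled on the
# stack itself rather than declared on a resource in this template.
```

Keep in mind that a retained table stays in your account after it leaves the stack, so you would need to clean it up (or re-import it) yourself.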
By default, CloudFormation skips resources that haven’t changed. So if the stateful resources haven’t been updated, they have a negligible impact on the time it takes to deploy the stack.
In most cases, how long it takes to update an existing stack is a function of the number of stateless resources, such as Lambda functions, IAM roles and API Gateway resources.
To illustrate this, I collected data from three CloudFormation stacks.
Stack 1: 5 Lambda functions.
Stack 2: 5 Lambda functions, and 5 DynamoDB tables.
Stack 3: 20 Lambda functions.
Putting aside the initial deployment time, this is how long it took to update these stacks on average (no changes to the DynamoDB tables):
Stack 1: 46.4 sec
Stack 2: 46.4 sec
Stack 3: 55 sec
There was no difference in the average deployment time between Stack 1 and Stack 2. This is despite Stack 2 having 5 additional DynamoDB tables.
Stack 3 has many more Lambda functions, so it takes roughly 9 seconds longer on average (55 sec vs. 46.4 sec) to update each time.
CloudFormation resource limits
While this is a valid argument, in practice, it doesn’t really matter unless you have lots of stateful resources in your stack. Because you typically have far fewer stateful resources than stateless resources, moving them out won’t buy you much space in most situations.
And even then, you can work around the 500 resource limit using nested stacks, which is a better way to deal with the limit anyway.
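For reference, a nested stack is declared as an ordinary resource in the parent template. The sketch below assumes a child template that has already been uploaded to S3 and that declares an OrdersTableName parameter. The bucket URL, logical IDs and parameter name are all hypothetical.

```yaml
# Parent template fragment (illustrative; names and URL are hypothetical)
Resources:
  OrdersTable:
    Type: AWS::DynamoDB::Table
    Properties:
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: id
          AttributeType: S
      KeySchema:
        - AttributeName: id
          KeyType: HASH

  # The nested stack counts as one resource in the parent stack
  ApiFunctions:
    Type: AWS::CloudFormation::Stack
    Properties:
      TemplateURL: https://example-bucket.s3.amazonaws.com/api-functions.yaml
      Parameters:
        # The child template must declare a matching OrdersTableName parameter
        OrdersTableName: !Ref OrdersTable
```

Each nested stack has its own 500 resource limit, and everything is still deployed and rolled back together through the parent stack, so the stateful and stateless resources stay in one unit of deployment.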
The stateful and stateless resources need to work together for our system to function. To achieve high cohesion, we should keep them together. Splitting them into separate stacks violates one of the most important principles in software engineering and complicates things unnecessarily.
However, this is not a hard and fast rule. There will always be edge cases for which it makes some sense to split them. But in most cases, keeping them together is the best choice.