A practical guide to testing AWS Step Functions

Yan Cui

Testing Step Functions can be a daunting task. However, with a little preparation and effort, the process can be simplified and streamlined. This article presents a practical strategy for testing Step Functions. Let’s start by setting the stage and introducing the players.

What makes Step Functions hard to test?

Step Functions are complicated by nature because they consist of many different parts that need to be tested individually. For example, a typical Step Functions state machine might include conditional branching that can direct the execution towards different paths. These paths all need to be tested so we can be sure they are working as expected.

We can also incorporate try-catch logic into our state machine to help us recover from errors that might occur at runtime. This logic also needs to be tested thoroughly to make sure it works properly.

Another important thing to note is that Step Functions work in a highly distributed environment. We can use Lambda functions to perform custom business logic, as well as integrate with over 200 AWS services directly. All of this functionality needs to be tested thoroughly to ensure it behaves properly when we launch our application into production.

What’s more, we can use callback patterns to suspend the execution while we wait for external processes to complete before proceeding along our execution path. This pattern is also very useful for integrating manual steps into an otherwise automated process, for example, to allow a human operator to approve or reject a request to deploy our application after a commit is pushed to GitHub.

Finally, there are the Wait states, which add a configurable delay to the execution and make end-to-end testing difficult. For example, if there is a Wait state that waits for an hour, would your test need to sleep that long as well? And would you be happy to wait an hour to run every test case in your test suite? Probably not!

All of these factors make Step Functions notoriously difficult to test. But there are ways to overcome these challenges (at least, to an extent) and produce high-quality test cases.

Testing with Step Functions Local

Step Functions Local is a local simulator for Step Functions that can execute our state machines locally. I generally avoid local simulators (such as LocalStack) because they are usually more trouble than they are worth. However, I make an exception for Step Functions Local because its mocking capability is almost a necessity if you want to achieve good test coverage for Step Functions.

You see, by default, Step Functions Local executes the Task states against the real AWS services (e.g. S3 or DynamoDB). We can override the service endpoints for a number of AWS services in the configuration.

Using this feature we can replace these services with local mocks while executing our tests. While this might seem useful, only a small number of services are supported and it would mean we have to use other tools to simulate these AWS services.

More importantly, Step Functions Local lets us mock the output of Task states. These mock responses have to be provided when you start Step Functions Local, but we can use them to drive the state machine execution down the particular path that we want to test. For example, we can throw the right error at a Task state to test an error path in our state machine, or provide the right output from a Task state that feeds into a Choice state to make sure we follow the right branch.
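As a rough illustration, the mock responses live in a JSON file that Step Functions Local reads at startup. The sketch below builds such a file in Python. The state machine name `OrderFlow`, the Task state `ChargeCard`, the test case names, and the payloads are all hypothetical, and the `Return` shape assumes `ChargeCard` is a Lambda task.

```python
import json

# Hypothetical names throughout: OrderFlow, ChargeCard, and the test cases
# are placeholders for your own state machine and Task states.
MOCK_CONFIG = {
    "StateMachines": {
        "OrderFlow": {
            "TestCases": {
                # Each test case maps a Task state name to a mocked response.
                "PaymentSucceeds": {"ChargeCard": "ChargeCardOk"},
                "PaymentFails": {"ChargeCard": "ChargeCardDeclined"},
            }
        }
    },
    "MockedResponses": {
        # The key "0" refers to the first invocation of the state
        # within a single execution.
        "ChargeCardOk": {
            "0": {"Return": {"StatusCode": 200, "Payload": {"status": "PAID"}}}
        },
        # A mocked error lets us exercise the Catch (error) path.
        "ChargeCardDeclined": {
            "0": {"Throw": {"Error": "CardDeclined", "Cause": "Insufficient funds"}}
        },
    },
}


def render_mock_config():
    """Return the mock configuration as a JSON string."""
    return json.dumps(MOCK_CONFIG, indent=2)
```

The rendered JSON would be saved to a file whose path is passed to Step Functions Local through the `SFN_MOCK_CONFIG` environment variable. A test case is then selected by appending `#PaymentFails` (or another test case name) to the state machine ARN when starting an execution.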

Problems with Step Functions Local

While it’s a useful tool and one that you should have in your arsenal, there are still a few notable problems with it.

1. There is no guarantee that the mocked calls will behave exactly as they would in production. This is a problem with all mocks and local simulators: they are not a perfect replica of the real execution environment, they tend to lag behind new features, and they can produce false negatives or false positives.

2. It doesn’t simulate IAM and therefore it can’t help us catch permission-related errors in our state machine. This is important, especially in a complex state machine that interacts with many AWS services through direct service integrations.

3. It doesn’t support fast-forwarding through a Wait state or simulating a Task timeout.

4. Last and perhaps most importantly, it doesn’t support CloudFormation references. When you create a state machine against Step Functions Local, the ASL must use ARNs instead of CloudFormation references because the state machine is not deployed as part of a CloudFormation stack. This means that you cannot use references to other resources in the state machine. For example, you couldn’t reference an Amazon S3 bucket if you wanted your state machine to write data to it.

This last point creates friction in your development workflow. It means “test before you deploy” is not really viable without first deploying the project and creating those resources referenced by the state machine.

End-to-End testing

To run end-to-end tests, we would deploy the project to AWS and create the state machine and all the resources that it references. Then we would execute the state machine with different inputs to cover different paths.
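A minimal sketch of such a test helper is shown below. It assumes a Step Functions client object (for example, one created with `boto3.client("stepfunctions")`) is passed in. The helper name `run_execution` and its parameters are hypothetical.

```python
import json
import time
import uuid

# Statuses after which an execution will not change any further.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def run_execution(sfn, state_machine_arn, payload, timeout_s=60, poll_s=1.0):
    """Start an execution and poll until it reaches a terminal status.

    `sfn` is assumed to be a Step Functions client; only start_execution
    and describe_execution are used.
    """
    started = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        name=f"e2e-{uuid.uuid4()}",  # execution names must be unique
        input=json.dumps(payload),
    )
    deadline = time.monotonic() + timeout_s
    while True:
        result = sfn.describe_execution(executionArn=started["executionArn"])
        if result["status"] in TERMINAL_STATUSES:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"execution still {result['status']} after {timeout_s}s")
        time.sleep(poll_s)
```

A test case would call `run_execution` with an input chosen to steer the execution down one path, then assert on the returned `status` and `output`.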

However, it’s often difficult or impossible to cover all the execution paths using end-to-end tests. For example, the branch logic might depend on the result of a call to a third-party API such as Stripe or PayPal. Or perhaps an error path relies on DynamoDB throwing an error. These are just a couple of examples of scenarios that we can’t easily cover using end-to-end tests.

For some of these scenarios, we can use mock APIs and return dummy results for our branch logic. For example, you can use Apidog to host a mock Stripe API to test the payment flow from your state machine. You can also host a local mock API and expose it publicly using ngrok.

In both cases, we are not actually interacting with the real third-party API. So any failure in our branch logic won’t impact our customers as they won’t receive a failed payment on a real credit card. But the mock API will let us know that our branch logic is failing when it shouldn’t.

However, we do need a way to “convince” our state machine to use the mock API instead of the third party’s real API. One way to do this is to add the mock API URL to the execution input, as an override, and have our Lambda functions use it whenever it’s specified.
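One way this override could look inside a Lambda function is sketched below. The input field names (`overrides`, `paymentsApiUrl`) and the default URL constant are hypothetical. The handler only resolves and returns the URL, so that the selection logic is easy to see.

```python
# Hypothetical default: the real third-party endpoint used in normal operation.
DEFAULT_PAYMENTS_URL = "https://api.stripe.com"


def resolve_payments_url(event):
    """Pick the mock API URL from the execution input if one is given."""
    overrides = event.get("overrides") or {}
    return overrides.get("paymentsApiUrl") or DEFAULT_PAYMENTS_URL


def handler(event, context):
    base_url = resolve_payments_url(event)
    # A real handler would call the payments API at base_url here.
    return {"paymentsApiUrl": base_url}
```

An end-to-end test would then start the execution with an input such as `{"overrides": {"paymentsApiUrl": "https://mock.example.com"}}`, while normal executions omit the field and fall back to the real API.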

Component testing for the Lambda functions

If a state machine consists primarily of Lambda functions, then we can test each function separately using component-based testing techniques:

  •  Encapsulate the domain logic into its own modules and write unit tests for them.
  •  Write integration tests (aka sociable tests) that invoke the Lambda function code locally but have it talk to the real AWS services (no mocks or local simulators!). These tests allow you to iterate quickly on the functional code without having to deploy it to AWS after every change.
  •  After you’ve gained confidence from the local tests, deploy everything to AWS and test the Lambda functions as part of the end-to-end tests against the state machine.

To learn more about this approach to testing Lambda functions, check out the following blog post: My strategy for testing serverless applications.

Or, if you want a more in-depth walkthrough and see this in action, then check out my new course “Testing Serverless Architectures”.

A strategy for testing Step Functions

Ok, so far we’ve covered the challenges with testing Step Functions and presented three approaches for testing them:

  • Using Step Functions Local.
  • Using end-to-end tests.
  • Component testing on individual Lambda functions.

Let’s combine them to come up with a strategy that gives us the best from all three approaches.

First, use component testing for the individual Lambda functions.

Then, try to cover as many of the execution paths as possible using end-to-end tests. However, remember that end-to-end tests can’t cover all the possible execution paths for a state machine. Or, at the very least, it’ll be very challenging to achieve 100% coverage with end-to-end tests. So, there are likely some gaps in our test coverage. For example, some hard-to-reach Choice branches and error paths.

Finally, use Step Functions Local and mock responses to fill in the gaps in the end-to-end tests’ coverage. Where we are not able to direct the end-to-end tests towards a test case we want to execute, we can instead use the mocked responses in Step Functions Local to drive the state machine towards those execution paths.

But wait!

Isn’t the point of Step Functions Local to let us test our state machines locally without deploying them to AWS?

In practice, that is really hard to achieve because it doesn’t support CloudFormation references. Instead, I find the best way to use Step Functions Local is to bridge the gaps in our end-to-end tests.

In this case, I would:

  1. Deploy the state machine and all the other resources it depends on to AWS. The deployed state machine would contain the fully qualified ARNs instead of CloudFormation references.
  2. Start Step Functions Local with the mock responses required for my test cases.
  3. Create the state machine against Step Functions Local, using the definition of the deployed state machine.
  4. Execute test cases against Step Functions Local.
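Steps 3 and 4 above can be sketched roughly as follows. This assumes boto3 is available and that Step Functions Local is listening on its default port, 8083. The helper names are hypothetical, and the clients are passed in as arguments so the helpers can be exercised with stand-ins.

```python
import json

LOCAL_ENDPOINT = "http://localhost:8083"  # default Step Functions Local port


def make_clients(region="us-east-1"):
    """Create one client for real AWS and one for Step Functions Local."""
    import boto3  # third-party AWS SDK, imported lazily

    real = boto3.client("stepfunctions", region_name=region)
    local = boto3.client(
        "stepfunctions", region_name=region, endpoint_url=LOCAL_ENDPOINT
    )
    return real, local


def clone_to_local(real_sfn, local_sfn, deployed_arn, local_name):
    """Step 3: recreate the deployed state machine in Step Functions Local.

    The deployed definition already contains fully qualified ARNs.
    `local_name` should match the state machine name used in the mock
    configuration file.
    """
    deployed = real_sfn.describe_state_machine(stateMachineArn=deployed_arn)
    created = local_sfn.create_state_machine(
        name=local_name,
        definition=deployed["definition"],
        roleArn=deployed["roleArn"],
    )
    return created["stateMachineArn"]


def start_test_case(local_sfn, local_arn, test_case, payload):
    """Step 4: run one mocked test case, selected with the '#name' suffix."""
    started = local_sfn.start_execution(
        stateMachineArn=f"{local_arn}#{test_case}",
        input=json.dumps(payload),
    )
    return started["executionArn"]
```

The test would then poll `describe_execution` on the local client, or inspect the execution history, to assert that the expected path was taken.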

Combining end-to-end tests with Step Functions Local like this would give you almost 100% coverage of all the execution paths.

However, there might still be some gaps left in our test coverage, specifically where Wait states and Task timeouts are concerned. It’s not feasible to write test cases that would have to wait indefinitely, and Step Functions Local doesn’t support fast-forwarding through these Wait states.

The only viable solution that I have come up with is to use Step Functions Local and rewrite the relevant Wait states to wait for only a second. We can do this in step 3 above when we create the state machine against Step Functions Local. The same can be done to make timeouts shorter as well and therefore make it feasible to test the error paths. To learn more about this approach, check out my follow-up post.
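A minimal sketch of that rewrite is shown below, operating on the state machine definition after it has been parsed from JSON. The function name `shorten_waits` and its parameters are hypothetical. It only handles the literal `TimeoutSeconds` field, not `TimeoutSecondsPath`.

```python
import copy

# Every field a Wait state can use to say how long to wait.
WAIT_FIELDS = ("Seconds", "SecondsPath", "Timestamp", "TimestampPath")


def shorten_waits(definition, wait_seconds=1, max_timeout_seconds=None):
    """Return a copy of an ASL definition (a dict) with short Wait states.

    If max_timeout_seconds is given, TimeoutSeconds values are capped too.
    """
    asl = copy.deepcopy(definition)  # leave the caller's definition untouched
    _rewrite(asl, wait_seconds, max_timeout_seconds)
    return asl


def _rewrite(machine, wait_seconds, max_timeout_seconds):
    for state in machine.get("States", {}).values():
        if state.get("Type") == "Wait":
            for field in WAIT_FIELDS:
                state.pop(field, None)
            state["Seconds"] = wait_seconds
        if max_timeout_seconds is not None and "TimeoutSeconds" in state:
            state["TimeoutSeconds"] = min(
                state["TimeoutSeconds"], max_timeout_seconds
            )
        # Parallel and Map states contain nested state machines.
        for branch in state.get("Branches", []):
            _rewrite(branch, wait_seconds, max_timeout_seconds)
        for key in ("Iterator", "ItemProcessor"):
            if key in state:
                _rewrite(state[key], wait_seconds, max_timeout_seconds)
```

In practice, the definition string of the deployed state machine would be parsed with `json.loads`, passed through `shorten_waits`, and serialized again with `json.dumps` before creating the state machine against Step Functions Local.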

Wrap up

The strategy I outlined above gives us the best of both worlds: end-to-end testing combined with local execution of the test cases that require special attention.

I hope that this post was useful to you. If you want to learn more about testing serverless architectures, including Step Functions, then you should check out my new course “Testing Serverless Architectures”.

Hope to see you in the course :-)
