Applying the Saga pattern with AWS Lambda and Step Functions

The Saga pattern is a pattern for managing failures, where each action has a compensating action for rollback.

In Hector Garcia-Molina’s 1987 paper, it is described as an approach to handling system failures in a long-running transactions.

It has become increasingly relevant in the world of microservices as application logic often needs to transact across multiple bounded contexts – each encapsulated by its own microservice with independent databases. Caitie McCaffrey gave a good talk on using the Saga pattern in distributed systems, which you can watch here.

Using Caitie’s example from her talk, suppose we have a transaction that goes something like this:

Begin transaction
    Start book hotel request
    End book hotel request
    Start book flight request
    End book flight request
    Start book car rental request
    End book car rental request
End transaction

We can model each of the actions (and their compensating actions) with a Lambda function, and use a state machine in Step Function as the coordinator for the saga.

Because the compensating actions can also fail so we need to be able to retry them until success, which means they have to be idempotent.

In the example below, we’ll implement backward recovery in the event of a failure.

Each Lambda function expects the input to be in the following shape.

{
  "trip_id": "5c12d94a-ee6a-40d9-889b-1d49142248b7",
  "depart": "London",
  "depart_at": "2017-07-10T06:00:00.000Z",
  "arrive": "Dublin",
  "arrive_at": "2017-07-12T08:00:00.000Z",
  "hotel": "holiday inn",
  "check_in": "2017-07-10T12:00:00.000Z",
  "check_out": "2017-07-12T14:00:00.000Z",
  "rental": "Volvo",
  "rental_from": "2017-07-10T00:00:00.000Z",
  "rental_to": "2017-07-12T00:00:00.000Z"
}

Inside each of the functions is a simple PutItem request against a different DynamoDB table. The corresponding compensating function will perform a DeleteItem against the corresponding table to rollback the PutItem action.

The state machine pass the same input to each action in turn:

  1. BookHotel
  2. BookFlight
  3. BookRental

and record their results at a specific path (so to avoid overriding the input $ that will be passed to the next function).

In this naive implementation, we’ll apply the compensating action for any failure – hence the State.ALL below. In practice, you should consider giving certain error types a retry – eg. temporal errors such as DynamoDB’s provision throughput exceeded exceptions.

Success Case

Following the happy path, each of the actions are performed in turn and the state machine will end successfully.

Failure Cases

When failures strike, depending on where the failure occurs we need to apply the corresponding compensating actions in turn.

In the examples below, if the failure happened at BookFlight, then both CancelFlight and CancelHotel will be executed to rollback any changes performed thus far.

Similar, if the failure happened at BookRental, then all three compensating actions – CancelRental, CancelFlight and CancelHotel – will be executed in that order to rollback all the state changes from the transaction.

Each compensating action also have an infinite retry loop! In practice, there should be a reasonable upper limit on the no. of retries before you alert for human intervention.

You can find the source code for this demo here.

Liked this article? Support me on Patreon and get direct help from me via a private Slack channel or 1-2-1 mentoring.
Subscribe to my newsletter


Hi, I’m Yan. I’m an AWS Serverless Hero and the author of Production-Ready Serverless.

I specialise in rapidly transitioning teams to serverless and building production-ready services on AWS.

Are you struggling with serverless or need guidance on best practices? Do you want someone to review your architecture and help you avoid costly mistakes down the line? Whatever the case, I’m here to help.

Hire me.


Check out my new course, Complete Guide to AWS Step Functions. In this course, we’ll cover everything you need to know to use AWS Step Functions service effectively. Including basic concepts, HTTP and event triggers, activities, callbacks, nested workflows, design patterns and best practices.

Get Your Copy


Come learn about operational BEST PRACTICES for AWS Lambda: CI/CD, testing & debugging functions locally, logging, monitoring, distributed tracing, canary deployments, config management, authentication & authorization, VPC, security, error handling, and more.

You can also get 40% off the face price with the code ytcui.

Get Your Copy


1 thought on “Applying the Saga pattern with AWS Lambda and Step Functions”

  1. Theoretically that is really nice, but in practice it causes a number of different issues. It also fails to sufficiently deal with the problem mentioned in the linked paper:

    However, unlike other transactions, the transactions m a saga are related to each other and should be executed as a (non-atomic) unit any partial executions of the saga are undesirable, and if they occur, must be compensated for

    The following problems occur:
    * Complex state management having to understand and organize all the possible states. What if we wanted to add two flight reservations or at the point of placing the car rental change the dates for the flight. Using a flow diagram is suboptimal.
    * Tasks for (Success, Rollback) for each type of transaction are separated, which mean in practice duplicating the management of the resources available or managed in the Happy Path.
    * Scaling in next to impossible to do the fixed flow. Take adding the ability to cancel an already successful path. What if the original step function didn’t include flight logic. Now you’ll have to handle that.
    * Repeatable idempotent transactions, while it is true that all the individual transactions look successful, it is impossible to know if the whole chain was successful without re-executing the flow. This causes consistency issues, for instance let’s say you have 2 tasks to perform:
    1. Order Widget
    2. Send Email to Customer

    What happens if email sending succeeds but the state machine traversal fails. That means you’ll think the process was unsuccessful, but really both the order and the email worked. That’s because there is a third hidden 3. Set state to success.

    What works better is a decentralized parallel transactions that self contain the successful and failure paths
    * API => Task Manager
    * Task Manager => FlightManager.Reserve Flight => onFailure => Cancel Filght
    * Task Manager => HotelManager.Reserve Hotel => onFailure => Cancel Hotel
    * Task Manager => CarManager.Reserve Car => onFailure => Cancel Car
    * Done

    Rather than the forced synchronous approach, this also provides a successful async approach. At any point you can observe the current state of the system via
    * API => Get State
    * Get State => FlightManager.GetState, HotelManager.GetState, CarManager.GetState

    and jump any where into the process when you want to:
    * API => Cancel Flight => FlightManager.CancelFlight

    Without worrying about the other parts of the flow. Since everything is self contained, you never to ask the question, hmmm, I’ve handled the Flight in the FlightManager, does the FlightManager need to do something else. The answer is you are always good because each Manager was designed to handle the full state machine with idempotency of its own resources.

Comments are closed.