Applying the Saga pattern with AWS Lambda and Step Functions


The Saga pattern is a pattern for managing failures, where each action has a compensating action for rollback.

In Hector Garcia-Molina’s 1987 paper, it is described as an approach to handling system failures in long-running transactions.

It has become increasingly relevant in the world of microservices as application logic often needs to transact across multiple bounded contexts – each encapsulated by its own microservice with independent databases. Caitie McCaffrey gave a good talk on using the Saga pattern in distributed systems, which you can watch here.

Using Caitie’s example from her talk, suppose we have a transaction that goes something like this:

Begin transaction
    Start book hotel request
    End book hotel request
    Start book flight request
    End book flight request
    Start book car rental request
    End book car rental request
End transaction

We can model each of the actions (and their compensating actions) with a Lambda function, and use a state machine in Step Functions as the coordinator for the saga.

Because the compensating actions can also fail, we need to be able to retry them until they succeed, which means they have to be idempotent.

In the example below, we’ll implement backward recovery in the event of a failure.

Each Lambda function expects the input to be in the following shape.

{
  "trip_id": "5c12d94a-ee6a-40d9-889b-1d49142248b7",
  "depart": "London",
  "depart_at": "2017-07-10T06:00:00.000Z",
  "arrive": "Dublin",
  "arrive_at": "2017-07-12T08:00:00.000Z",
  "hotel": "holiday inn",
  "check_in": "2017-07-10T12:00:00.000Z",
  "check_out": "2017-07-12T14:00:00.000Z",
  "rental": "Volvo",
  "rental_from": "2017-07-10T00:00:00.000Z",
  "rental_to": "2017-07-12T00:00:00.000Z"
}

Inside each of the functions is a simple PutItem request against a different DynamoDB table. The corresponding compensating function performs a DeleteItem against the same table to roll back the PutItem action.
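As a rough sketch of what one action and its compensating action might look like, the Python below uses a small in-memory stand-in for a DynamoDB table so that it runs without AWS credentials. The FakeTable class, the handler names and the table parameter are assumptions made for illustration and are not taken from the demo's source; in a real function you would call put_item and delete_item on a boto3 Table resource instead.

```python
from typing import Any, Dict


class FakeTable:
    """In-memory stand-in for a DynamoDB table (illustration only)."""

    def __init__(self) -> None:
        self.items: Dict[str, Dict[str, Any]] = {}

    def put_item(self, Item: Dict[str, Any]) -> None:
        # Like PutItem, writing the same key twice simply overwrites the item.
        self.items[Item["trip_id"]] = Item

    def delete_item(self, Key: Dict[str, Any]) -> None:
        # Like DeleteItem, deleting a key that does not exist is not an error.
        self.items.pop(Key["trip_id"], None)


def book_hotel(event: Dict[str, Any], table: FakeTable) -> Dict[str, Any]:
    """Action: record the hotel booking for this trip."""
    table.put_item(Item={
        "trip_id": event["trip_id"],
        "hotel": event["hotel"],
        "check_in": event["check_in"],
        "check_out": event["check_out"],
    })
    return {"booked": True}


def cancel_hotel(event: Dict[str, Any], table: FakeTable) -> Dict[str, Any]:
    """Compensating action: remove the booking. Safe to call repeatedly."""
    table.delete_item(Key={"trip_id": event["trip_id"]})
    return {"cancelled": True}
```

Because the delete is keyed on trip_id and tolerates a missing item, running cancel_hotel a second time leaves the table in the same state, which is what makes it safe to retry.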

The state machine passes the same input to each action in turn:

  1. BookHotel
  2. BookFlight
  3. BookRental

and records their results at a specific path (so as to avoid overwriting the input $ that will be passed to the next function).

In this naive implementation, we’ll apply the compensating action for any failure, hence the catch-all States.ALL error matcher. In practice, you should consider giving certain error types a retry, e.g. transient errors such as DynamoDB’s ProvisionedThroughputExceededException.
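A minimal sketch of how the first action might be declared in Amazon States Language, built here as a Python dictionary so it can be serialised with json.dumps. The ARN is a placeholder, the paths $.BookHotelResult and $.BookHotelError are assumed names, and the retry numbers are arbitrary; none of them are taken from the demo's source. The error name in the Retry block is also an assumption, since the name Step Functions sees depends on how your function surfaces the exception.

```python
import json

# Placeholder ARN (assumption); substitute your own function.
BOOK_HOTEL_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:book-hotel"

book_hotel_state = {
    "Type": "Task",
    "Resource": BOOK_HOTEL_ARN,
    # Store the result beside the input so "$" still carries the trip details.
    "ResultPath": "$.BookHotelResult",
    # Give throttling errors a few retries with backoff (assumed values).
    "Retry": [{
        "ErrorEquals": ["ProvisionedThroughputExceededException"],
        "IntervalSeconds": 1,
        "MaxAttempts": 3,
        "BackoffRate": 2.0,
    }],
    # Anything else falls through to the compensating action.
    "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.BookHotelError",
        "Next": "CancelHotel",
    }],
    "Next": "BookFlight",
}

# JSON text for this one state, ready to place under "States" in a definition.
book_hotel_json = json.dumps({"BookHotel": book_hotel_state}, indent=2)
```

Retriers are evaluated before catchers, so in this sketch the compensating action only runs once the retries for a matching error are exhausted or the error does not match the Retry block at all.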

Success Case

Following the happy path, each of the actions is performed in turn and the state machine ends successfully.

Failure Cases

When failures strike, depending on where the failure occurs we need to apply the corresponding compensating actions in turn.

For example, if the failure happens at BookFlight, then both CancelFlight and CancelHotel are executed to roll back any changes performed thus far.

Similarly, if the failure happens at BookRental, then all three compensating actions (CancelRental, CancelFlight and CancelHotel) are executed in that order to roll back all the state changes from the transaction.
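The backward recovery order described above can be sketched in plain Python, without Step Functions, to make the control flow explicit. The run_saga function and the succeed and fail helpers are hypothetical names used only for this illustration. Note that the failed step is compensated as well, since it may have been partially applied.

```python
from typing import Any, Callable, Dict, List, Tuple

Handler = Callable[[Dict[str, Any]], None]
# (action name, action, compensation name, compensation)
Step = Tuple[str, Handler, str, Handler]


def run_saga(steps: List[Step], trip: Dict[str, Any]) -> List[str]:
    """Run actions in order; on failure, compensate in reverse order.

    Returns the names of every function invoked, in invocation order.
    """
    invoked: List[str] = []
    for index, (name, action, _cancel_name, _cancel) in enumerate(steps):
        invoked.append(name)
        try:
            action(trip)
        except Exception:
            # Undo the failed step and every step before it, newest first.
            for _name, _action, cancel_name, cancel in reversed(steps[: index + 1]):
                invoked.append(cancel_name)
                cancel(trip)
            return invoked
    return invoked


def succeed(trip: Dict[str, Any]) -> None:
    return None


def fail(trip: Dict[str, Any]) -> None:
    raise RuntimeError("simulated booking failure")


steps_failing_at_rental: List[Step] = [
    ("BookHotel", succeed, "CancelHotel", succeed),
    ("BookFlight", succeed, "CancelFlight", succeed),
    ("BookRental", fail, "CancelRental", succeed),
]
print(run_saga(steps_failing_at_rental, {"trip_id": "example"}))
```

With the failure at BookRental, this invokes BookHotel, BookFlight and BookRental, followed by CancelRental, CancelFlight and CancelHotel. The sketch does not retry a failing compensation; in the state machine that job belongs to the retry configuration on each compensating state.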

Each compensating action also has an infinite retry loop! In practice, there should be a reasonable upper limit on the number of retries before you alert for human intervention.
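One way to put an upper limit on those retries, sketched as an Amazon States Language state in a Python dictionary. The ARN is a placeholder, and the state names NotifyOperator and TripFailed, the paths and the retry numbers are all assumptions for illustration rather than part of the demo. The helper estimates the worst-case time spent waiting between attempts, ignoring Lambda execution time.

```python
from typing import Any, Dict

# Placeholder ARN (assumption); substitute your own function.
CANCEL_HOTEL_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:cancel-hotel"

cancel_hotel_state: Dict[str, Any] = {
    "Type": "Task",
    "Resource": CANCEL_HOTEL_ARN,
    "ResultPath": "$.CancelHotelResult",
    # Bounded retries with exponential backoff instead of looping forever.
    "Retry": [{
        "ErrorEquals": ["States.ALL"],
        "IntervalSeconds": 2,
        "MaxAttempts": 5,
        "BackoffRate": 2.0,
    }],
    # Once retries are exhausted, hand over to a (hypothetical) alerting state.
    "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.CancelHotelError",
        "Next": "NotifyOperator",
    }],
    "Next": "TripFailed",
}


def total_backoff_seconds(retrier: Dict[str, Any]) -> float:
    """Sum of the waits before each retry: interval * rate^0, rate^1, ..."""
    return sum(
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** attempt
        for attempt in range(retrier["MaxAttempts"])
    )
```

With these assumed values the waits are 2, 4, 8, 16 and 32 seconds, so the compensating action gives up and alerts after roughly a minute of backoff.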

You can find the source code for this demo here.


Hi, I’m Yan. I’m an AWS Serverless Hero and I help companies go faster for less by adopting serverless technologies successfully.


1 thought on “Applying the Saga pattern with AWS Lambda and Step Functions”

  1. Theoretically that is really nice, but in practice it causes a number of different issues. It also fails to sufficiently deal with the problem mentioned in the linked paper:

    However, unlike other transactions, the transactions in a saga are related to each other and should be executed as a (non-atomic) unit: any partial executions of the saga are undesirable, and if they occur, must be compensated for.

    The following problems occur:
    * Complex state management: you have to understand and organize all the possible states. What if we wanted to add two flight reservations, or change the dates for the flight at the point of placing the car rental? Using a flow diagram is suboptimal.
    * Tasks for (Success, Rollback) for each type of transaction are separated, which in practice means duplicating the management of the resources available or managed in the Happy Path.
    * Scaling is next to impossible with the fixed flow. Take adding the ability to cancel an already successful path. What if the original step function didn’t include flight logic? Now you’ll have to handle that.
    * Repeatable idempotent transactions: while it is true that all the individual transactions look successful, it is impossible to know whether the whole chain was successful without re-executing the flow. This causes consistency issues. For instance, let’s say you have 2 tasks to perform:
    1. Order Widget
    2. Send Email to Customer

    What happens if email sending succeeds but the state machine traversal fails? That means you’ll think the process was unsuccessful, but really both the order and the email worked. That’s because there is a hidden third step: 3. Set state to success.

    What works better is decentralized parallel transactions that each contain their own success and failure paths:
    * API => Task Manager
    * Task Manager => FlightManager.Reserve Flight => onFailure => Cancel Flight
    * Task Manager => HotelManager.Reserve Hotel => onFailure => Cancel Hotel
    * Task Manager => CarManager.Reserve Car => onFailure => Cancel Car
    * Done

    Rather than the forced synchronous approach, this also provides a successful async approach. At any point you can observe the current state of the system via
    * API => Get State
    * Get State => FlightManager.GetState, HotelManager.GetState, CarManager.GetState

    and jump anywhere into the process when you want to:
    * API => Cancel Flight => FlightManager.CancelFlight

    Without worrying about the other parts of the flow. Since everything is self-contained, you never need to ask the question: hmmm, I’ve handled the Flight in the FlightManager, does the FlightManager need to do something else? The answer is that you are always good, because each Manager was designed to handle the full state machine with idempotency of its own resources.
