Applying the Saga pattern with AWS Lambda and Step Functions

The Saga pattern is a pattern for managing failures, where each action has a compensating action for rollback.

In Hector Garcia-Molina’s 1987 paper, it is described as an approach to handling system failures in long-running transactions.

It has become increasingly relevant in the world of microservices, as application logic often needs to transact across multiple bounded contexts, each encapsulated by its own microservice with an independent database. Caitie McCaffrey gave a good talk on using the Saga pattern in distributed systems, which you can watch here.

Using Caitie’s example from her talk, suppose we have a transaction that goes something like this:

Begin transaction
    Start book hotel request
    End book hotel request
    Start book flight request
    End book flight request
    Start book car rental request
    End book car rental request
End transaction

We can model each of the actions (and their compensating actions) with a Lambda function, and use a state machine in Step Functions as the coordinator for the saga.

Because the compensating actions can also fail, we need to be able to retry them until they succeed, which means they have to be idempotent.

In the example below, we’ll implement backward recovery in the event of a failure.
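To make backward recovery concrete outside of Step Functions, here is a minimal in-memory sketch. It runs the steps in order and, on the first failure, replays the compensating actions in reverse, retrying each one a bounded number of times. The names `Step`, `run_saga`, `max_attempts` and the `bookings` set are hypothetical and exist only for illustration; they are not part of the demo.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]
    compensate: Callable[[dict], None]  # must be idempotent


def _retry(fn, request, max_attempts):
    """Call fn until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            fn(request)
            return
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the error for human intervention


def run_saga(steps: List[Step], request: dict, max_attempts: int = 5) -> bool:
    """Run steps in order; on failure, compensate in reverse order."""
    started: List[Step] = []
    for step in steps:
        started.append(step)  # the failing step is compensated as well
        try:
            step.action(request)
        except Exception:
            for done in reversed(started):
                _retry(done.compensate, request, max_attempts)
            return False
    return True


# Usage: hotel and flight succeed, the rental fails, everything is undone.
bookings = set()


def book(kind):
    return lambda req: bookings.add((kind, req["trip_id"]))


def cancel(kind):
    # discard() is a no-op when the item is absent, so repeating it is safe
    return lambda req: bookings.discard((kind, req["trip_id"]))


def fail(req):
    raise RuntimeError("no rental cars available")


steps = [
    Step("hotel", book("hotel"), cancel("hotel")),
    Step("flight", book("flight"), cancel("flight")),
    Step("rental", fail, cancel("rental")),
]
ok = run_saga(steps, {"trip_id": "example-trip"})
```

Because each compensation is idempotent, cancelling the rental that was never booked is harmless, which is what lets the coordinator treat the failed step like any other.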

Each Lambda function expects the input to be in the following shape.

{
  "trip_id": "5c12d94a-ee6a-40d9-889b-1d49142248b7",
  "depart": "London",
  "depart_at": "2017-07-10T06:00:00.000Z",
  "arrive": "Dublin",
  "arrive_at": "2017-07-12T08:00:00.000Z",
  "hotel": "holiday inn",
  "check_in": "2017-07-10T12:00:00.000Z",
  "check_out": "2017-07-12T14:00:00.000Z",
  "rental": "Volvo",
  "rental_from": "2017-07-10T00:00:00.000Z",
  "rental_to": "2017-07-12T00:00:00.000Z"
}

Inside each of the functions is a simple PutItem request against a different DynamoDB table. The corresponding compensating function performs a DeleteItem against the same table to roll back the PutItem action.
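A pair of handlers along these lines might look like the sketch below. It is an assumption-laden illustration rather than the demo's code: the `HOTEL_TABLE` environment variable, the default table name `hotel_bookings`, the choice of `trip_id` as the table's key, and the optional `table` parameter (added so the handlers can be exercised without AWS) are all hypothetical.

```python
import os


def _table(env_var, default_name):
    # Imported lazily so the module loads without boto3 or AWS credentials.
    import boto3
    return boto3.resource("dynamodb").Table(os.environ.get(env_var, default_name))


def book_hotel(event, context=None, table=None):
    """Action: record the hotel booking with a PutItem."""
    if table is None:
        table = _table("HOTEL_TABLE", "hotel_bookings")
    table.put_item(Item={
        "trip_id": event["trip_id"],  # assumed to be the table's key
        "hotel": event["hotel"],
        "check_in": event["check_in"],
        "check_out": event["check_out"],
    })
    return {"booked": True}


def cancel_hotel(event, context=None, table=None):
    """Compensation: DeleteItem on the same key.

    DynamoDB does not raise an error when the item is already gone,
    so this can be retried safely.
    """
    if table is None:
        table = _table("HOTEL_TABLE", "hotel_bookings")
    table.delete_item(Key={"trip_id": event["trip_id"]})
    return {"cancelled": True}
```

Both operations are keyed on `trip_id`, so repeating either one leaves the table in the same state, which is the idempotency the saga relies on.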

The state machine passes the same input to each action in turn:

  1. BookHotel
  2. BookFlight
  3. BookRental

and records their results at a specific path (so as to avoid overwriting the input $ that will be passed to the next function).

In this naive implementation, we’ll apply the compensating action for any failure, hence the States.ALL below. In practice, you should consider retrying certain error types, e.g. transient errors such as DynamoDB’s ProvisionedThroughputExceededException.
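The forward path can be sketched as a fragment of an Amazon States Language definition, built here as a Python dict so it can be serialised with `json.dumps`. The Lambda ARNs are placeholders, the `$.BookHotelResult` and `$.BookHotelError` style paths are illustrative names, and ending the machine directly after BookRental is a simplification; the demo's actual definition may differ. The Cancel* targets are the compensating states and are not defined in this fragment.

```python
import json

ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:{}"  # placeholder


def booking_state(name, next_state, on_error):
    """Build one Task state for a booking action."""
    state = {
        "Type": "Task",
        "Resource": ARN.format(name),
        # Store the result beside the input instead of replacing `$`.
        "ResultPath": f"$.{name}Result",
        # Naive: any error at all triggers compensation.
        "Catch": [{
            "ErrorEquals": ["States.ALL"],
            "ResultPath": f"$.{name}Error",
            "Next": on_error,
        }],
    }
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


forward_states = {
    "BookHotel": booking_state("BookHotel", "BookFlight", "CancelHotel"),
    "BookFlight": booking_state("BookFlight", "BookRental", "CancelFlight"),
    "BookRental": booking_state("BookRental", None, "CancelRental"),
}

print(json.dumps(forward_states["BookFlight"], indent=2))
```

Setting `ResultPath` on both the state and its `Catch` clause keeps the original trip details intact, so the compensating functions receive the same `trip_id` the booking functions saw.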

Success Case

Following the happy path, each of the actions is performed in turn and the state machine ends successfully.

Failure Cases

When failures strike, we need to apply the corresponding compensating actions in turn, depending on where the failure occurs.

In the examples below, if the failure happens at BookFlight, then both CancelFlight and CancelHotel will be executed to roll back any changes performed thus far.

Similarly, if the failure happens at BookRental, then all three compensating actions (CancelRental, CancelFlight and CancelHotel) will be executed in that order to roll back all the state changes from the transaction.
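The rule behind these two cases can be written down as a small helper. This is only a sketch: the function name `compensations_for` is hypothetical, and the BookHotel case is an extrapolation from the two cases described above. Note that the failed action's own compensation is included, presumably because the action may have written its record before the error surfaced.

```python
ACTIONS = ["BookHotel", "BookFlight", "BookRental"]


def compensations_for(failed_action):
    """Compensating states to run, in order, when `failed_action` fails."""
    index = ACTIONS.index(failed_action)
    attempted = ACTIONS[: index + 1]  # includes the action that failed
    return [name.replace("Book", "Cancel", 1) for name in reversed(attempted)]


print(compensations_for("BookFlight"))  # ['CancelFlight', 'CancelHotel']
print(compensations_for("BookRental"))  # ['CancelRental', 'CancelFlight', 'CancelHotel']
```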

Each compensating action also has an infinite retry loop! In practice, there should be a reasonable upper limit on the number of retries before you alert for human intervention.
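One way to bound the retries is to use the `Retry` field of a Task state with a `MaxAttempts` limit, and fall through to a `Catch` once the attempts are exhausted. The sketch below builds such a state as a Python dict. The ARN is a placeholder, the retry numbers are arbitrary, and the `NotifyOperator` and `Fail` state names are hypothetical states you would need to define yourself.

```python
import json

cancel_hotel_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:CancelHotel",  # placeholder
    "ResultPath": "$.CancelHotelResult",
    "Retry": [{
        "ErrorEquals": ["States.ALL"],
        "IntervalSeconds": 2,   # wait before the first retry
        "BackoffRate": 2.0,     # double the wait on each further retry
        "MaxAttempts": 5,       # bounded, rather than retrying forever
    }],
    # Once the retries are exhausted, hand over to a (hypothetical)
    # alerting state so that a human can step in.
    "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.CancelHotelError",
        "Next": "NotifyOperator",
    }],
    # CancelHotel is the last compensation, so the saga then ends as failed.
    "Next": "Fail",
}

print(json.dumps(cancel_hotel_state, indent=2))
```

Step Functions applies the `Retry` policy first and only consults `Catch` after the retries run out, so the alerting state is reached only when the compensation keeps failing.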

You can find the source code for this demo here.