How to implement Durable Execution for Lambda (without frameworks)

Yan Cui

I help clients go faster for less using serverless technologies.

“Durable Execution” means a system can execute workflows reliably and remember progress even when failures occur. It typically involves:

  1. Persisting progress to avoid repeating side-effects.
  2. Recovering gracefully from failures.
  3. Maintaining a history of executions.

It’s commonly associated with workflow orchestrators such as AWS Step Functions. AWS Lambda doesn’t support durability by itself.

How Restate Cloud does it

But, as shown by Restate [1], you can add durable execution to Lambda. You can see this in action in my conversation [2] with Jack Kleeman, who created Restate Cloud [3].

Restate Cloud achieves this by creating checkpoints in your code. For example, you can use ctx.run to execute a step durably, as below:

greet: async (ctx: restate.Context, name: string) => {
  // this step is executed durably (i.e. the result is saved in the db)
  // and on subsequent executions, the result is fetched from the db
  // instead of executing the enclosed code again
  // hence preventing repeating the side-effect of sending a notification
  await ctx.run(() => sendNotification(name));

  // do other stuff...

  return `You said hi to ${name}!`;
}

Of course, Restate Cloud does much more than durable executions (check my conversation [2] with Jack for more).

Restate is a complete framework for building resilient applications. As a framework, it wants to do things in a certain way. Using it means adapting your existing code to its specific style and structure.

If reliability is crucial, this rewrite might be worth it, especially since Restate can be simpler to adopt than Step Functions.

However, sometimes you just want basic durability for an existing Lambda function without the extra features. This was my situation in a recent project [4] building an AI-powered code reviewer called Evolua [5].

Why we needed durable execution

We had a critical Lambda triggered by GitHub webhooks through EventBridge.

The function performs several steps:

  1. Comment on the PR to indicate it started reviewing changes.
  2. Analyze code using Bedrock or Anthropic.
  3. Save analysis results to DynamoDB.
  4. Create a review thread on GitHub with the findings.

A Lumigo [6] trace of the function shows that it calls quite a few different APIs. For large PRs, it can make hundreds of API calls!

The function benefits from Lambda’s built-in retry (two retries) and DLQ mechanism for async invocations. However, repeating previous side-effects (like GitHub comments) during retries wasn’t acceptable.

That’s why we needed durable executions.

But adopting Restate or rewriting the function with Step Functions wasn’t ideal because both would require a significant amount of effort. In the case of Step Functions, it’d also complicate testing and development.

How we implemented durable execution for Lambda

Instead, we took inspiration from Restate to create a simple DynamoDB-based checkpoint system. The crux of it is these 10 lines of code:

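export const executeIdempotently = async <T>(uniqId: string, f: () => Promise<T>): Promise<T> => {
  // look for a checkpoint saved by a previous attempt
  const result = await getResult(uniqId);

  // note: this is a truthiness check, so a falsy saved result
  // (e.g. 0 or an empty string) would cause the step to run again
  if (result) {
    return result;
  }

  // no checkpoint yet, so run the step for the first time
  const newResult = await f();

  // save the result so retries can skip this step
  await putResult(uniqId, newResult);
  return newResult;
};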

It stores and retrieves results using a unique ID.
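The getResult and putResult helpers aren't shown here. As a rough sketch of the idea (not the code from the project), the version below hides storage behind a hypothetical CheckpointStore interface. In a real function that interface would wrap DynamoDB reads and writes; here an in-memory Map stands in so the sketch runs without AWS credentials. All names in it are assumptions for illustration. It also wraps each result in a { value } envelope, so a step that returns 0, an empty string or nothing at all is still recognised as completed.

```typescript
// Hypothetical storage interface. In production this would wrap
// DynamoDB get/put calls; the names are illustrative assumptions.
interface CheckpointStore {
  get(id: string): Promise<string | undefined>;
  put(id: string, json: string): Promise<void>;
}

// In-memory stand-in so the sketch runs without AWS credentials.
class InMemoryCheckpointStore implements CheckpointStore {
  private readonly items = new Map<string, string>();

  async get(id: string): Promise<string | undefined> {
    return this.items.get(id);
  }

  async put(id: string, json: string): Promise<void> {
    this.items.set(id, json);
  }
}

// Builds an executeIdempotently-style helper on top of a given store.
const makeExecuteIdempotently = (store: CheckpointStore) =>
  async <T>(uniqId: string, f: () => Promise<T>): Promise<T> => {
    const saved = await store.get(uniqId);

    // A stored envelope means the step already ran, even if its
    // result was falsy or undefined.
    if (saved !== undefined) {
      return (JSON.parse(saved) as { value: T }).value;
    }

    const value = await f();
    await store.put(uniqId, JSON.stringify({ value }));
    return value;
  };
```

Because results go through JSON, this sketch only suits steps that return JSON-serialisable values, such as IDs, strings and plain objects.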

To use this helper function, we write something like this:

const initCommentId = await executeIdempotently(
  `${IDEMPOTENCY_PREFIX}-sendStartReviewMessage`,
  () => commentOnPR(pullRequest.node_id, getStartReviewMessage(username))
);

During retries, this step returns the previously saved result instead of repeating the action.
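To make the retry behaviour concrete, here is a small self-contained simulation. It is a sketch under assumptions: the checkpoint table is an in-memory Map rather than DynamoDB, the two steps (fakeCommentOnPR and flakyAnalyze) are made-up stand-ins, and invokeWithRetries only imitates Lambda re-invoking the handler after a failure.

```typescript
// In-memory checkpoint table standing in for DynamoDB (assumption).
const checkpoints = new Map<string, string>();

const executeIdempotentlyInMemory = async <T>(
  uniqId: string,
  f: () => Promise<T>
): Promise<T> => {
  const saved = checkpoints.get(uniqId);
  if (saved !== undefined) {
    return (JSON.parse(saved) as { value: T }).value;
  }
  const value = await f();
  checkpoints.set(uniqId, JSON.stringify({ value }));
  return value;
};

// Records every side-effect that actually happens.
const sideEffects: string[] = [];
let analysisAttempts = 0;

// Made-up step: always succeeds.
const fakeCommentOnPR = async (): Promise<string> => {
  sideEffects.push("comment");
  return "comment-1";
};

// Made-up step: fails on the first attempt, succeeds afterwards.
const flakyAnalyze = async (): Promise<string> => {
  analysisAttempts++;
  if (analysisAttempts === 1) {
    throw new Error("simulated timeout");
  }
  sideEffects.push("analyze");
  return "findings";
};

const handler = async (prefix: string): Promise<string> => {
  const commentId = await executeIdempotentlyInMemory(`${prefix}-comment`, fakeCommentOnPR);
  const findings = await executeIdempotentlyInMemory(`${prefix}-analyze`, flakyAnalyze);
  return `${commentId}:${findings}`;
};

// Imitates Lambda retrying a failed async invocation (two retries).
const invokeWithRetries = async (prefix: string, maxRetries = 2): Promise<string> => {
  for (let attempt = 0; ; attempt++) {
    try {
      return await handler(prefix);
    } catch (err) {
      if (attempt >= maxRetries) {
        throw err;
      }
    }
  }
};
```

In this simulation the first attempt posts the comment and then fails during analysis. The retry finds the comment's checkpoint, skips that step and only reruns the analysis, so the comment is posted once.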

To ensure consistency and uniqueness, the IDEMPOTENCY_PREFIX consists of:

  • A unique event ID, generated by our code and captured in our custom event envelope [7].
  • The name of the function.
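As an illustration of how such a prefix could be assembled, here is a minimal sketch. The envelope shape, the field names, the separator and the ordering are all assumptions, not the actual format used in the project. In a Lambda handler, the function name could come from context.functionName.

```typescript
// Hypothetical event envelope: field names are assumptions.
interface EventEnvelope<T> {
  id: string; // unique event ID generated by the publishing code
  payload: T;
}

// Combines the function name and event ID into one prefix.
// The "-" separator and the ordering are illustrative choices.
const buildIdempotencyPrefix = <T>(
  envelope: EventEnvelope<T>,
  functionName: string
): string => `${functionName}-${envelope.id}`;
```

Since a retried invocation receives the same event, it produces the same prefix and therefore finds the same checkpoints. A different event, or a different function handling the same event, gets its own set of keys.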

Summary

This simple approach provides durability to Lambda functions by:

  • Using checkpoints to save progress and avoid repeating side-effects.
  • Relying on Lambda’s built-in retry mechanism.
  • Keeping execution history in DynamoDB.

Since it’s just standard Lambda code, my usual testing strategy [8] still applies. This includes using remocal tests to achieve a good local development experience and a fast feedback loop.

We also have good visibility into what the system is doing through Lumigo. The visual workflow from Step Functions would be nice but it’s not necessary for troubleshooting.

This is a simple solution to a simple use case, just the way I like it!

Links

[1] Restate

[2] Is Restate.dev the Step Functions killer?

[3] Restate Cloud

[4] How we built an AI code reviewer with serverless and Bedrock

[5] Evolua – ship better code faster with automated code reviews

[6] Lumigo – the best observability platform for serverless applications

[7] EventBridge best practice: why you should wrap events in event envelopes

[8] My testing strategy for serverless applications
