CloudWatch Events lets you easily create cron jobs with Lambda. However, it’s not designed for running lots of ad-hoc tasks, each to be executed once at a specific time. The default limit on CloudWatch Events is a lowly 100 rules per region per account. It’s a soft limit, so you can request an increase. But the low initial limit suggests it’s not designed for use cases where you need to schedule millions of ad-hoc tasks.
CloudWatch Events is designed for executing recurring tasks.
It’s possible to schedule a one-off task in just about every programming language. For example, .Net has the Timer class and JavaScript has the setInterval function. But I often find myself wanting a service abstraction to work with. There are many use cases for such a service, for example:
- A tournament system for games would need to execute business logic when the tournament starts and finishes.
- An event system (think eventbrite.com or meetup.com) would need a mechanism to send out timely reminders to attendees.
- A to-do tracker (think wunderlist) would need a mechanism to send out reminders when a to-do task is due.
However, AWS does not offer a service for this type of workload. CloudWatch Events is the closest thing, but as discussed above it’s not intended for these use cases. You can, however, implement them using cron jobs. But such implementations have other challenges.
I have implemented such a service abstraction a few times in my career already. I experimented with a number of different approaches:
- cron job (with CloudWatch Events)
- wrapping the .Net Timer class as an HTTP endpoint
- using SQS Visibility Timeout to hide tasks until they’re due
And lately I have seen a number of folks use DynamoDB Time-To-Live (TTL) to implement these ad-hoc tasks. In this post, we will take a look at this approach and see where it might be applicable for you.
How do we measure the approach?
For this type of ad-hoc task, we normally care about:
- Precision: how close to my scheduled time is the task executed? The closer the better.
- Scale (number of open tasks): can the solution scale to support many open tasks, i.e. tasks that are scheduled but not yet executed?
- Scale (hotspots): can the solution scale to execute many tasks around the same time? E.g. millions of people set a timer to remind themselves to watch the Super Bowl, so all the timers fire within close proximity of kickoff time.
DynamoDB TTL as a scheduling mechanism
From a high level this approach looks like this:
- a scheduled_items DynamoDB table which holds all the tasks that are scheduled for execution.
- a scheduler function that writes the scheduled task into the scheduled_items table, with the TTL set to the scheduled execution time.
- an execute-on-schedule function that subscribes to the DynamoDB Stream for scheduled_items and reacts to REMOVE events. These events correspond to items being deleted from the table.
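The three pieces above can be sketched roughly as follows. This is a minimal illustration rather than the code used for this post: the attribute names `id`, `payload` and `ttl` are assumptions, and the handler assumes the stream's view type includes the old image. The `userIdentity` check relies on the fact that DynamoDB marks TTL deletions in the stream with a service principal, which lets you tell them apart from items that were deleted manually.

```python
import time
import uuid

TABLE_NAME = "scheduled_items"  # table name used in this post
TTL_ATTRIBUTE = "ttl"           # assumed name; must match the table's TTL setting


def build_scheduled_item(payload: str, execute_at_epoch: int) -> dict:
    """Build a low-level DynamoDB item whose TTL is the scheduled execution time."""
    return {
        "id": {"S": str(uuid.uuid4())},
        "payload": {"S": payload},
        # A TTL attribute must be a Number holding epoch time in seconds.
        TTL_ATTRIBUTE: {"N": str(int(execute_at_epoch))},
    }


def is_ttl_removal(record: dict) -> bool:
    """True for REMOVE stream records that were produced by the TTL process."""
    if record.get("eventName") != "REMOVE":
        return False
    identity = record.get("userIdentity") or {}
    return (
        identity.get("type") == "Service"
        and identity.get("principalId") == "dynamodb.amazonaws.com"
    )


def handler(event: dict, context=None, now=time.time) -> list:
    """execute-on-schedule: collect every expired task in the stream batch."""
    executed = []
    for record in event.get("Records", []):
        if not is_ttl_removal(record):
            continue
        # OldImage is only present when the stream view type includes old images.
        old_image = record["dynamodb"]["OldImage"]
        scheduled_for = int(old_image[TTL_ATTRIBUTE]["N"])
        executed.append({
            "id": old_image["id"]["S"],
            "payload": old_image["payload"]["S"],
            # How late the task was picked up relative to its scheduled time.
            "delay_seconds": now() - scheduled_for,
        })
    return executed
```

In a real scheduler function, the item would be written with something like boto3's `put_item(TableName=TABLE_NAME, Item=build_scheduled_item(...))`. Keeping the item builder and the record filter as plain functions makes both easy to test without touching AWS.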
Scalability (number of open tasks)
Since the number of open tasks just translates to the number of items in the scheduled_items table, this approach can scale to millions of open tasks.
DynamoDB can handle large throughputs (thousands of TPS) too. So this approach can also be applied to scenarios where thousands of items are scheduled per second.
Scalability (hotspots)
When many items are deleted at the same time, they are simply queued in the DynamoDB Stream. AWS also autoscales the number of shards in the stream, so as throughput increases the number of shards goes up accordingly.
But, events are processed in sequence. So it can take some time for your function to process the event depending on:
- its position in the stream, and
- how long it takes to process each event.
So, while this approach can scale to support many tasks all expiring at the same time, it cannot guarantee that tasks are executed on time.
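To get a feel for how quickly this adds up, here is a back-of-the-envelope model. It is only a sketch under simplifying assumptions: events are spread evenly across shards, each shard is processed one batch at a time, and every batch takes the same amount of time. The function names and all the numbers are hypothetical.

```python
import math


def estimated_wait_seconds(position: int, batch_size: int,
                           seconds_per_batch: float) -> float:
    """Rough time until the record at `position` (0-based) in a shard is processed."""
    # The record is done once the batch that contains it has finished.
    batches_until_done = math.ceil((position + 1) / batch_size)
    return batches_until_done * seconds_per_batch


def worst_case_wait_seconds(total_events: int, shards: int, batch_size: int,
                            seconds_per_batch: float) -> float:
    """Wait for the last record, assuming events are spread evenly across shards."""
    events_per_shard = math.ceil(total_events / shards)
    return estimated_wait_seconds(events_per_shard - 1, batch_size, seconds_per_batch)


# 1,000,000 tasks expiring together, 10 shards, batches of 100 at 0.5s each:
# 100,000 events per shard -> 1,000 batches -> 500 seconds for the last task.
print(worst_case_wait_seconds(1_000_000, 10, 100, 0.5))  # 500.0
```

Even before the TTL delay is taken into account, the last task in this hypothetical hotspot would run more than eight minutes after the first.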
Precision
This is the big question about this approach. According to the official documentation, expired items are typically deleted within 48 hours of expiration. That is a huge margin of error!
As an experiment, I set up a Step Functions state machine to:
- add a configurable number of items to the scheduled_items table, with TTL expiring between 1 and 10 mins
- track the time the task is scheduled for and when it’s actually picked up by the execute-on-schedule function
- wait for all the items to be deleted
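The measurement itself boils down to comparing two timestamps per task. The helpers below are a sketch of that idea and not the code behind the state machine: `random_ttl` and `summarise_delays` are hypothetical names, and the sample numbers in the usage lines are made up for illustration and are not results from the experiment.

```python
import random
import statistics


def random_ttl(now_epoch: int, rng: random.Random) -> int:
    """Pick an expiry time between 1 and 10 minutes from now (epoch seconds)."""
    return now_epoch + rng.randint(60, 600)


def summarise_delays(observations) -> dict:
    """Summarise (scheduled_epoch, picked_up_epoch) pairs into delay statistics."""
    delays = [picked_up - scheduled for scheduled, picked_up in observations]
    return {
        "count": len(delays),
        "mean_seconds": statistics.mean(delays),
        "min_seconds": min(delays),
        "max_seconds": max(delays),
    }


# Made-up sample: two tasks picked up 300s and 500s after their scheduled time.
summary = summarise_delays([(1_000, 1_300), (2_000, 2_500)])
print(summary["mean_seconds"])  # 400
```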
The state machine looks like this:
I performed several runs of tests. The results are consistent regardless of the number of items in the table. A quick glimpse at the table tells you that, on average, a task is executed over 11 mins AFTER its scheduled time.
I repeated the experiments in several other AWS regions:
I don’t know why there is such a marked difference between US-EAST-1 and the other regions. One possible explanation is that the TTL process requires a bit of time to kick in after a table is created. Since I was developing against the US-EAST-1 region initially, its TTL process had been “warmed” compared to the other regions.
Based on the results of my experiment, it would appear that using DynamoDB TTL as a scheduling mechanism cannot guarantee reasonable precision.
On the one hand, the approach scales very well. But on the other, the scheduled tasks are executed at least several minutes late, which renders it unsuitable for many use cases.
Read about other approaches
- Using CloudWatch and Lambda to implement ad-hoc scheduling
- Step Functions as an ad-hoc scheduling mechanism
Hi, I’m Yan. I’m an AWS Serverless Hero and I help companies go faster for less by adopting serverless technologies successfully.