You can become a serverless blackbelt. Enrol to my 4-week online workshop Production-Ready Serverless and gain hands-on experience building something from scratch using serverless technologies. At the end of the workshop, you should have a broader view of the challenges you will face as your serverless architecture matures and expands. You should also have a firm grasp on when serverless is a good fit for your system as well as common pitfalls you need to avoid. Sign up now and get 15% discount with the code yanprs15!
At DAZN (where I no longer work), the teams work with a number of third-party providers. They often have to synchronize data between different AWS accounts. SNS to SQS is the primary mechanism for these cross-account deliveries because:
- it was an established pattern within the organization
- DAZN engineers and third-party engineers are both familiar with SNS and SQS, as well as using Lambda to process SQS events
- Lambda auto-scales the concurrency executions based on load
DAZN has many millions of subscribers worldwide and serves over a million concurrent viewers during live events. Its traffic pattern is very spiky and centres around these live sporting events.
Many of DAZN’s microservices live in the same AWS account (they are still in the process of moving to one account per team per environment). So these microservices contend for the same regional limits such as the number of concurrent Lambda executions.
One of the miroservices ingests events from a third-party AWS account and immediately pushes them to a Kinesis stream.
This microservice experiences large bursts of traffic immediately before a sporting event starts. Unfortunately, many other microservices also experience these spikes at the same time!
Because SQS auto-scales the number of concurrent executions, the Lambda function (that you see above) uses up too much of the available concurrency. It causes Lambda throttling events in the region, to both itself, as well as to other functions in the same region.
There is time pressure to find a suitable solution, so my buddies at DAZN reached out and we brainstormed some solutions.
Solutions 1 – Lambda concurrency limit
Setting a reserved concurrency for the SQS function would be the simplest solution. However, you’d have to deal with the fallout from this decision. There are problems with Lambda concurrency limit and SQS. I suggest you read this and this post for more information on this. The root problem is that there is a disconnect between the number of SQS pollers (managed by Lambda) and your function’s concurrency limit.
If the SQS poller is repeatedly throttled when attempting to forward a message to your function. Then the message can be redirected to the DLQ before your function gets a chance to process it. This would be the worst-case scenario.
Even if a message is throttled just once and processed after the visibility timeout, it can still cause havoc. This delay (the visibility timeout) allows follow-up events to precede this original message in the Kinesis stream. This ordering issue already exists today because normal SQS queues do not preserve the ordering of events. However, it affects less than 1% of customers and the team does not feel it’s a significant issue. With throttling and retries, it becomes a much more pressing problem for downstream functions.
Solution 2 – use a separate AWS account
Moving this microservice into its own account would alleviate the contention issue (for concurrent executions). Doing this have other benefits too, and is a task that is already in the pipeline. However, the third-party vendor does not currently allow SQS subscription from another DAZN-owned AWS account.
Solution 3 – switch to EventBridge
Switching to EventBridge would be another option. SNS supports few targets for cross-account delivery – HTTP, SQS or Lambda. EventBridge can deliver to far more targets, including Kinesis streams, ECS tasks, Step Functions and more. However, this requires significant changes from the third-party. Or it involves creating a DAZN-side sink in the main account and then use EventBridge to fan out to other accounts (see below).
This could be a viable solution and offers a lot of flexibility going forward. But it also faces a number of challenges, such as:
- If the third-party doesn’t change to EventBridge then you still have the same concurrency issue with the SQS function.
- It requires coordination to move multiple teams to an unfamiliar service.
Most importantly, it’s not a simple change and there is time pressure at play.
Solution 4 – go direct from SNS to Kinesis (via API Gateway)
Instead of going through SQS and Lambda, you can go directly to Kinesis via an API Gateway service proxy. This means we’d subscribe an HTTPs endpoint to the third-party SNS topic instead of SQS.
This removes Lambda from the equation completely. However, API Gateway has its own throttling and contention issue. By default, API Gateway has a regional limit of 10,000 reqs/s (for all APIs). Fortunately, this is a soft limit and can be raised via a support ticket.
This was an interesting idea, so I built a simple proof-of-concept to see how it could work. You can find the source code for the demo project on GitHub here.
Connecting SNS to Kinesis via API Gateway
There are a couple of things to note:
- When you subscribe an endpoint to an SNS topic, SNS would first send a
POSTrequest to the endpoint to confirm the subscription. This page explains the confirmation flow.
POSTrequest contains a JSON payload like the following. You need to send a
GETrequest to the
SubscribeURLto confirm the subscription.
- You would need to subscribe a Lambda function to the Kinesis stream to perform the request confirmation.
- Weirdly, the
POSTrequest uses the content type
text/plain. So you would need a custom request template mapping in API Gateway for
- You would also need to write some custom VTL code to map the request to a Kinesis
The plugin makes it easy to set up service proxies for API Gateway. All I needed was to add some configuration like the following. Notice that I used
Fn::Sub to weave the stream name into the VTL code to avoid hard coding.
This configuration adds a
/kinesis endpoint to the API, which forwards the requests from SNS to our Kinesis stream.
In the demo project, there is also Lambda function which is subscribed to the Kinesis stream. This function is responsible for confirming the subscription request.
However, this function only needs to run once – when the SNS topic sends its confirmation request. It would continue to receive all subsequent events and would just ignore them. That seems like such a waste!
What if the function can disable itself after confirming the subscription?
You can do just that by disabling the function’s event source mapping.
Of course, you would need the relevant IAM permissions for that.
Trying it out!
Once the project is deployed, go to the SNS topic. Create a new subscription against the
In the Kinesis function’s logs, you should see a
SubscriptionConfirmation event from SNS.
After that, you should see the logs to indicate the function is attempting to disable its Kinesis event source mapping.
Now go to the Lambda console, and find the Kinesis function. Click on the Kinesis event source, and you should see its status changed to
Meanwhile, if you go back to the SNS topic then you should see the subscription has been confirmed. If you publish a message to the topic then the message would be recorded in the stream but would not invoke the Kinesis function.
So that’s it, I hope you enjoyed that! It was a fun little thought experiment and demo project for a nice weekend.
Hi, I’m Yan. I’m an AWS Serverless Hero and I help companies go faster for less by adopting serverless technologies successfully.
Are you struggling with serverless or need guidance on best practices? Do you want someone to review your architecture and help you avoid costly mistakes down the line? Whatever the case, I’m here to help.
Skill up your serverless game with this hands-on workshop.
My 4-week Production-Ready Serverless online workshop is back!
This course takes you through building a production-ready serverless web application from testing, deployment, security, all the way through to observability. The motivation for this course is to give you hands-on experience building something with serverless technologies while giving you a broader view of the challenges you will face as the architecture matures and expands.
We will start at the basics and give you a firm introduction to Lambda and all the relevant concepts and service features (including the latest announcements in 2020). And then gradually ramping up and cover a wide array of topics such as API security, testing strategies, CI/CD, secret management, and operational best practices for monitoring and troubleshooting.
If you enrol now you can also get 15% OFF with the promo code “yanprs15”.
Check out my new podcast Real-World Serverless where I talk with engineers who are building amazing things with serverless technologies and discuss the real-world use cases and challenges they face. If you’re interested in what people are actually doing with serverless and what it’s really like to be working with serverless day-to-day, then this is the podcast for you.
Check out my new course, Learn you some Lambda best practice for great good! In this course, you will learn best practices for working with AWS Lambda in terms of performance, cost, security, scalability, resilience and observability. We will also cover latest features from re:Invent 2019 such as Provisioned Concurrency and Lambda Destinations. Enrol now and start learning!
Check out my video course, Complete Guide to AWS Step Functions. In this course, we’ll cover everything you need to know to use AWS Step Functions service effectively. There is something for everyone from beginners to more advanced users looking for design patterns and best practices. Enrol now and start learning!
Here is a complete list of all my posts on serverless and AWS Lambda. In the meantime, here are a few of my most popular blog posts.
- All you need to know about caching for serverless applications
- Lambda optimization tip – enable HTTP keep-alive
- You are wrong about serverless and vendor lock-in
- You are thinking about serverless costs all wrong
- Just how expensive is the full AWS SDK?
- Check-list for going live with API Gateway and Lambda
- How to choose the right API Gateway auth method
- CloudFormation protip: use !Sub instead of !Join
- AWS Lambda – should you have few monolithic functions or many single-purposed functions?
- Guys, we’re doing pagination wrong
- Top 10 Serverless framework best practices
- How to break the “senior engineer” career ceiling
- My advice to junior developers