At DAZN (where I no longer work), the teams work with a number of third-party providers. They often have to synchronize data between different AWS accounts. SNS to SQS is the primary mechanism for these cross-account deliveries because:
- it was an established pattern within the organization
- DAZN engineers and third-party engineers are both familiar with SNS and SQS, as well as using Lambda to process SQS events
- Lambda auto-scales the number of concurrent executions based on load
DAZN has many millions of subscribers worldwide and serves over a million concurrent viewers during live events. Its traffic pattern is very spiky and centres around these live sporting events.
Many of DAZN’s microservices live in the same AWS account (they are still in the process of moving to one account per team per environment). So these microservices contend for the same regional limits such as the number of concurrent Lambda executions.
One of the microservices ingests events from a third-party AWS account and immediately pushes them to a Kinesis stream.
This microservice experiences large bursts of traffic immediately before a sporting event starts. Unfortunately, many other microservices also experience these spikes at the same time!
Because Lambda auto-scales the number of concurrent executions for SQS event sources, the Lambda function (that you see above) uses up too much of the available concurrency. This causes Lambda throttling in the region, both for itself and for other functions in the same region.
There is time pressure to find a suitable solution, so my buddies at DAZN reached out and we brainstormed some solutions.
Solution 1 – Lambda concurrency limit
Setting a reserved concurrency for the SQS function would be the simplest solution. However, you’d have to deal with the fallout from this decision. There are problems with Lambda concurrency limit and SQS. I suggest you read this and this post for more information on this. The root problem is that there is a disconnect between the number of SQS pollers (managed by Lambda) and your function’s concurrency limit.
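For reference, reserved concurrency is set through Lambda's PutFunctionConcurrency API. The snippet below is a minimal sketch rather than code from the demo project. It assumes you pass in a boto3 Lambda client, and the function name and limit in the usage note are made-up values.

```python
from typing import Any


def reserve_concurrency(lambda_client: Any, function_name: str, limit: int) -> dict:
    """Cap a function's concurrency via Lambda's PutFunctionConcurrency API.

    `lambda_client` is expected to be a boto3 Lambda client, e.g.
    boto3.client("lambda"). It is passed in so this sketch has no hard
    dependency on boto3.
    """
    if limit < 0:
        raise ValueError("reserved concurrency cannot be negative")
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )
```

With a real client this would be called as `reserve_concurrency(boto3.client("lambda"), "sns-ingest", 50)`, where both the name and the number are hypothetical.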
If the SQS poller is repeatedly throttled when attempting to forward a message to your function, then the message can be redirected to the DLQ before your function gets a chance to process it. This would be the worst-case scenario.
Even if a message is throttled just once and processed after the visibility timeout, it can still cause havoc. This delay (the visibility timeout) allows follow-up events to precede this original message in the Kinesis stream. This ordering issue already exists today because normal SQS queues do not preserve the ordering of events. However, it affects less than 1% of customers and the team does not feel it’s a significant issue. With throttling and retries, it becomes a much more pressing problem for downstream functions.
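To make both failure modes concrete, here is a deliberately simplified model of them. It assumes every throttled attempt counts as one receive and delays the next attempt by exactly one visibility timeout. The timings and the `max_receive_count` value are made up for illustration and are not DAZN's settings.

```python
from typing import Optional


def delivery_time(
    sent_at: float,
    throttled_attempts: int,
    visibility_timeout: float,
    max_receive_count: int,
) -> Optional[float]:
    """Return when a message is finally processed, or None if it is dead-lettered.

    Simplified model: each throttled attempt uses up one receive and pushes
    the next attempt back by one visibility timeout.
    """
    if throttled_attempts >= max_receive_count:
        return None  # receives exhausted before the function ever ran
    return sent_at + throttled_attempts * visibility_timeout


# Two events for the same customer, sent one second apart (hypothetical values).
first = delivery_time(sent_at=0.0, throttled_attempts=1,
                      visibility_timeout=30.0, max_receive_count=3)
follow_up = delivery_time(sent_at=1.0, throttled_attempts=0,
                          visibility_timeout=30.0, max_receive_count=3)

# The follow-up event reaches the stream before the original one.
print(first, follow_up, follow_up < first)
```

In this model a single throttle is enough to reorder the two events, and three throttles with a `max_receive_count` of 3 send the message to the DLQ without it ever being processed.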
Solution 2 – use a separate AWS account
Moving this microservice into its own account would alleviate the contention issue (for concurrent executions). Doing this has other benefits too, and is a task that is already in the pipeline. However, the third-party vendor does not currently allow SQS subscription from another DAZN-owned AWS account.
Solution 3 – switch to EventBridge
Switching to EventBridge would be another option. SNS supports only a few targets for cross-account delivery – HTTP, SQS or Lambda. EventBridge can deliver to far more targets, including Kinesis streams, ECS tasks, Step Functions and more. However, this requires significant changes from the third party. Alternatively, it involves creating a DAZN-side sink in the main account and then using EventBridge to fan out to other accounts (see below).
This could be a viable solution and offers a lot of flexibility going forward. But it also faces a number of challenges, such as:
- If the third-party doesn’t change to EventBridge then you still have the same concurrency issue with the SQS function.
- It requires coordination to move multiple teams to an unfamiliar service.
Most importantly, it’s not a simple change and there is time pressure at play.
Solution 4 – go direct from SNS to Kinesis (via API Gateway)
Instead of going through SQS and Lambda, you can go directly to Kinesis via an API Gateway service proxy. This means we’d subscribe an HTTPS endpoint to the third-party SNS topic instead of SQS.
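As a rough illustration, subscribing an HTTPS endpoint uses the same SNS Subscribe API as any other protocol. The sketch below assumes a boto3 SNS client is passed in. The topic ARN and URL in the test usage are placeholders, and for a third-party topic the topic owner would still need to allow the subscription.

```python
from typing import Any


def subscribe_https_endpoint(sns_client: Any, topic_arn: str, endpoint_url: str) -> str:
    """Subscribe an HTTPS endpoint to an SNS topic.

    `sns_client` is expected to be a boto3 SNS client, e.g. boto3.client("sns").
    Until the endpoint confirms the subscription, SNS reports the
    subscription ARN as "pending confirmation".
    """
    if not endpoint_url.startswith("https://"):
        raise ValueError("the https protocol requires an https:// endpoint")
    response = sns_client.subscribe(
        TopicArn=topic_arn,
        Protocol="https",
        Endpoint=endpoint_url,
    )
    return response["SubscriptionArn"]
```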
This removes Lambda from the equation completely. However, API Gateway has its own throttling and contention issues. By default, API Gateway has a regional limit of 10,000 requests per second (shared by all APIs). Fortunately, this is a soft limit and can be raised via a support ticket.
This was an interesting idea, so I built a simple proof-of-concept to see how it could work. You can find the source code for the demo project on GitHub here.
Connecting SNS to Kinesis via API Gateway
There are a couple of things to note:
- When you subscribe an endpoint to an SNS topic, SNS would first send a POST request to the endpoint to confirm the subscription. This page explains the confirmation flow. The POST request contains a JSON payload that includes a SubscribeURL. You need to send a GET request to the SubscribeURL to confirm the subscription.
- You would need to subscribe a Lambda function to the Kinesis stream to perform the request confirmation.
- Weirdly, the POST request uses the content type text/plain. So you would need a custom request template mapping in API Gateway for that content type.
- You would also need to write some custom VTL code to map the request to a Kinesis request.
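The confirmation step from the list above can be sketched roughly as follows. This is an illustration in Python, not the demo project's code. The sample payload is trimmed, its values are placeholders (the URL is not a real SNS confirmation URL), and the `get` parameter is a hypothetical hook so the logic can be exercised without network access.

```python
import json
import urllib.request
from typing import Callable, Optional


def http_get(url: str) -> int:
    """Send a GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.status


def confirm_if_requested(raw_body: str, get: Callable[[str], int] = http_get) -> Optional[str]:
    """Confirm an SNS subscription if the body is a confirmation request.

    SNS sends the body as text/plain, so it is parsed as JSON explicitly.
    Returns the URL that was visited, or None for any other message type.
    """
    message = json.loads(raw_body)
    if message.get("Type") != "SubscriptionConfirmation":
        return None
    url = message["SubscribeURL"]
    get(url)
    return url


# Trimmed example of a confirmation payload; all values are placeholders.
sample = json.dumps({
    "Type": "SubscriptionConfirmation",
    "TopicArn": "arn:aws:sns:eu-west-1:123456789012:example-topic",
    "Token": "example-token",
    "SubscribeURL": "https://sns.example.invalid/confirm?Token=example-token",
})
```

Ordinary notifications carry a different Type value, so the function leaves them alone and returns None.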
The plugin makes it easy to set up service proxies for API Gateway. All I needed was to add some configuration like the following. Notice that I used Fn::Sub to weave the stream name into the VTL code to avoid hard coding.
This configuration adds a /kinesis endpoint to the API, which forwards the requests from SNS to our Kinesis stream.
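The real mapping lives in VTL inside the API Gateway configuration, so the following is only a Python model of the transformation it needs to perform: wrapping the raw text/plain body from SNS into the JSON body of a Kinesis PutRecord call. The choice of partition key is an assumption here, as are the stream name and key used in the usage note.

```python
import base64
import json


def to_put_record_request(sns_body: str, stream_name: str, partition_key: str) -> str:
    """Build the JSON body of a Kinesis PutRecord call from the raw SNS POST body."""
    return json.dumps({
        "StreamName": stream_name,
        # The Kinesis JSON API expects the record data to be base64-encoded.
        "Data": base64.b64encode(sns_body.encode("utf-8")).decode("ascii"),
        "PartitionKey": partition_key,
    })
```

For example, `to_put_record_request(body, "my-stream", "request-id-123")` yields a JSON string with the three fields that PutRecord requires. Consumers of the stream then base64-decode Data to get the original SNS message back.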
In the demo project, there is also a Lambda function that is subscribed to the Kinesis stream. This function is responsible for confirming the subscription request.
However, this function only needs to run once – when the SNS topic sends its confirmation request. It would continue to receive all subsequent events and would just ignore them. That seems like such a waste!
What if the function can disable itself after confirming the subscription?
You can do just that by disabling the function’s event source mapping.
Of course, you would need the relevant IAM permissions for that.
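One way to do this, sketched in Python under the assumption that a boto3 Lambda client is passed in, is to look up the function's mapping for the stream and set Enabled to False. The function name and stream ARN are whatever your deployment uses. The IAM actions involved are lambda:ListEventSourceMappings and lambda:UpdateEventSourceMapping. Pagination is ignored for brevity.

```python
from typing import Any, List


def disable_kinesis_mapping(lambda_client: Any, function_name: str, stream_arn: str) -> List[str]:
    """Disable every event source mapping between the function and the stream.

    `lambda_client` is expected to be a boto3 Lambda client, e.g.
    boto3.client("lambda"). Returns the UUIDs of the mappings that were disabled.
    """
    response = lambda_client.list_event_source_mappings(
        FunctionName=function_name,
        EventSourceArn=stream_arn,
    )
    disabled = []
    for mapping in response["EventSourceMappings"]:
        # Enabled=False stops Lambda from polling the stream for this function.
        lambda_client.update_event_source_mapping(UUID=mapping["UUID"], Enabled=False)
        disabled.append(mapping["UUID"])
    return disabled
```

Disabling the mapping takes a moment to apply, so the function may still be invoked with a few more batches before it goes quiet.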
Trying it out!
Once the project is deployed, go to the SNS topic. Create a new subscription against the
In the Kinesis function’s logs, you should see a SubscriptionConfirmation event from SNS.
After that, you should see logs indicating that the function is attempting to disable its Kinesis event source mapping.
Now go to the Lambda console, and find the Kinesis function. Click on the Kinesis event source, and you should see its status changed to
Meanwhile, if you go back to the SNS topic then you should see the subscription has been confirmed. If you publish a message to the topic then the message would be recorded in the stream but would not invoke the Kinesis function.
So that’s it, I hope you enjoyed that! It was a fun little thought experiment and demo project for a nice weekend.
I’m an AWS Serverless Hero and the author of Production-Ready Serverless. I have run production workload at scale in AWS for nearly 10 years and I have been an architect or principal engineer with a variety of industries ranging from banking, e-commerce, sports streaming to mobile gaming. I currently work as an independent consultant focused on AWS and serverless.