Yubl’s road to Serverless architecture – building a scalable push notification system

You can become a serverless blackbelt. Enrol to my 4-week online workshop Production-Ready Serverless and gain hands-on experience building something from scratch using serverless technologies. At the end of the workshop, you should have a broader view of the challenges you will face as your serverless architecture matures and expands. You should also have a firm grasp on when serverless is a good fit for your system as well as common pitfalls you need to avoid. Sign up now and get 15% discount with the code yanprs15!

part 1: overview

part 2: testing and CI/CD

part 3: ops

part 4: building a scalable notification system <- you’re here

part 5: building a better recommendation system


Just before Yubl’s untimely demise we did an interesting piece of work to redesign the system for sending targeted push notifications to our users to improve retention.

The old system relied on MixPanel for both selecting users as well as sending out the push notifications. Whilst MixPanel was great for getting us basic analytics quickly, we soon found our use cases outgrew MixPanel. The most pressing limitation was that we were not able to query users based on their social graph to create target push notifications – eg. notify an influencer’s followers when he/she publishes a new post or runs a new social media campaign.

Since all of our analytics events are streamed to Google BigQuery (using a combination of Kinesis Firehose, S3 and Lambda) we have all the data we need to support the complex use cases the product team has.

What we needed, was a push notification system that can integrate with BigQuery results and is capable of sending millions of push notifications in a batch.

Design Goals

From a high level, we need to support 2 types of notifications.

Ad-hoc notifications are driven by the marketing team, working closely with influencers and the BI team to match users with influencers or contents that they might be interested in. Example notifications include:

  • users who follow Accessorize and other fashion brands might be interested to know when another notable fashion brand joins the platform
  • users who follow an influencer might be interested to know when the influencer publishes a new post or is running a social media campaign (usually with give-away prizes, etc.)
  • users who have shared/liked music related contents might be interested to know that Tinie Tempah has joined the platform

Scheduled notifications are driven by the product team, these notifications are designed to nudge users to finish the sign up process or to come back to the platform after they have lapsed. Example notifications include:

  • day-1 unfinished sign up : notify users who didn’t finish the sign up process to come back to complete the process
  • day-2 engagement : notify users to come back and follow more people or invite friends on day 2
  • day-21 inactive : notify users who have not logged into the app for 21 days to come back and check out what’s new

A/B testing

For the scheduled notifications, we want to test out different messages/layouts to optimise their effectiveness over time. To do that, we wanted to support A/B testing as part of the new system (which MixPanel already supports).

We should be able to create multiple variants (each with a percentage), along with a control group who will not receive any push notifications.

Oversight vs Frictionless

For the ad-hoc notifications, we don’t want to get in the way of the marketing team doing their job, so the process for creating ad-hoc push notifications should be as frictionless as possible. However, we also don’t want the marketing team to operate completely without oversight and run the risk of long term damage by spamming users with unwanted push notifications (which might cause users to disable notifications or even rage quit the app).

The compromise we reached was an automated approval process whereby:

  1. the marketing team will work with BI on a query to identify users (eg. followers of Tinie Tempah)
  2. fill in a request form, which informs designated approvers via email
  3. approvers can send themselves a test push notification to see how it will be formatted on both Android and iOS
  4. approvers can approve or reject the request
  5. once approved, the request will be executed

Implementation

We decided to use S3 as the source for a send-batch-notifications function because it allows us to pass large list of users (remember, the goal is to support sending push notifications to millions of users in a batch) without having to worry about pagination or limits on payload size.

The function will work with any JSON file in the right format, and that JSON file can be generated in many ways:

  • by the cron jobs that generate scheduled notifications
  • by the approval system after an ad-hoc push notification is approved
  • by the approval system to send a test push notification to the approvers (to visually inspect how the message will appear on both Android and iOS devices)
  • by members of the engineering team when manual interventions are required

We also considered moving to SNS but decided against it in the end because it doesn’t provide useful enough an abstraction to justify the effort to migrate (involves client work) and the additional cost for sending push notifications. Instead, we used node-gcm and apn to communicate with GCM and APN directly.

Recursive Functions FTW

Lambda has a hard limit of 5 mins execution time (it might be softened in the near future), and that might not be enough time to send millions of push notifications.

Our approach to long-running tasks like this is to run the Lambda function as a recursive function.

A naive recursive function would process the payload in fixed size batches and recurse at the end of each batch whilst passing along a token/position to allow the next invocation to continue from where it left off. In this particular case, we have additional considerations because the total number of work items can be very large:

  • minimising the no. of recursions required, which equates to no. of Invoke requests to Lambda and carries a cost implication at scale
  • caching the content of the JSON file to improve performance (by avoiding loading and parsing a large JSON file more than once) and reduce S3 cost

To minimise the no. of recursions, our function would:

  1. process the list of users in small batches of 500
  2. at the end of each batch, call context.getRemainingTimeInMillis() to check how much time is left in this invocation
  3. if there is more than 1 min left in the invocation then process another batch; otherwise recurse

When caching the content of the JSON file from S3, we also need to compare the ETAG to ensure that the content of the file hasn’t changed.

With this set up the system was able to easily handle JSON files with more than 1 million users during our load test (sorry Apple and Google for sending all those fake device tokens :-P).

Liked this article? Support me on Patreon and get direct help from me via a private Slack channel or 1-2-1 mentoring.
Subscribe to my newsletter


Hi, I’m Yan. I’m an AWS Serverless Hero and I help companies go faster for less by adopting serverless technologies successfully.

Are you struggling with serverless or need guidance on best practices? Do you want someone to review your architecture and help you avoid costly mistakes down the line? Whatever the case, I’m here to help.

Hire me.


Skill up your serverless game with this hands-on workshop.

My 4-week Production-Ready Serverless online workshop is back!

This course takes you through building a production-ready serverless web application from testing, deployment, security, all the way through to observability. The motivation for this course is to give you hands-on experience building something with serverless technologies while giving you a broader view of the challenges you will face as the architecture matures and expands.

We will start at the basics and give you a firm introduction to Lambda and all the relevant concepts and service features (including the latest announcements in 2020). And then gradually ramping up and cover a wide array of topics such as API security, testing strategies, CI/CD, secret management, and operational best practices for monitoring and troubleshooting.

If you enrol now you can also get 15% OFF with the promo code “yanprs15”.

Enrol now and SAVE 15%.


Check out my new podcast Real-World Serverless where I talk with engineers who are building amazing things with serverless technologies and discuss the real-world use cases and challenges they face. If you’re interested in what people are actually doing with serverless and what it’s really like to be working with serverless day-to-day, then this is the podcast for you.


Check out my video course, Complete Guide to AWS Step Functions. In this course, we’ll cover everything you need to know to use AWS Step Functions service effectively. There is something for everyone from beginners to more advanced users looking for design patterns and best practices. Enrol now and start learning!


Check out my video course, Complete Guide to AWS Step Functions. In this course, we’ll cover everything you need to know to use AWS Step Functions service effectively. There is something for everyone from beginners to more advanced users looking for design patterns and best practices. Enrol now and start learning!