Yubl’s road to Serverless architecture – building a scalable push notification system

part 1: overview

part 2: testing and CI/CD

part 3: ops

part 4: building a scalable notification system <- you’re here

part 5: building a better recommendation system

Just before Yubl’s untimely demise, we did an interesting piece of work to redesign the system for sending targeted push notifications to our users to improve retention.

The old system relied on MixPanel both for selecting users and for sending out the push notifications. Whilst MixPanel was great for getting us basic analytics quickly, we soon found that our use cases had outgrown it. The most pressing limitation was that we were not able to query users based on their social graph to create targeted push notifications, e.g. notify an influencer’s followers when they publish a new post or run a new social media campaign.

Since all of our analytics events were streamed to Google BigQuery (using a combination of Kinesis Firehose, S3 and Lambda), we had all the data we needed to support the complex use cases the product team had.

What we needed was a push notification system that could integrate with BigQuery results and was capable of sending millions of push notifications in a batch.

Design Goals

At a high level, we needed to support two types of notifications.

Ad-hoc notifications are driven by the marketing team, working closely with influencers and the BI team to match users with influencers or content that they might be interested in. Example notifications include:

  • users who follow Accessorize and other fashion brands might be interested to know when another notable fashion brand joins the platform
  • users who follow an influencer might be interested to know when the influencer publishes a new post or is running a social media campaign (usually with give-away prizes, etc.)
  • users who have shared/liked music-related content might be interested to know that Tinie Tempah has joined the platform

Scheduled notifications are driven by the product team. These notifications are designed to nudge users to finish the sign-up process or to come back to the platform after they have lapsed. Example notifications include:

  • day-1 unfinished sign up: notify users who didn’t finish the sign-up process to come back and complete it
  • day-2 engagement: notify users to come back and follow more people or invite friends on day 2
  • day-21 inactive: notify users who have not logged into the app for 21 days to come back and check out what’s new

A/B testing

For the scheduled notifications, we wanted to test out different messages/layouts to optimise their effectiveness over time. To do that, we wanted the new system to support A/B testing (which MixPanel already supports).

We should be able to create multiple variants (each with a percentage of the targeted users), along with a control group that will not receive any push notifications.
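As a rough illustration of the variant/control split described above, here is a minimal sketch in Node.js. The experiment shape, the variant names, the messages and the hashing scheme are all assumptions made for this example; the post does not describe how Yubl assigned users to variants. Hashing the user ID (rather than picking at random) keeps each user in the same group across runs.

```javascript
const crypto = require('crypto');

// Hypothetical experiment definition: percentages must add up to 100.
// A variant with `message: null` acts as the control group (no push is sent).
const experiment = {
  name: 'day-21-inactive',
  variants: [
    { name: 'control', percentage: 10, message: null },
    { name: 'A', percentage: 45, message: 'Come back and see what is new!' },
    { name: 'B', percentage: 45, message: 'Your friends have been busy...' }
  ]
};

// Map a user to a stable bucket in [0, 100) by hashing experiment name + user ID.
function bucketFor(experimentName, userId) {
  const digest = crypto
    .createHash('sha256')
    .update(`${experimentName}:${userId}`)
    .digest();
  return digest.readUInt32BE(0) % 100;
}

// Walk the cumulative percentages until the bucket falls inside a variant's range.
function assignVariant(exp, userId) {
  const bucket = bucketFor(exp.name, userId);
  let upperBound = 0;
  for (const variant of exp.variants) {
    upperBound += variant.percentage;
    if (bucket < upperBound) return variant;
  }
  throw new Error('variant percentages must add up to 100');
}
```

Users assigned to the control variant are simply skipped when sending, so their behaviour can later be compared against the users who did receive a message.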

Oversight vs Frictionless

For the ad-hoc notifications, we don’t want to get in the way of the marketing team doing their job, so the process for creating ad-hoc push notifications should be as frictionless as possible. However, we also don’t want the marketing team to operate completely without oversight and run the risk of long-term damage by spamming users with unwanted push notifications (which might cause users to disable notifications or even rage-quit the app).

The compromise we reached was an automated approval process whereby:

  1. the marketing team works with BI on a query to identify users (e.g. followers of Tinie Tempah)
  2. the marketing team fills in a request form, which informs designated approvers via email
  3. approvers can send themselves a test push notification to see how it will be formatted on both Android and iOS
  4. approvers can approve or reject the request
  5. once approved, the request is executed


We decided to use S3 as the source for a send-batch-notifications function because it allows us to pass a large list of users (remember, the goal is to support sending push notifications to millions of users in a batch) without having to worry about pagination or limits on payload size.
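To make this concrete, here is a small sketch of how a function could locate the uploaded file, assuming it is triggered by an S3 object-created event (the trigger is an assumption; the post only says S3 is the source). The `getFileLocation` helper is a hypothetical name. The `Records[].s3` shape is the standard S3 event notification format, where object keys arrive URL-encoded with spaces written as `+`.

```javascript
// Extract the bucket and key of the uploaded file from an S3 event notification.
function getFileLocation(event) {
  const record = event.Records[0];
  return {
    bucket: record.s3.bucket.name,
    // keys are URL-encoded in the event, with spaces encoded as '+'
    key: decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '))
  };
}
```

Because the event only carries a pointer to the file, the size of the user list never counts against the invocation payload limit.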

The function will work with any JSON file in the right format, and that JSON file can be generated in many ways:

  • by the cron jobs that generate scheduled notifications
  • by the approval system after an ad-hoc push notification is approved
  • by the approval system to send a test push notification to the approvers (to visually inspect how the message will appear on both Android and iOS devices)
  • by members of the engineering team when manual interventions are required

We also considered moving to SNS but decided against it in the end because it doesn’t provide a useful enough abstraction to justify the effort to migrate (which involves client work) and the additional cost for sending push notifications. Instead, we used node-gcm and apn to communicate with GCM and APNs directly.

Recursive Functions FTW

At the time of writing, Lambda has a hard limit of 5 minutes of execution time per invocation (it might be relaxed in the near future), and that might not be enough time to send millions of push notifications.

Our approach to long-running tasks like this is to run the Lambda function as a recursive function.

A naive recursive function would process the payload in fixed-size batches and recurse at the end of each batch, passing along a token/position to allow the next invocation to continue from where it left off. In this particular case, we have additional considerations because the total number of work items can be very large:

  • minimising the number of recursions required, which equates to the number of Invoke requests to Lambda and carries a cost implication at scale
  • caching the content of the JSON file to improve performance (by avoiding loading and parsing a large JSON file more than once) and reduce S3 cost

To minimise the number of recursions, our function would:

  1. process the list of users in small batches of 500
  2. at the end of each batch, call context.getRemainingTimeInMillis() to check how much time is left in this invocation
  3. if there is more than 1 minute left in the invocation then process another batch; otherwise recurse
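The three steps above can be sketched roughly as follows. This is not the original Yubl code: `processUsers`, `sendBatch` and `recurse` are hypothetical names, and the two helpers are injected so the loop can be run without any AWS dependencies. Only `context.getRemainingTimeInMillis()` comes from the Lambda context object itself.

```javascript
const BATCH_SIZE = 500;
const MIN_REMAINING_MS = 60 * 1000; // recurse once 1 minute or less is left

// `sendBatch` and `recurse` are hypothetical helpers supplied by the caller:
//   sendBatch(users)  -> sends push notifications to one batch of users
//   recurse(position) -> invokes this function again, starting at `position`
// Returns the position reached by this invocation.
async function processUsers(users, position, context, sendBatch, recurse) {
  while (position < users.length) {
    const batch = users.slice(position, position + BATCH_SIZE);
    await sendBatch(batch);
    position += batch.length;

    // hand over to a fresh invocation if there is still work left
    // but not enough time to safely process another batch
    const hasMore = position < users.length;
    if (hasMore && context.getRemainingTimeInMillis() <= MIN_REMAINING_MS) {
      await recurse(position);
      return position;
    }
  }
  return position;
}
```

In a real function, `recurse` would probably call the Lambda Invoke API asynchronously with the file location and the current position as the payload, so that one Invoke request is made per invocation rather than per batch.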

When caching the content of the JSON file from S3, we also need to compare the ETag to ensure that the content of the file hasn’t changed.
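A minimal sketch of such a cache is shown below. It relies on the fact that module-level variables survive between invocations that reuse the same container. The `loadUsers` function and the `s3` wrapper with its `getETag` and `getBody` methods are hypothetical names for this example, and the sketch assumes the file parses to a JSON array of users, which the post does not specify.

```javascript
// Module-level state is kept between invocations that reuse the same container.
let cached = null; // { bucket, key, etag, users }

// `s3` is a hypothetical wrapper exposing two async functions:
//   getETag(bucket, key) -> the object's current ETag
//   getBody(bucket, key) -> { etag, body }, where body is the JSON text
async function loadUsers(s3, bucket, key) {
  const etag = await s3.getETag(bucket, key);
  const isSameFile =
    cached !== null &&
    cached.bucket === bucket &&
    cached.key === key &&
    cached.etag === etag;
  if (isSameFile) {
    return cached.users; // unchanged: skip the download and the JSON.parse
  }

  const fresh = await s3.getBody(bucket, key);
  cached = { bucket, key, etag: fresh.etag, users: JSON.parse(fresh.body) };
  return cached.users;
}
```

With the AWS SDK, the two wrapper methods could plausibly be backed by `headObject` and `getObject`, both of which return the object's ETag. The cache only helps when a recursive invocation lands on a container that has already loaded the file; otherwise the file is simply downloaded again.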

With this setup, the system was able to easily handle JSON files with more than 1 million users during our load test (sorry Apple and Google for sending all those fake device tokens :-P).
