Yubl’s road to Serverless — building better recommendations with Lambda, BigQuery and GrapheneDB

part 1: overview

part 2: testing and CI/CD

part 3: ops

part 4: building a scalable notification system

part 5: building a better recommendation system <- you’re here


When I joined Yubl in April 2016, it had launched just 2 months earlier, after a long and chaotic development cycle that lasted more than 2 years — all the while there was a fully armed sales team before there was even a product!

Some seriously bad decisions happened at Yubl… and judging by Silicon Valley, this kind of decision making is far more common than we realised.

That said, many good things also happened at Yubl, and I had the pleasure of working with some of the best people I have met in my career. This post is about one of the ailing features we were able to quickly turn around with the power of AWS Lambda and by using the right tool for the job.


Fans of Silicon Valley probably remember the scene from Season 3 when Richard and co. walked into their shiny new Pied Piper office to find that “Action” Jack Barker had hired an army of sales people before they even had a product.


A Broken Feature

Upon joining the company, I found out the app already had a Find People feature, although it didn’t do what I expected. The likes of Twitter and Facebook employ sophisticated algorithms to find people who share interests with you. Our feature, on the other hand, returned the first 30 users in MongoDB that you weren’t already following, ordered by account creation time. For most users this list equated to the first 30 Yubl employees who installed the app… talk about rigging the game!

One of the devs made a valiant attempt to improve the feature by returning only users who have shared connections with you — either you both follow X or you are both followed by X.

However, the implementation was a series of expensive (and complicated) MongoDB queries per user request. Ultimately it was an approach that would not scale with either throughput or complexity, as it used the wrong tool for the job.

Lambda + GrapheneDB = Efficient Graph Queries

I had previously worked with Neo4j at Gamesys, where I used it to analyze and model the complex in-game economy of an MMORPG.

A graph database like Neo4j is the perfect place to store our social graph, as it allows us to efficiently perform the kind of graph queries we need in order to find users you should follow, e.g. 2nd/3rd degree connections.
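
To make that concrete, here is a minimal sketch of the kind of query involved, issued through the official Neo4j Python driver. The User label, FOLLOWS relationship type and property names are illustrative assumptions rather than our actual schema:

```python
from neo4j import GraphDatabase

# Suggest 2nd degree connections: users followed by people you follow,
# whom you don't already follow, ranked by the number of mutual connections.
SUGGEST_QUERY = """
MATCH (me:User {id: $user_id})-[:FOLLOWS]->(:User)-[:FOLLOWS]->(candidate:User)
WHERE candidate.id <> $user_id
  AND NOT (me)-[:FOLLOWS]->(candidate)
RETURN candidate.id AS user_id, count(*) AS mutual_connections
ORDER BY mutual_connections DESC
LIMIT 30
"""

def suggest_users(driver, user_id):
    with driver.session() as session:
        result = session.run(SUGGEST_QUERY, user_id=user_id)
        return [record["user_id"] for record in result]

# The bolt URI and credentials come from your GrapheneDB instance;
# these are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
print(suggest_users(driver, "some-user-id"))
```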

GrapheneDB offers hosted Neo4j as a service, with built-in monitoring, dashboards, automated backups and easy scaling. It was the perfect choice to get us going and start delivering value to our users quickly.

At this point we were already streaming all state changes in the system into Kinesis. To export all of our social graph into GrapheneDB and keep it in sync with MongoDB, we:

  1. ran a one-off task to export all the relationship data into GrapheneDB
  2. subscribed a Lambda function to the Relationship Kinesis stream to process any subsequent relationship changes and update the social graph (in GrapheneDB) in real time (a sketch of such a handler follows below)
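
As an illustration of step 2, the handler might look something like the following sketch. The event payload shape (follow/unfollow events carrying follower and followee IDs) and the Cypher statements are assumptions; the real schema and configuration were specific to Yubl:

```python
import base64
import json

from neo4j import GraphDatabase

# Hypothetical connection details; in practice these came from configuration.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

FOLLOW = """
MERGE (a:User {id: $follower_id})
MERGE (b:User {id: $followee_id})
MERGE (a)-[:FOLLOWS]->(b)
"""

UNFOLLOW = """
MATCH (:User {id: $follower_id})-[r:FOLLOWS]->(:User {id: $followee_id})
DELETE r
"""

def handler(event, context):
    # Kinesis delivers record payloads base64-encoded.
    with driver.session() as session:
        for record in event["Records"]:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            query = FOLLOW if payload["type"] == "follow" else UNFOLLOW
            session.run(
                query,
                follower_id=payload["follower_id"],
                followee_id=payload["followee_id"],
            )
```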

We then exposed the data via API Gateway and Lambda so that the client app and other internal services could use it to easily find suggested users for a user to follow.

Future Plans

Given the limitation that Neo4j requires your entire graph to be stored on one machine (and it has pretty taxing hardware requirements too), it was not the long-term solution for us.

Based on my estimates, the biggest instance available on GrapheneDB would suffice until we had more than 10M users. This was calculated from the average number of connections per user on our platform, using Twitter’s user stats as a guideline for where we might be at 10M users.

We can push that ceiling much further by moving to a batch model and preprocessing recommendations for each user, to reduce the number of live queries against a large graph. The recommendations can be restricted to active users only (e.g. users that have logged in within the last X days), and recomputed only when (as sketched after this list):

  • the recommendations are stale, i.e. not acted upon by the user for more than X days, so they might not be what the user wants; or
  • the user’s extended social graph has changed, i.e. followers/followees have new connections
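
As a rough sketch of those refresh criteria (the thresholds and field names are illustrative assumptions, since only “X days” was ever specified):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

ACTIVE_WINDOW = timedelta(days=30)  # assumed value for "logged in in the last X days"
STALE_AFTER = timedelta(days=7)     # assumed value for "not acted upon for more than X days"

@dataclass
class RecommendationState:
    last_login: datetime
    last_acted_on: datetime              # when the user last acted on their recommendations
    graph_changed_since_last_run: bool   # did followers/followees gain new connections?

def needs_refresh(state: RecommendationState, now: datetime) -> bool:
    # Only precompute recommendations for active users.
    if now - state.last_login > ACTIVE_WINDOW:
        return False
    # Refresh if the current recommendations have gone stale...
    if now - state.last_acted_on > STALE_AFTER:
        return True
    # ...or if the user's extended social graph has changed.
    return state.graph_changed_since_last_run
```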

From what I was able to gather, all the big social networks use a batch model for scalability and cost reasons.

As for a long-term solution, we hadn’t settled on anything. I looked briefly at Facebook’s Giraph, but it was far more sophisticated than we were ready for. There were other “fantasy” ideas, like the Mosaic system described in this paper. It would have been a fantastic challenge had we got that far.

Finding Trending Users

Because we were still a small social network — with just over 800K installs — it wasn’t sufficient to make recommendations based on a user’s social graph alone, as most users had a pretty small social graph.

To bridge the gap, we decided to also include trending users on the platform in your recommendations.

Thankfully, all of our events (e.g. X followed Y, X liked Y’s post, etc.) were streamed into Google BigQuery. We chose BigQuery because AWS Athena hadn’t been announced yet, and Redshift is not the right model for ad-hoc, live queries that need to respond quickly. Also, I had many years of experience using BigQuery at Gamesys, so it was a no-brainer at the time.

p.s. if you’re curious about the differences between Athena and BigQuery, Lynn Langit gave a comprehensive comparison at Serverless Austin this year.

To find trending users, we worked with the product team to create a formula to calculate a user’s “trendiness”, based on the number of new followers gained in the last 24 hours. Each follower’s contribution is weighted exponentially by how recently they followed the user. For instance, a follower that followed you in the past hour gives you a score of 1, but a follower that followed you 3 hours ago would only earn you a score of 0.1.
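
Those two examples pin the curve down at only two data points; one decay consistent with them (an illustrative assumption rather than our exact production formula) is a 10x drop for every 2 hours beyond the first, capped at 1:

```python
def follow_score(hours_ago: float) -> float:
    # Fitted to the two examples above: a follow within the last hour
    # scores 1.0 and a follow 3 hours ago scores 0.1. The exact curve is
    # an assumption; any exponential decay through these anchors would do.
    return min(1.0, 0.1 ** ((hours_ago - 1) / 2))

def trendiness(follow_ages_hours: list[float]) -> float:
    # A user's trendiness: the sum of decayed scores over all new
    # followers gained in the last 24 hours.
    return sum(follow_score(t) for t in follow_ages_hours if t <= 24)

print(follow_score(0.5))  # 1.0
print(follow_score(3.0))  # 0.1
```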

We created a cron job with CloudWatch Events and Lambda to perform the aforementioned query against BigQuery every 3 hours. To save on cost, the query only processed events inserted in the last 24 hours.

The results were then saved into a DynamoDB table, which was overwritten at the end of each run.
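
Putting the last few paragraphs together, the scheduled function might look roughly like this sketch. The BigQuery table, event schema and DynamoDB table name are hypothetical, and the decay expression mirrors the scoring sketch above:

```python
from decimal import Decimal

import boto3
from google.cloud import bigquery

# Hypothetical project/dataset/table and schema, for illustration only.
TRENDING_QUERY = """
SELECT followee_id,
       SUM(LEAST(1.0, POW(0.1,
           (TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), created_at, MINUTE) / 60.0 - 1) / 2))) AS score
FROM `my-project.events.follows`
WHERE created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
GROUP BY followee_id
ORDER BY score DESC
LIMIT 100
"""

def handler(event, context):
    # Triggered every 3 hours by a CloudWatch Events schedule rule.
    rows = bigquery.Client().query(TRENDING_QUERY).result()
    table = boto3.resource("dynamodb").Table("trending-users")
    # Items keyed by user_id are replaced on each run; a full implementation
    # would also purge entries that dropped out of the top 100.
    with table.batch_writer() as batch:
        for rank, row in enumerate(rows, start=1):
            batch.put_item(Item={
                "user_id": row.followee_id,
                "rank": rank,
                "score": Decimal(str(row.score)),
            })
```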

Once again, we exposed the data via API Gateway and Lambda.

Migration to new APIs

Now we have 2 new APIs: one providing live suggestions based on a user’s social graph, and one finding users who are currently trending on our platform.

However, the client apps would need to be updated to take advantage of these new APIs. Instead of waiting for the client teams to catch up, we updated the legacy API’s suggestion endpoint to use results from both, so we could provide value to our users earlier.
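
Conceptually, the updated legacy endpoint combined the two sources along these lines (the two fetch helpers are hypothetical stand-ins for calls to the new APIs, and the graph-first ordering is one plausible choice):

```python
def fetch_graph_suggestions(user_id):
    # Stub: in the real system this called the new graph-based suggestions API.
    return []

def fetch_trending_users():
    # Stub: in the real system this called the new trending-users API.
    return []

def legacy_suggestions(user_id, limit=30):
    # Prefer personalised suggestions from the social graph...
    suggestions = fetch_graph_suggestions(user_id)[:limit]
    seen = set(suggestions) | {user_id}
    # ...then top up with trending users, skipping duplicates.
    for candidate in fetch_trending_users():
        if len(suggestions) >= limit:
            break
        if candidate not in seen:
            suggestions.append(candidate)
            seen.add(candidate)
    return suggestions
```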

“The lead time to someone saying thank you is the only reputation metric that matters.”

- Dan North

This is how it looks when we put everything together:

One of the most satisfying aspects of this work was how quickly we were able to turn the feature around and deploy the new system into production. Everything came together in less than 2 weeks, largely because we were able to focus on our business needs and let services such as Lambda, BigQuery and GrapheneDB deal with the undifferentiated effort.

Like what you’re reading? Check out my video course Production-Ready Serverless and learn the essentials of how to run a serverless application in production.

We will cover topics including:

  • authentication & authorization with API Gateway & Cognito
  • testing & running functions locally
  • CI/CD
  • log aggregation
  • monitoring best practices
  • distributed tracing with X-Ray
  • tracking correlation IDs
  • performance & cost optimization
  • error handling
  • config management
  • canary deployment
  • VPC
  • security
  • leading practices for Lambda, Kinesis, and API Gateway

You can also get 40% off the face price with the code ytcui. Hurry though, this discount is only available while we’re in Manning’s Early Access Program (MEAP).