Yubl’s road to Serverless – Part 5 – building better recommendations with Lambda, BigQuery and GrapheneDB

Note: see here for the rest of the series.

 

When I joined Yubl in April 2016, it had launched just 2 months earlier, after a long and chaotic development cycle that lasted more than 2 years – all the while there was a fully armed sales team before there was even a product!

Some seriously bad decisions happened at Yubl... and judging by Silicon Valley, this kind of decision making is far more common than we realised.

That said, many good things also happened at Yubl, and I had the pleasure to work with some of the best people I have met in my career. This post is about one of the ailing features we were able to quickly turn around with the power of AWS Lambda and using the right tool for the job.

Fans of Silicon Valley probably remember that scene from Season 3 when Richard and co walked into their shiny new Pied Piper office to find “Action” Jack Barker had hired an army of sales people before they even had a product.


A Broken Feature

Upon joining the company, I found out the app already had a Find People feature, although it didn’t do what I expected. The likes of Twitter and Facebook employ sophisticated algorithms to find people who share your interests. Our feature, on the other hand, would return the first 30 users in MongoDB that you weren’t already following, ordered by account creation time. For most users this list would equate to the first 30 Yubl employees that installed the app… talk about rigging the game!

One of the devs made a valiant attempt to improve the feature by returning only users who have shared connections with you – either you both follow X or you are both followed by X.

However, the implementation was a series of expensive (and complicated) MongoDB queries per user request. Ultimately it was an approach that would scale with neither throughput nor complexity, because it used the wrong tool for the job.

Lambda + GrapheneDB = Efficient Graph Queries

I had previously worked with Neo4j at Gamesys and used it to analyze and model the complex in-game economy of an MMORPG.

A graph database like Neo4j is the perfect place to store our social graph, and allows us to efficiently perform the kind of graph queries we need in order to find users you should follow, eg. 2nd/3rd degree connections.
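To make the idea of a 2nd degree connection query concrete, here is a minimal in-memory sketch. It is not the Neo4j query we ran; the `follows` map, the user names and the `suggestUsers` function are all made up for illustration, and a graph database would do the same traversal without loading the graph into memory.

```javascript
// Hypothetical in-memory social graph: user -> set of users they follow.
const follows = new Map([
  ['alice', new Set(['bob', 'carol'])],
  ['bob', new Set(['dave', 'erin'])],
  ['carol', new Set(['dave'])],
  ['dave', new Set()],
  ['erin', new Set()],
]);

// Suggest users followed by the people `userId` follows (2nd degree
// connections), ranked by how many of those people follow them.
function suggestUsers(graph, userId, limit = 30) {
  const following = graph.get(userId) || new Set();
  const counts = new Map();
  for (const followee of following) {
    for (const candidate of graph.get(followee) || []) {
      // Skip the user themselves and anyone they already follow.
      if (candidate === userId || following.has(candidate)) continue;
      counts.set(candidate, (counts.get(candidate) || 0) + 1);
    }
  }
  return [...counts.entries()]
    // Most shared connections first, ties broken alphabetically.
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit)
    .map(([candidate]) => candidate);
}

console.log(suggestUsers(follows, 'alice')); // [ 'dave', 'erin' ]
```

Here dave ranks first because two of the people alice follows also follow him, whereas erin is followed by only one.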

GrapheneDB offers hosted Neo4j database as a service, with built-in monitoring, dashboards, automated backup and scaling up. It was the perfect choice to get us going and start delivering value to our users quickly.

At this point in time we were already streaming all state changes in the system into Kinesis. To export all of our social graph into GrapheneDB and to keep it in sync with MongoDB we:

  1. ran a one-off task to export all the relationship data into GrapheneDB
  2. subscribed a Lambda function to the Relationship Kinesis stream to process any subsequent relationship changes and update the social graph (in GrapheneDB) in real time
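The second step can be sketched roughly as follows. Kinesis does deliver each payload base64-encoded under `record.kinesis.data`, but everything else here is an assumption: the event shape (`type`, `userId`, `otherUserId`) and the `graph` interface are hypothetical, and the in-memory graph merely stands in for GrapheneDB.

```javascript
// Decode the records of a Kinesis-triggered Lambda event into JSON payloads.
function decodeRecords(event) {
  return event.Records.map((record) =>
    JSON.parse(Buffer.from(record.kinesis.data, 'base64').toString('utf8')));
}

// Apply relationship changes to a graph store. `graph` is a hypothetical
// interface with follow/unfollow methods.
function applyChanges(graph, changes) {
  for (const change of changes) {
    if (change.type === 'followed') {
      graph.follow(change.userId, change.otherUserId);
    } else if (change.type === 'unfollowed') {
      graph.unfollow(change.userId, change.otherUserId);
    }
    // Any other event type is ignored in this sketch.
  }
}

// In-memory stand-in for the graph database, for illustration only.
function createInMemoryGraph() {
  const edges = new Set();
  return {
    follow: (a, b) => edges.add(`${a}->${b}`),
    unfollow: (a, b) => edges.delete(`${a}->${b}`),
    has: (a, b) => edges.has(`${a}->${b}`),
  };
}
```

A Lambda handler would then boil down to calling `applyChanges(graph, decodeRecords(event))` for each batch of records it receives.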

We then exposed the data via API Gateway and Lambda so that the client app and other internal services can use it to easily find suggested users for a user to follow.

Future Plans

Given the limitation that Neo4j requires all of your graph to be stored on one machine (and it has pretty taxing hardware requirements too), it was not the long term solution for us.

Based on my estimates, the biggest instance available on GrapheneDB would suffice until we have more than 10M users. It was calculated based on the average no. of connections per user in our platform and using Twitter’s user stats as a guideline for where we might be at 10M users.

We could push that ceiling much further by moving to a batch model and preprocessing recommendations for each user, which reduces the no. of live queries against a large graph. Recommendations would be recomputed for active users only (eg. users that have logged in in the last X days), and only when:

  • the recommendations are stale, ie. not acted upon by the user for more than X days so they might not be what the user wants; or
  • the user’s extended social graph has changed, ie. followers/followees have new connections
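The conditions above could be captured in a small predicate that a batch job runs per user. This is only a sketch of an idea we never built: the field names (`lastLoginAt`, `lastActedOnAt`, `graphChangedAt`, `recommendationsComputedAt`) are hypothetical, and the two thresholds are left as parameters because we never settled on values for X.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Decide whether a user's recommendations should be recomputed in this batch.
// All timestamps are epoch milliseconds.
function shouldRecompute(user, now, { activeDays, staleDays }) {
  // Only bother with users who have logged in recently.
  const isActive = now - user.lastLoginAt <= activeDays * DAY_MS;
  if (!isActive) return false;

  // Stale: the user has not acted on their recommendations for too long.
  const isStale = now - user.lastActedOnAt > staleDays * DAY_MS;

  // The extended social graph changed after the last computation.
  const graphChanged = user.graphChangedAt > user.recommendationsComputedAt;

  return isStale || graphChanged;
}
```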

From what I was able to gather, all the big social networks use a batch model for scalability and cost reasons.

As for a long term solution, we hadn’t settled on anything. I looked at Facebook’s Giraph briefly but it’s far more sophisticated than we were ready for. There are other “fantasy” ideas like the Mosaic system described in this paper. It would have been a fantastic challenge had we got that far.

Finding Trending Users

Because we were still a small social network, with just over 800K installs, it was not sufficient to make recommendations based on a user’s social graph alone, as most users had a pretty small social graph.

To bridge the gap we decided to also include trending users on the platform in your recommendations.

Thankfully, all of our events (eg. X followed Y, X liked Y’s post, etc.) are streamed into Google BigQuery. We chose BigQuery because AWS Athena hadn’t been announced yet and Redshift is not the right model for making ad-hoc, live queries that need to respond quickly. Also, I had many years of experience using BigQuery at Gamesys so it was a no-brainer at the time.

ps. if you’re curious about the difference between Athena and BigQuery, Lynn Langit gave a comprehensive comparison at Serverless Austin this year.

To find trending users, we worked with the product team to create a formula to calculate a user’s “trendiness” based on no. of new followers in the last 24 hours. The follower count is weighted exponentially by how recently the user was followed. For instance, a follower that followed you in the past hour gives you a score of 1, but a follower that followed you 3 hours ago would only earn you a score of 0.1.
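The real calculation was a BigQuery query, but the weighting can be sketched in a few lines. The decay curve below is an assumption chosen only to fit the two data points mentioned above (a weight of 1 within the past hour, 0.1 at 3 hours); the actual formula we agreed with the product team may have looked different.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Weight of a single follow: 1 when it happened within the past hour, decaying
// exponentially to 0.1 for a follow that is 3 hours old (assumed curve).
function followWeight(followedAt, now) {
  const ageHours = Math.floor((now - followedAt) / HOUR_MS);
  return Math.pow(0.1, ageHours / 3);
}

// Sum the weights of all follows received in the last 24 hours.
// Timestamps are epoch milliseconds.
function trendiness(followTimestamps, now) {
  return followTimestamps
    .filter((t) => t <= now && now - t < 24 * HOUR_MS)
    .reduce((score, t) => score + followWeight(t, now), 0);
}
```

With this curve, ten followers gained 3 hours ago count for about as much as a single follower gained in the past hour, which is what pushes currently popular users to the top.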

We created a cron job with CloudWatch Events and Lambda to perform the aforementioned query against BigQuery every 3 hours. To save on cost, our query would only process events that were inserted in the last 24 hours.

The results are then saved into a DynamoDB table, which is overwritten at the end of each run.

Once again, we exposed the data via API Gateway and Lambda.

Migration to new APIs

Now, we have 2 new APIs to provide live suggestions based on a user’s social graph, and to find users who are currently trending on our platform.

However, the client apps would need to be updated to take advantage of these new APIs. Instead of waiting for the client teams to catch up, we updated the legacy API’s suggestion endpoint to use results from both so we can provide value to our users earlier.
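Combining the two sources in the legacy endpoint is essentially a merge with de-duplication. The sketch below assumes both APIs return ordered lists of user IDs, that social graph suggestions should come first, and that the endpoint still returns at most 30 users; all three are assumptions for illustration rather than a record of what the legacy API did.

```javascript
// Merge social graph suggestions with trending users, keeping the first
// occurrence of each user ID and capping the result at `limit`.
function mergeSuggestions(graphSuggestions, trendingUsers, limit = 30) {
  const seen = new Set();
  const merged = [];
  for (const userId of [...graphSuggestions, ...trendingUsers]) {
    if (merged.length >= limit) break;
    if (seen.has(userId)) continue;
    seen.add(userId);
    merged.push(userId);
  }
  return merged;
}
```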

“The lead time to someone saying thank you is the only reputation metric that matters.”

– Dan North

This is how it looks when we put everything together:

One of the most satisfying aspects of this work was how quickly we were able to turn this feature around and deploy the new system into production. Everything came together in less than 2 weeks, largely because we were able to focus on our business needs and let services such as Lambda, BigQuery and GrapheneDB deal with the undifferentiated efforts.

Serverless 1.X – enable API Gateway caching on request parameters

Having previously blogged about the untrodden path to enable caching on API Gateway request parameters in the Serverless framework 0.5.X, it’s a little disappointing that it’s still not officially fixed in the 1.X versions…

The Problem

The problem is two-fold:

  1. there’s currently no way to specify caching should be enabled for path & query string parameters
  2. the CloudFormation template Serverless 1.X generates for API Gateway is missing a few optional fields, and these missing fields stop you from manually enabling caching in the API Gateway management console too

After you deploy your Lambda function with associated API, if you go to the management console and enable caching on path or request parameters you will get an error saying “Invalid cache key parameter specified”.

The Workaround

A friend pointed me to a neat trick to modify the CloudFormation template that Serverless 1.X auto-generates for you.

After the project is deployed, you can go to CloudFormation and view the template that Serverless has generated. These templates are pretty big (and poorly formatted), so I find it easier to open them up in the Designer view and use that view to navigate to the endpoint I’m looking for.

Once you find the resource template for the endpoint, write down its name. Now go back to the serverless.yml file in your project, and add the resource name to the resources section at the bottom. You only need to include fields that you want to update or add to the template.

The CloudFormation syntax for an API Gateway method looks like this:

We also need to fill in some blanks for the Integration section:

For more details on the CloudFormation syntax, see here and here.

After some trial-and-error, the minimum set of fields I had to add are:

Redeploy with Serverless and the path parameter is enabled for caching:

Wrap Up

I hope you have found this post useful. During my research I was surprised by the lack of information out there, and by the lack of official support from the Serverless framework.

If you know of a better way to do this, please let me know in the comments.

AWS Lambda – build yourself a URL shortener in 2 hours

An interesting requirement came up at work this week where we discussed potentially having to run our own URL Shortener because the Universal Links mechanism (in iOS 9 and above) requires a JSON manifest at

https://domain.com/apple-app-site-association

Since the OS doesn’t follow redirects this manifest has to be hosted on the URL shortener’s root domain.

Owing to a limitation in AppsFlyer, it’s currently not able to shorten links when you have Universal Links configured for your app. Whilst we could switch to another vendor, it would mean more work for our (already stretched) client devs, and we really like AppsFlyer’s support for attributions.

Which brings us back to the question

“should we build a URL shortener?”

swiftly followed by

“how hard can it be to build a scalable URL shortener in 2017?”

Well, turns out it wasn’t hard at all.

Lambda FTW

For this URL shortener we’ll need several things:

  1. a GET /{shortUrl} endpoint that will redirect you to the original URL
  2. a POST / endpoint that will accept an original URL and return the shortened URL
  3. an index.html page where someone can easily create short URLs
  4. a GET /apple-app-site-association endpoint that serves a static JSON response

all of which can be accomplished with API Gateway + Lambda.

Overall, this is the project structure I ended up with:

  • using the Serverless framework’s aws-nodejs template
  • each of the above endpoints has a corresponding handler function
  • the index.html file is in the static folder
  • the test cases are written in such a way that they can be used both as integration tests and as acceptance tests
  • there’s a build.sh script which facilitates running
    • integration tests, eg ./build.sh int-test {env} {region} {aws_profile}
    • acceptance tests, eg ./build.sh acceptance-test {env} {region} {aws_profile}
    • deployment, eg ./build.sh deploy {env} {region} {aws_profile}

GET /apple-app-site-association endpoint

Seeing as this is a static JSON blob, it makes sense to precompute the HTTP response and return it every time.
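Something along these lines would do, using the Lambda proxy response shape (`statusCode`, `headers`, `body`). The manifest below is a placeholder: the app ID and paths are made up, so substitute whatever your own app needs.

```javascript
// Placeholder manifest; the app ID and paths are made up for illustration.
const manifest = {
  applinks: {
    apps: [],
    details: [{ appID: 'TEAMID.com.example.app', paths: ['*'] }],
  },
};

// Built once when the module loads, then reused by every invocation.
const response = {
  statusCode: 200,
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(manifest),
};

// Lambda handler: always returns the same precomputed response.
function handler(event, context, callback) {
  callback(null, response);
}
```

Because the response object lives outside the handler, the JSON is serialized only once per container rather than once per request.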

POST / endpoint

For an algorithm to shorten URLs, you can find a very simple and elegant solution on StackOverflow. All you need is an auto-incremented ID, like the ones you normally get with RDBMS.
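Assuming the approach is the usual one of converting the numeric ID to base 62 and back, a minimal sketch looks like this. The alphabet order and the function names are my own choices here, not necessarily those of the StackOverflow answer.

```javascript
const ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
const BASE = ALPHABET.length; // 62

// Convert a non-negative integer ID into its base 62 short code.
function encode(id) {
  if (!Number.isInteger(id) || id < 0) {
    throw new RangeError('id must be a non-negative integer');
  }
  if (id === 0) return ALPHABET[0];
  let shortUrl = '';
  while (id > 0) {
    shortUrl = ALPHABET[id % BASE] + shortUrl;
    id = Math.floor(id / BASE);
  }
  return shortUrl;
}

// Convert a short code back into the integer ID it was generated from.
function decode(shortUrl) {
  let id = 0;
  for (const ch of shortUrl) {
    const digit = ALPHABET.indexOf(ch);
    if (digit === -1) throw new RangeError(`invalid character: ${ch}`);
    id = id * BASE + digit;
  }
  return id;
}

console.log(encode(125)); // 21
console.log(decode('21')); // 125
```

Since the mapping is reversible, only the ID and the original URL need to be stored; the short code can always be derived from the ID.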

However, I find DynamoDB a more appropriate DB choice here because:

  • it’s a managed service, so no infrastructure for me to worry about
  • OPEX over CAPEX, man!
  • I can scale reads & writes throughput elastically to match utilization level and handle any spikes in traffic

But DynamoDB has no concept of an auto-incremented ID, which the algorithm needs. Instead, you can use an atomic counter to simulate an auto-incremented ID (at the expense of an extra write-unit per request).
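An atomic counter is just an UpdateItem call with an ADD expression that returns the new value. Here is a sketch using the parameter shape of the AWS SDK v2 DocumentClient; the table layout (a `name` key and a `value` attribute) is hypothetical, and `docClient` is assumed to be a DocumentClient instance passed in by the caller.

```javascript
// Build the UpdateItem parameters that atomically increment a counter item.
// ADD treats a missing attribute as 0, so the first call returns 1.
function buildCounterUpdate(tableName, counterName) {
  return {
    TableName: tableName,
    Key: { name: counterName },
    UpdateExpression: 'ADD #value :increment',
    // `value` is a DynamoDB reserved word, hence the #value alias.
    ExpressionAttributeNames: { '#value': 'value' },
    ExpressionAttributeValues: { ':increment': 1 },
    ReturnValues: 'UPDATED_NEW',
  };
}

// Fetch the next ID. `docClient` is assumed to be an AWS SDK v2
// DynamoDB.DocumentClient (or anything with the same update().promise() API).
async function nextId(docClient, tableName, counterName) {
  const params = buildCounterUpdate(tableName, counterName);
  const result = await docClient.update(params).promise();
  return result.Attributes.value;
}
```

The returned number can then be turned into the short code and stored alongside the original URL, which is where the extra write-unit per request comes from.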

GET /{shortUrl} endpoint

Once we have the mapping in a DynamoDB table, the redirect endpoint is a simple matter of fetching the original URL and returning it as part of the Location header.

Oh, and don’t forget to return the appropriate HTTP status code, in this case a 308 Permanent Redirect.
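The response itself is tiny. In this sketch the DynamoDB lookup is left out and its result is passed in as `longUrl`; returning a 404 for an unknown short URL is my assumption, as the post doesn’t say what should happen in that case.

```javascript
// Build the Lambda proxy response for a redirect. `longUrl` is the result of
// looking up the short URL, or undefined when no mapping exists.
function buildRedirectResponse(longUrl) {
  if (!longUrl) {
    // Assumed behaviour for unknown short URLs.
    return {
      statusCode: 404,
      headers: { 'Content-Type': 'text/plain' },
      body: 'Not Found',
    };
  }
  return {
    statusCode: 308, // Permanent Redirect
    headers: { Location: longUrl },
    body: '',
  };
}
```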

GET / index page

Finally, for the index page, we’ll need to return some HTML instead (and a different content-type to go with the HTML).

I decided to put the HTML file in a static folder, which is loaded and cached the first time the function is invoked.
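The load-once-then-cache pattern can be sketched like this. To keep the example self-contained the file read is injected as `loadHtml`; the factory function and its name are hypothetical, not how the project was actually structured.

```javascript
// Create a handler that loads the page on first invocation and serves the
// cached copy afterwards. `loadHtml` returns the page contents as a string.
function createIndexHandler(loadHtml) {
  let cachedHtml = null;
  return function handler(event, context, callback) {
    if (cachedHtml === null) {
      cachedHtml = loadHtml(); // only runs on the first invocation
    }
    callback(null, {
      statusCode: 200,
      headers: { 'Content-Type': 'text/html; charset=UTF-8' },
      body: cachedHtml,
    });
  };
}
```

In a Lambda function `loadHtml` could be something like `() => fs.readFileSync(path.join(__dirname, 'static', 'index.html'), 'utf8')`, so the file is read once per container and then served from memory.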

Getting ready for production

Fortunately I have had plenty of practice getting Lambda functions to production readiness, and for this URL shortener we will need to:

  • configure auto-scaling parameters for the DynamoDB table (we have an internal system for managing the auto-scaling side of things)
  • turn on caching in API Gateway for the production stage

Future Improvements

If you put in the same URL multiple times you’ll get back different short-urls. One optimization (for storage and caching) would be to return the same short-url instead.

To accomplish this, you can:

  1. add a GSI to the DynamoDB table on the longUrl attribute to support efficient reverse lookup
  2. in the shortenUrl function, perform a Query against the GSI to find existing short url(s)

I think it’s better to add a GSI than to create a new table here because it avoids having “transactions” that span across multiple tables.
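For reference, a reverse lookup against a GSI is a Query that names the index. The sketch below only builds the parameters (in the AWS SDK v2 DocumentClient shape); the table and index names are whatever you choose, and only the `longUrl` attribute comes from the description above.

```javascript
// Build Query parameters for finding an existing short URL by its long URL.
function buildReverseLookupQuery(tableName, indexName, longUrl) {
  return {
    TableName: tableName,
    IndexName: indexName,
    KeyConditionExpression: 'longUrl = :longUrl',
    ExpressionAttributeValues: { ':longUrl': longUrl },
    Limit: 1, // one existing mapping is enough
  };
}
```

Bear in mind that reads from a GSI are eventually consistent, so two requests for the same URL in quick succession could still produce two short-urls.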

Useful Links

Slides and recording of my Lambda talk at LeetSpeak 2016

Serverless – enable caching on query string parameters in API Gateway

Since I started working at Yubl less than 2 weeks ago, I have been doing a lot of work with Amazon API Gateway & Lambda with the help of the Serverless framework. So far that experience has been really great.

One little caveat I ran into was that it wasn’t clear how to enable caching on query string and request path parameters. For instance, if I declare in my s-function.json file that my API Gateway endpoint has a query string parameter called query:

"requestParameters": {
    "integration.request.querystring.query": "method.request.querystring.query"
},

When the endpoint is deployed, you will see in the API Gateway console that ‘Caching’ is not enabled:

Which means when I enable caching on my deployed stage, the query string parameter would not be taken into account.

Users visiting the following URLs would get the same cached response back:

    https://my-awesome-service.com/yubl?query=one-awesome-yubl

    https://my-awesome-service.com/yubl?query=another-awesome-yubl

That’s obviously not OK.
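A toy model shows why this happens. This is not how API Gateway is implemented, just an illustration of a cache whose key includes only the parameters that have been declared as cache key parameters; the `cacheKey` function is made up for this example.

```javascript
// Toy cache key: the path plus only those query parameters that were
// declared as cache key parameters.
function cacheKey(path, query, cacheKeyParameters) {
  const parts = cacheKeyParameters.map((name) => `${name}=${query[name]}`);
  return [path, ...parts].join('|');
}

const one = { query: 'one-awesome-yubl' };
const another = { query: 'another-awesome-yubl' };

// No parameters declared: both requests share one cache entry.
console.log(cacheKey('/yubl', one, []) === cacheKey('/yubl', another, [])); // true

// `query` declared: each request gets its own cache entry.
console.log(cacheKey('/yubl', one, ['query']) === cacheKey('/yubl', another, ['query'])); // false
```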

Sadly, the same applied to request path parameters too:

"path": "yubl/{query}",
"method": "GET",
"type": "AWS",

Thankfully, the guys working on Serverless have been very active in responding on their issues page, and one of the team members picked up my ticket and offered some insight on how to make this work.

API Gateway’s own documentation alludes to two Integration parameters called cacheNamespace and cacheKeyParameters. Although, to say they’re poorly documented would be an understatement…

So, after some fiddling around:

  1. create an API with cached query string param
  2. deploy it
  3. export as Swagger + API Gateway Extensions
  4. inspect the JSON to see what values are output for cacheNamespace and cacheKeyParameters

Interestingly, when I put the above value – “method.input.params.query” – in my s-function.json I get the following error during deployment:

Invalid cache key parameter specified

What did work, though, was the same value I had specified in my requestParameters dictionary, which is “method.request.querystring.query” in this case, ie.

"requestParameters": {
    "integration.request.querystring.query": "method.request.querystring.query"
},
"cacheKeyParameters": [
    "method.request.querystring.query"
]

However, applying the same approach to request path parameters proved fruitless... I guess we’ll just have to wait for official documentation to be updated to see how it should work, which should be soon by the sound of things!

UPDATE 17/08/2016 : shortly after I posted this I did find a way to get request path parameters working too (which might have been a result of a Serverless update, though I can’t remember now). For request path parameters you need something along the lines of:

"requestParameters": {
    "integration.request.path.otherUserId": "method.request.path.id"
},
"cacheKeyParameters": [
    "method.request.path.id"
]