Running and debugging AWS Lambda functions locally with the Serverless framework and VS Code

One of the complaints developers often have about AWS Lambda is the inability to run and debug functions locally. For Node.js at least, the Serverless framework and VS Code provide a good solution for doing just that.

An often underused feature of the Serverless framework is the invoke local command, which runs your code locally by emulating the AWS Lambda environment. Granted, it’s not a perfect simulation and it only works with Node.js and Python, but it has been good enough for most of my local development needs.

With VS Code, you have the ability to debug Node.js applications, including the option to launch an external program.

Put the two together and you have the ability to locally run and debug your Lambda functions.

Step 1: install the Serverless framework as a dev dependency

In general, it’s a good idea to install the Serverless framework as a dev dependency in a project because:

  1. it allows other developers (and the CI server) to use the Serverless framework for deployment without having to install it themselves
  2. it prevents incompatibility issues between a globally installed version of the Serverless framework and the version that the project’s serverless.yml was written for
  3. since Serverless v1.16.0, dev dependencies are excluded from the deployment package, so it doesn’t add to your deployment size (this is broken in the current version, v1.18.0, but should be fixed shortly)
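To add it, run the following in the project root:

npm install --save-dev serverless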

Step 2: add a debug configuration

Add a debug configuration that invokes the “sls invoke local” CLI command against the “hello” function, with an empty object {} as input. It’s also possible to invoke the function with a JSON file; see the doc here.
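Here is a minimal sketch of what that launch.json configuration might look like – the function name “hello” and the AWS_PROFILE value are illustrative assumptions (VS Code’s launch.json accepts // comments):

{
  "version": "0.2.0",
  "configurations": [
    {
      "type": "node",
      "request": "launch",
      "name": "sls invoke local hello",
      // run the Serverless framework we installed as a dev dependency in step 1
      "program": "${workspaceRoot}/node_modules/serverless/bin/serverless",
      // invoke the "hello" function with an empty object {} as input
      "args": ["invoke", "local", "-f", "hello", "-d", "{}"],
      "cwd": "${workspaceRoot}",
      "env": {
        // hypothetical profile - see the notes at the end of this post
        "AWS_PROFILE": "dev"
      }
    }
  ]
}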

Step 3: enjoy!

There, nice and easy :-)

Couple of things to note:

  • if your function depends on environment variables, you can set those up in the launch.json config file from step 2
  • if your function needs to access other AWS resources, you also need to set up the relevant environment variables (e.g. AWS_PROFILE) for the aws-sdk to access those resources in the correct AWS account
  • this approach will not work for recursive functions – the recursion will happen on the deployed Lambda function, so you won’t be able to debug it locally

I’m running a live course on designing serverless architecture with AWS Lambda

Hi everyone, just a quick note to let you know that I’m running a live online course with O’Reilly on designing serverless architectures with AWS Lambda.

It’s a 2-day course on September 11-12th with 6 hours in total, and it’s available for free if you have a subscription with SafariBooksOnline. Registration for the course is open till September 7th, so if there are still spaces available then you can even sign up for a free 10-day trial on SafariBooksOnline just before registration closes and get the course for free.

Sign up here.

The course will cover a variety of topics over the two days:

  • AWS Lambda basics
  • the Serverless framework
  • testing strategies
  • CI/CD
  • centralised logging
  • distributed tracing
  • monitoring
  • performance considerations including cold starts
  • config management
  • Lambda in VPC
  • security
  • best practices with API Gateway and Kinesis
  • step functions
  • several design patterns with Lambda

Slides for my serverless security talk

Applying the Saga pattern with AWS Lambda and Step Functions

The Saga pattern is an approach to managing failures, in which each action has a compensating action for rollback.

In Hector Garcia-Molina and Kenneth Salem’s 1987 paper, it is described as an approach to handling system failures in long-running transactions.

It has become increasingly relevant in the world of microservices as application logic often needs to transact across multiple bounded contexts – each encapsulated by its own microservice with independent databases. Caitie McCaffrey gave a good talk on using the Saga pattern in distributed systems, which you can watch here.

Using Caitie’s example from her talk, suppose we have a transaction that goes something like this:

Begin transaction
    Start book hotel request
    End book hotel request
    Start book flight request
    End book flight request
    Start book car rental request
    End book car rental request
End transaction

We can model each of the actions (and their compensating actions) with a Lambda function, and use a state machine in Step Functions as the coordinator for the saga.

Because the compensating actions can also fail, we need to be able to retry them until they succeed, which means they have to be idempotent.

In the example below, we’ll implement backward recovery in the event of a failure.

Each Lambda function expects the input to be in the following shape:

{
  "trip_id": "5c12d94a-ee6a-40d9-889b-1d49142248b7",
  "depart": "London",
  "depart_at": "2017-07-10T06:00:00.000Z",
  "arrive": "Dublin",
  "arrive_at": "2017-07-12T08:00:00.000Z",
  "hotel": "holiday inn",
  "check_in": "2017-07-10T12:00:00.000Z",
  "check_out": "2017-07-12T14:00:00.000Z",
  "rental": "Volvo",
  "rental_from": "2017-07-10T00:00:00.000Z",
  "rental_to": "2017-07-12T00:00:00.000Z"
}

Inside each of the functions is a simple PutItem request against a different DynamoDB table. The corresponding compensating function performs a DeleteItem against the same table to roll back the PutItem action.
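As a rough sketch (the table name and key schema here are assumptions, not the demo’s actual code), BookHotel and its compensating CancelHotel might look like this:

const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB.DocumentClient();

module.exports.bookHotel = async (event) => {
  // record the booking with a simple PutItem
  await dynamodb.put({
    TableName: 'hotel-bookings',   // hypothetical table name
    Item: {
      trip_id: event.trip_id,      // hypothetical partition key
      hotel: event.hotel,
      check_in: event.check_in,
      check_out: event.check_out
    }
  }).promise();

  return { booking_ref: event.trip_id };
};

// the compensating action rolls back the PutItem with a DeleteItem;
// deleting a non-existent item succeeds, so this is naturally idempotent
// and safe to retry until it succeeds
module.exports.cancelHotel = async (event) => {
  await dynamodb.delete({
    TableName: 'hotel-bookings',
    Key: { trip_id: event.trip_id }
  }).promise();

  return { cancelled: event.trip_id };
};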

The state machine passes the same input to each action in turn:

  1. BookHotel
  2. BookFlight
  3. BookRental

and records each result at a specific path (so as to avoid overwriting the input $ that is passed on to the next function).

In this naive implementation, we’ll apply the compensating action for any failure – hence the States.ALL below. In practice, you should consider retrying certain error types first – e.g. transient errors such as DynamoDB’s provisioned throughput exceeded exceptions.
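For illustration, here is what the BookHotel state might look like in the state machine definition (the Lambda ARN is a placeholder) – note the ResultPath that records the result without clobbering the input, and the Catch on States.ALL that routes any failure to the compensating action:

"BookHotel": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:book-hotel",
  "ResultPath": "$.BookHotelResult",
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "ResultPath": "$.BookHotelError",
      "Next": "CancelHotel"
    }
  ],
  "Next": "BookFlight"
}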

Success Case

Following the happy path, each of the actions is performed in turn and the state machine ends successfully.

Failure Cases

When failures strike, we need to apply the corresponding compensating actions in turn, depending on where the failure occurred.

In the examples below, if the failure happened at BookFlight, then both CancelFlight and CancelHotel will be executed to roll back any changes performed thus far.

Similarly, if the failure happened at BookRental, then all three compensating actions – CancelRental, CancelFlight and CancelHotel – will be executed in that order to roll back all the state changes from the transaction.

Each compensating action also has an infinite retry loop! In practice, there should be a reasonable upper limit on the number of retries before you alert for human intervention.
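Step Functions’ built-in Retry support makes the bounded version straightforward – a sketch with illustrative numbers:

"CancelHotel": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:cancel-hotel",
  "Retry": [
    {
      "ErrorEquals": ["States.ALL"],
      "IntervalSeconds": 1,
      "BackoffRate": 2.0,
      "MaxAttempts": 5
    }
  ],
  "Next": "CancelFlight"
}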

You can find the source code for this demo here.

Yubl’s road to Serverless – Part 5 – building better recommendations with Lambda, BigQuery and GrapheneDB

Note: see here for the rest of the series.

 

When I joined Yubl in April 2016, it had launched just 2 months earlier, after a long and chaotic development cycle that lasted more than 2 years – and all the while, there was a fully armed sales team before there was even a product!

Some seriously bad decisions happened at Yubl… and judging by Silicon Valley, this kind of decision making is far more common than we realised.

That said, many good things also happened at Yubl, and I had the pleasure to work with some of the best people I have met in my career. This post is about one of the ailing features we were able to quickly turn around with the power of AWS Lambda and using the right tool for the job.


Fans of Silicon Valley probably remember that scene from Season 3 when Richard and co walked into their shiny new Pied Piper office to find “Action” Jack Barker had hired an army of sales people before they even had a product.


A Broken Feature

Upon joining the company, I found out the app already had a Find People feature, although it didn’t do what I expected. The likes of Twitter and Facebook employ sophisticated algorithms to find people who share interests with you. Our feature, on the other hand, would return the first 30 users in MongoDB that you weren’t already following, ordered by account creation time. For most users this list would equate to the first 30 Yubl employees that installed the app… talk about rigging the game!

One of the devs made a valiant attempt to improve the feature by returning only users who have shared connections with you – either you both follow X or you are both followed by X.

However, the implementation was a series of expensive (and complicated) MongoDB queries per user request. Ultimately it was an approach that would not scale with throughput or complexity, as it used the wrong tool for the job.

Lambda + GrapheneDB = Efficient Graph Queries

I had previously worked with Neo4j at Gamesys, where I used it to analyze and model the complex in-game economy of an MMORPG.

A graph database like Neo4j is the perfect place to store our social graph, as it allows us to efficiently perform the kind of graph queries we need in order to find users you should follow, e.g. 2nd/3rd degree connections.
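To give a flavour of what such a query looks like, here is a hedged sketch of a 2nd degree “people you may know” query using the neo4j-driver package – the User label, FOLLOWS relationship and id property are assumptions, not our actual schema:

const neo4j = require('neo4j-driver');

const driver = neo4j.driver(
  process.env.NEO4J_URL,   // e.g. the GrapheneDB bolt endpoint
  neo4j.auth.basic(process.env.NEO4J_USER, process.env.NEO4J_PASSWORD)
);

async function suggestUsers(userId, limit = 30) {
  const session = driver.session();
  try {
    // users followed by people I follow, whom I don't already follow
    const result = await session.run(
      `MATCH (me:User { id: $userId })-[:FOLLOWS]->(:User)-[:FOLLOWS]->(suggestion:User)
       WHERE suggestion.id <> $userId
         AND NOT (me)-[:FOLLOWS]->(suggestion)
       RETURN suggestion.id AS id, count(*) AS mutualConnections
       ORDER BY mutualConnections DESC
       LIMIT toInteger($limit)`,
      { userId, limit });

    return result.records.map(r => ({
      id: r.get('id'),
      mutualConnections: r.get('mutualConnections').toNumber()
    }));
  } finally {
    await session.close();
  }
}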

GrapheneDB offers hosted Neo4j as a service, with built-in monitoring, dashboards, automated backups and scaling up. It was the perfect choice to get us going and start delivering value to our users quickly.

At this point in time we were already streaming all state changes in the system into Kinesis. To export all of our social graph into GrapheneDB and to keep it in sync with MongoDB we:

  1. ran a one-off task to export all the relationship data into GrapheneDB
  2. subscribed a Lambda function to the Relationship Kinesis stream to process any subsequent relationship changes and update the social graph (in GrapheneDB) in real time – sketched after this list
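A hedged sketch of that Lambda function – the event names and payload shape are assumptions, as the real stream schema isn’t shown here:

const neo4j = require('neo4j-driver');

const driver = neo4j.driver(
  process.env.NEO4J_URL,
  neo4j.auth.basic(process.env.NEO4J_USER, process.env.NEO4J_PASSWORD)
);

module.exports.handler = async (event) => {
  const session = driver.session();
  try {
    for (const record of event.Records) {
      // Kinesis record payloads arrive base64-encoded
      const evt = JSON.parse(
        Buffer.from(record.kinesis.data, 'base64').toString('utf8'));

      if (evt.eventType === 'UserFollowed') {          // hypothetical event name
        // MERGE keeps the update idempotent across Lambda retries
        await session.run(
          `MERGE (a:User { id: $followerId })
           MERGE (b:User { id: $followeeId })
           MERGE (a)-[:FOLLOWS]->(b)`,
          { followerId: evt.followerId, followeeId: evt.followeeId });
      } else if (evt.eventType === 'UserUnfollowed') { // hypothetical event name
        await session.run(
          `MATCH (a:User { id: $followerId })-[f:FOLLOWS]->(b:User { id: $followeeId })
           DELETE f`,
          { followerId: evt.followerId, followeeId: evt.followeeId });
      }
    }
  } finally {
    await session.close();
  }
};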

We then exposed the data via API Gateway and Lambda so that the client app and other internal services can use it to easily find suggested users for a user to follow.

Future Plans

Given the limitation that Neo4j requires all of your graph to be stored on one machine (and it has pretty taxing hardware requirements too), it was not the long-term solution for us.

Based on my estimates, the biggest instance available on GrapheneDB would suffice until we had more than 10M users. This was calculated from the average number of connections per user on our platform, using Twitter’s user stats as a guideline for where we might be at 10M users.

We can push that ceiling much further by moving to a batch model and preprocessing recommendations for each user, to reduce the number of live queries against a large graph. The recommendations can be restricted to active users only (e.g. users that have logged in within the last X days), and regenerated only when:

  • the recommendations are stale, i.e. not acted upon by the user for more than X days, so they might not be what the user wants; or
  • the user’s extended social graph has changed, i.e. followers/followees have new connections

From what I was able to gather, all the big social networks use a batch model for scalability and cost reasons.

As for a long-term solution, we hadn’t settled on anything. I looked at Facebook’s Giraph briefly, but it’s far more sophisticated than we were ready for. There are other “fantasy” ideas, like the Mosaic system described in this paper. It would have been a fantastic challenge had we got that far.

Finding Trending Users

Because we were still a small social network – with just over 800K installs – it wasn’t sufficient to make recommendations based on a user’s social graph alone, as most users had a pretty small social graph.

To bridge the gap we decided to also include trending users on the platform in your recommendations.

Thankfully, all of our events (e.g. X followed Y, X liked Y’s post, etc.) were streamed into Google BigQuery. We chose BigQuery because AWS Athena hadn’t been announced yet, and Redshift is not the right model for ad-hoc, live queries that need to respond quickly. Also, I had many years of experience using BigQuery at Gamesys, so it was a no-brainer at the time.

ps. if you’re curious about the difference between Athena and BigQuery, Lynn Langit gave a comprehensive comparison at Serverless Austin this year.

To find trending users, we worked with the product team to create a formula for calculating a user’s “trendiness”, based on the number of new followers gained in the last 24 hours. The follower count is weighted exponentially by how recently the user was followed. For instance, a follower gained in the past hour earns a score of 1, but a follower gained 3 hours ago would only earn a score of 0.1.
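A sketch of how such a score might be computed – the decay constant here is an assumption, chosen only to match the example above (a follow from the past hour scores 1, a follow from 3 hours ago scores 0.1):

// a hedged sketch; the decay constant is an assumption chosen to
// match the example above
function trendinessScore(followTimestamps, now = Date.now()) {
  const HOUR = 60 * 60 * 1000;
  return followTimestamps
    .filter(ts => now - ts <= 24 * HOUR)            // last 24 hours only
    .reduce((score, ts) => {
      const hoursAgo = Math.max((now - ts) / HOUR, 1);
      // exponential decay: 1 at 1 hour ago, 0.1 at 3 hours ago
      return score + Math.pow(10, -(hoursAgo - 1) / 2);
    }, 0);
}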

We created a cron job with CloudWatch Events and Lambda to perform the aforementioned query against BigQuery every 3 hours. To save on cost, our query would only process events that were inserted in the last 24 hours.
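With the Serverless framework, the cron job can be expressed as a schedule event – a minimal sketch, where the function and handler names are assumptions:

functions:
  trendingUsers:
    handler: functions/trending-users.handler
    timeout: 300   # the BigQuery query can take a while
    events:
      - schedule: rate(3 hours)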

The results are then saved into a DynamoDB table, which is overwritten at the end of each run.

Once again, we exposed the data via API Gateway and Lambda.
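A hedged sketch of what that API function might look like – the table name and score attribute are assumptions:

const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB.DocumentClient();

module.exports.handler = async () => {
  // the table is small (it's overwritten every run), so a Scan is fine here
  const res = await dynamodb.scan({ TableName: 'trending-users' }).promise();
  const users = res.Items.sort((a, b) => b.score - a.score);

  return {
    statusCode: 200,
    body: JSON.stringify(users)
  };
};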

Migration to new APIs

Now we had 2 new APIs: one to provide live suggestions based on a user’s social graph, and one to find users who were currently trending on our platform.

However, the client apps would need to be updated to take advantage of these new APIs. Instead of waiting for the client teams to catch up, we updated the legacy API’s suggestion endpoint to combine results from both, so we could deliver value to our users earlier.

“The lead time to someone saying thank you is the only reputation metric that matters.”

– Dan North

This is how it looks when we put everything together:

One of the most satisfying aspects of this work was how quickly we were able to turn this feature around and deploy the new system into production. Everything came together in less than 2 weeks, largely because we were able to focus on our business needs and let services such as Lambda, BigQuery and GrapheneDB deal with the undifferentiated heavy lifting.