Serverless 1.X – enable API Gateway caching on request parameters

I previously blogged about the untrodden path to enabling caching on API Gateway request parameters in the Serverless framework 0.5.X, so it’s a little disappointing that it’s still not officially fixed in the 1.X versions…

The Problem

The problem is two-fold:

  1. there’s currently no way to specify that caching should be enabled for path and query string parameters
  2. the CloudFormation template Serverless 1.X generates for API Gateway is missing a few optional fields; these missing fields also stop you from manually enabling caching in the API Gateway management console

After you deploy your Lambda function and its associated API, if you go to the management console and try to enable caching on path or request parameters, you will get an error saying “Invalid cache key parameter specified”.

The Workaround

A friend pointed me to a neat trick to modify the CloudFormation template that Serverless 1.X auto-generates for you.

After the project is deployed, you can go to CloudFormation and view the template that Serverless has generated. These templates are pretty big (and poorly formatted), so I find it easier to open them up in the Designer view and use that view to navigate to the endpoint I’m looking for.

Once you find the resource template for the endpoint, write down its name. Now go back to the serverless.yml file in your project, and add the resource name to the resources section at the bottom. You only need to include fields that you want to update or add to the template.

The CloudFormation syntax for an API Gateway method looks like this:
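(a minimal sketch; the logical ID, the resource references and the id path parameter below are illustrative, not what Serverless generates for your project)

```yaml
ApiGatewayMethodItemIdGet:            # illustrative logical ID
  Type: AWS::ApiGateway::Method
  Properties:
    HttpMethod: GET
    AuthorizationType: NONE
    RestApiId:
      Ref: ApiGatewayRestApi
    ResourceId:
      Ref: ApiGatewayResourceItemId
    RequestParameters:
      method.request.path.id: true    # declare the path parameter so it can be used as a cache key
    MethodResponses:
      - StatusCode: "200"
    Integration:
      Type: AWS_PROXY                 # the caching-related fields go here, see below
```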

We also need to fill in some blanks for the Integration section:
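(again a sketch, with an illustrative function reference and cache namespace; the fields that matter for caching are the Integration’s RequestParameters mapping, CacheNamespace and CacheKeyParameters)

```yaml
Integration:
  Type: AWS_PROXY
  IntegrationHttpMethod: POST
  Uri:
    Fn::Join:
      - ""
      - - "arn:aws:apigateway:"
        - Ref: AWS::Region
        - ":lambda:path/2015-03-31/functions/"
        - Fn::GetAtt: [GetItemLambdaFunction, Arn]   # illustrative function logical ID
        - "/invocations"
  RequestParameters:
    integration.request.path.id: method.request.path.id
  CacheNamespace: ItemIdGetCacheNS
  CacheKeyParameters:
    - method.request.path.id
```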

For more details on the CloudFormation syntax, see the AWS::ApiGateway::Method resource and its Integration property in the CloudFormation documentation.

After some trial-and-error, the minimum set of fields I had to add was:
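(roughly along these lines; the logical ID must match the one Serverless generated, and depending on how your version of Serverless merges the override you may also need to repeat the Integration’s Type, IntegrationHttpMethod and Uri)

```yaml
resources:
  Resources:
    ApiGatewayMethodItemIdGet:        # must match the logical ID Serverless generated
      Type: AWS::ApiGateway::Method
      Properties:
        RequestParameters:
          method.request.path.id: true
        Integration:
          RequestParameters:
            integration.request.path.id: method.request.path.id
          CacheNamespace: ItemIdGetCacheNS
          CacheKeyParameters:
            - method.request.path.id
```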

Redeploy with Serverless and the path parameter is now enabled for caching in the API Gateway console.

Wrap Up

I hope you have found this post useful. I was surprised by how little information I found on this topic during my research, and by the lack of official support in the Serverless framework.

If you know of a better way to do this, please let me know in the comments.


Yubl’s road to Serverless architecture – Part 4 – building a scalable push notification system

The Road So Far

part 1 : overview

part 2 : testing and continuous delivery strategies

part 3 : ops

 

Just before Yubl’s untimely demise we did an interesting piece of work to redesign the system for sending targeted push notifications to our users to improve retention.

The old system relied on MixPanel for both selecting users and sending out the push notifications. Whilst MixPanel was great for getting us basic analytics quickly, we soon found that our use cases outgrew it. The most pressing limitation was that we were not able to query users based on their social graph to create targeted push notifications – e.g. notify an influencer’s followers when he/she publishes a new post or runs a new social media campaign.

Since all of our analytics events are streamed to Google BigQuery (using a combination of Kinesis Firehose, S3 and Lambda) we have all the data we need to support the complex use cases the product team has.

What we needed was a push notification system that could integrate with BigQuery results and was capable of sending millions of push notifications in a batch.

Design Goals

From a high level, we need to support 2 types of notifications.

Ad-hoc notifications are driven by the marketing team, working closely with influencers and the BI team to match users with influencers or content that they might be interested in. Example notifications include:

  • users who follow Accessorize and other fashion brands might be interested to know when another notable fashion brand joins the platform
  • users who follow an influencer might be interested to know when the influencer publishes a new post or is running a social media campaign (usually with give-away prizes, etc.)
  • users who have shared/liked music-related content might be interested to know that Tinie Tempah has joined the platform

Scheduled notifications are driven by the product team. These notifications are designed to nudge users to finish the sign up process or to come back to the platform after they have lapsed. Example notifications include:

  • day-1 unfinished sign up : notify users who didn’t finish the sign up process to come back to complete the process
  • day-2 engagement : notify users to come back and follow more people or invite friends on day 2
  • day-21 inactive : notify users who have not logged into the app for 21 days to come back and check out what’s new

A/B testing

For the scheduled notifications, we wanted to test out different messages/layouts to optimise their effectiveness over time. To do that, we wanted the new system to support A/B testing (something MixPanel already offers).

We should be able to create multiple variants (each with a percentage), along with a control group who will not receive any push notifications.
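As a rough illustration of how variant assignment could work (the variant names, weights and hashing scheme below are assumptions for the sake of the example, not the exact implementation), hashing the user and campaign IDs keeps each user in the same group for a given campaign:

```javascript
const crypto = require('crypto');

// e.g. two message variants, a small control group, and the default copy for everyone else
const variants = [
  { name: 'variantA', weight: 0.10 },
  { name: 'variantB', weight: 0.10 },
  { name: 'control',  weight: 0.05 },   // receives no push notification at all
  { name: 'default',  weight: 0.75 }
];

function assignVariant(userId, campaignId) {
  // hash userId + campaignId so a user always lands in the same bucket for a given campaign
  const hash = crypto.createHash('md5').update(`${campaignId}:${userId}`).digest();
  const bucket = hash.readUInt32BE(0) / 0xffffffff;   // roughly uniform value in [0, 1]

  let cumulative = 0;
  for (const variant of variants) {
    cumulative += variant.weight;
    if (bucket <= cumulative) {
      return variant.name;
    }
  }
  return 'default';
}
```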

Oversight vs Frictionless

For the ad-hoc notifications, we don’t want to get in the way of the marketing team doing their job, so the process for creating ad-hoc push notifications should be as frictionless as possible. However, we also don’t want the marketing team to operate completely without oversight and run the risk of long term damage by spamming users with unwanted push notifications (which might cause users to disable notifications or even rage quit the app).

The compromise we reached was an automated approval process whereby:

  1. the marketing team will work with BI on a query to identify users (eg. followers of Tinie Tempah)
  2. fill in a request form, which informs designated approvers via email
  3. approvers can send themselves a test push notification to see how it will be formatted on both Android and iOS
  4. approvers can approve or reject the request
  5. once approved, the request will be executed

Implementation

We decided to use S3 as the source for a send-batch-notifications function because it allows us to pass a large list of users (remember, the goal is to support sending push notifications to millions of users in a batch) without having to worry about pagination or limits on payload size.

The function will work with any JSON file in the right format, and that JSON file can be generated in many ways:

  • by the cron jobs that generate scheduled notifications
  • by the approval system after an ad-hoc push notification is approved
  • by the approval system to send a test push notification to the approvers (to visually inspect how the message will appear on both Android and iOS devices)
  • by members of the engineering team when manual interventions are required

We also considered moving to SNS but decided against it in the end because it doesn’t provide a useful enough abstraction to justify the migration effort (which involves client work) and the additional cost of sending push notifications. Instead, we used node-gcm and apn to communicate with GCM and APN directly.

Recursive Functions FTW

Lambda has a hard limit of 5 mins execution time (it might be softened in the near future), and that might not be enough time to send millions of push notifications.

Our approach to long-running tasks like this is to run the Lambda function as a recursive function.

A naive recursive function would process the payload in fixed size batches and recurse at the end of each batch whilst passing along a token/position to allow the next invocation to continue from where it left off. In this particular case, we have additional considerations because the total number of work items can be very large:

  • minimising the no. of recursions required, which equates to the no. of Invoke requests to Lambda and carries a cost implication at scale
  • caching the content of the JSON file to improve performance (by avoiding loading and parsing a large JSON file more than once) and reduce S3 cost

To minimise the no. of recursions, our function would:

  1. process the list of users in small batches of 500
  2. at the end of each batch, call context.getRemainingTimeInMillis() to check how much time is left in this invocation
  3. if there is more than 1 min left in the invocation then process another batch; otherwise recurse

When caching the content of the JSON file from S3, we also need to compare the ETag to ensure that the content of the file hasn’t changed.
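Putting those pieces together, a stripped-down sketch of the send-batch-notifications function might look like the following (the payload shape, the sendPushNotifications helper and the exact thresholds are illustrative, not the actual implementation):

```javascript
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const lambda = new AWS.Lambda();

// cached across invocations that land on the same warm container
let cache = { etag: null, users: null };

// placeholder for the code that talks to GCM and APN (via node-gcm and apn)
async function sendPushNotifications(users) {
  console.log(`sending push notifications to ${users.length} users`);
}

module.exports.handler = async (event, context) => {
  const { bucket, key, offset = 0 } = event;

  // reload and re-parse the JSON file only if the cached copy is stale (ETag has changed)
  const head = await s3.headObject({ Bucket: bucket, Key: key }).promise();
  if (cache.etag !== head.ETag) {
    const obj = await s3.getObject({ Bucket: bucket, Key: key }).promise();
    cache = { etag: head.ETag, users: JSON.parse(obj.Body.toString('utf8')) };
  }

  let position = offset;
  while (position < cache.users.length) {
    const batch = cache.users.slice(position, position + 500);  // small batches of 500
    await sendPushNotifications(batch);
    position += batch.length;

    // recurse when less than 1 min is left in this invocation
    if (position < cache.users.length && context.getRemainingTimeInMillis() < 60000) {
      await lambda.invoke({
        FunctionName: context.functionName,
        InvocationType: 'Event',   // fire-and-forget, don't wait for the result
        Payload: JSON.stringify({ bucket, key, offset: position })
      }).promise();
      return;
    }
  }
};
```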

With this setup, the system was able to easily handle JSON files with more than 1 million users during our load test (sorry Apple and Google for sending all those fake device tokens :-P).

Auto-scaling Kinesis streams with AWS Lambda

Following on from the last post where we discussed 3 useful tips for working effectively with Lambda and Kinesis, let’s look at how you can use Lambda to help you auto scale Kinesis streams.

Auto-scaling for DynamoDB and Kinesis are two of the most frequently requested features for AWS; as I write this post, I’m sure the folks at AWS are working hard to make them happen. Until then, here’s how you can roll a cost-effective solution yourself.

From a high level, we want to:

  • scale up Kinesis streams quickly to meet increases in load
  • scale down under-utilised Kinesis streams to save cost

Scaling Up

Reaction time is important for scaling up, and from personal experience I find polling CloudWatch metrics to be a poor solution because:

  • CloudWatch metrics are usually over a minute behind
  • depending on polling frequency, your reaction time is even further behind
  • high polling frequency has a small cost impact

sidebar: I briefly experimented with the Kinesis scaling utility from AWS Labs before deciding to implement our own solution. I found that it doesn’t scale up fast enough because it uses this polling approach, and I had experienced similar reaction-time issues with dynamic-dynamodb too.


Instead, prefer a push-based approach using CloudWatch Alarms.

Whilst CloudWatch Alarms are not available as a trigger for Lambda functions, you can use SNS as a proxy:

  1. add an SNS topic as the notification target for the CloudWatch Alarm
  2. add the SNS topic as a trigger for a Lambda function that scales up the stream that tripped the alarm

WHAT metrics?

You can use a number of metrics to trigger the scaling action; here are a few to consider.

WriteProvisionedThroughputExceeded (stream)

The simplest way is to scale up as soon as you’re throttled. With a stream-level metric you only need to set up the alarm once per stream and wouldn’t need to adjust the threshold value after each scaling action.

However, since you’re reusing the same CloudWatch Alarm you must remember to set its status to OK after scaling up.
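Resetting the alarm is a single API call at the end of the scale-up function; a sketch (the alarm name is whatever you gave the alarm when you created it):

```javascript
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

// reset the alarm after scaling up so the same alarm can fire again for the next spike
async function resetAlarm(alarmName) {
  await cloudwatch.setAlarmState({
    AlarmName: alarmName,
    StateValue: 'OK',
    StateReason: 'stream has been scaled up'
  }).promise();
}
```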

IncomingBytes and/or IncomingRecords (stream)

You can scale up preemptively (before you’re actually throttled by the service) by calculating the provisioned throughput and then setting the alarm threshold to be, say, 80% of the provisioned throughput. After all, this is exactly what we’d do for scaling EC2 clusters, and the same principle applies here – why wait till you’re impacted by load when you can scale up just ahead of time?
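As a quick sketch of the arithmetic (assuming the alarms use the Sum statistic over a 60-second period, and the per-shard write limits of 1 MB/s and 1,000 records/s):

```javascript
// 80% of what the stream's open shards can absorb during one alarm period
function alarmThresholds(openShardCount, periodSeconds = 60) {
  return {
    incomingBytes: 0.8 * openShardCount * 1024 * 1024 * periodSeconds,
    incomingRecords: 0.8 * openShardCount * 1000 * periodSeconds
  };
}
```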

However, we need to manage some additional complexities EC2 auto scaling service usually takes care of for us:

  • if we alarm on both IncomingBytes and IncomingRecords then it’s possible to overscale (which impacts cost) if both trigger around the same time; this can be mitigated, but it’s down to us to ensure only one scaling action can occur at a time and that there’s a cooldown after each scaling activity
  • after each scaling activity, we need to recalculate the provisioned throughput and update the alarm threshold(s)

WriteProvisionedThroughputExceeded (shard)

IncomingBytes and/or IncomingRecords (shard)

With shard level metrics you get the benefit of knowing the shard ID (in the SNS message) so you can be more precise when scaling up by splitting specific shard(s). The downside is that you have to add or remove CloudWatch Alarms after each scaling action.

HOW to scale up

To actually scale up a Kinesis stream, you’ll need to increase the no. of active shards by splitting one or more of the existing shards. One thing to keep in mind is that once a shard is split into 2, it’s no longer ACTIVE but it will still be accessible for up to 7 days (depending on your retention policy setting) and you’ll still pay for it the whole time!

Broadly speaking, you have two options available to you:

  1. use UpdateShardCount and let Kinesis figure out how to do it
  2. choose one or more shards and split them yourself using SplitShard

Option 1 is far simpler but comes with some heavy baggage:

  • because it only supports UNIFORM_SCALING (at the time of writing), this action can result in many temporary shards being created unless you double up each time (remember, you’ll pay for all those temporary shards for up to 7 days)
  • doubling up can be really expensive at scale (and possibly unnecessary depending on load pattern)
  • plus all the other limitations

As for Option 2, if you’re using shard level metrics then you can split only the shards that have triggered the alarm(s). Otherwise, a simple strategy would be to sort the shards by their hash range and split the biggest shards first.
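A sketch of that simple strategy, using the aws-sdk’s listShards and splitShard operations (error handling and the guard that ensures only one scaling action runs at a time are left out):

```javascript
const AWS = require('aws-sdk');
const kinesis = new AWS.Kinesis();

// split the open shard with the widest hash key range down the middle
async function splitBiggestShard(streamName) {
  const resp = await kinesis.listShards({ StreamName: streamName }).promise();

  // open shards have no ending sequence number
  const openShards = resp.Shards.filter(s => !s.SequenceNumberRange.EndingSequenceNumber);

  const biggest = openShards
    .map(shard => {
      const start = BigInt(shard.HashKeyRange.StartingHashKey);
      const end = BigInt(shard.HashKeyRange.EndingHashKey);
      return { shard, start, end, size: end - start };
    })
    .sort((a, b) => (a.size < b.size ? 1 : -1))[0];   // widest hash range first

  const midpoint = biggest.start + (biggest.end - biggest.start) / 2n;

  await kinesis.splitShard({
    StreamName: streamName,
    ShardToSplit: biggest.shard.ShardId,
    NewStartingHashKey: midpoint.toString()
  }).promise();
}
```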

Scaling Down

To scale down a Kinesis stream you merge two adjacent shards. Just as splitting a shard leaves behind an inactive shard that you’ll still pay for, merging shards will leave behind two inactive shards!

Since scaling down is primarily a cost saving exercise, I strongly recommend that you don’t scale down too often, as you could easily end up increasing your cost instead if you have to scale up soon after scaling down (hence leaving behind lots of inactive shards).

Since we want to scale down infrequently, it makes more sense to do so with a cron job (i.e. CloudWatch Event + Lambda) than to use CloudWatch Alarms. As an example, after some trial and error we settled on scaling down once every 36 hours, which is 1.5x our retention policy of 24 hours.

WHICH stream

When the cron job runs, our Lambda function would iterate through all the Kinesis streams and for each stream:

  • calculate its provisioned throughput in terms of both bytes/s and records/s
  • get 5 min metrics (IncomingBytes and IncomingRecords) over the last 24 hours
  • if all the data points over the last 24 hours are below 50% of the provisioned throughput then scale down the stream

The reason we went with 5 min metrics is that it’s the granularity the Kinesis dashboard uses, which allows me to validate my calculations (you don’t get bytes/s and records/s values from CloudWatch directly; you need to calculate them yourself).

Also, we require all datapoints over the last 24 hours to be below the 50% threshold to be absolutely sure that the utilization level is consistently low, rather than a temporary dip (which could be the result of an outage, for example).
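A sketch of that check (the 1 MB/s and 1,000 records/s per-shard write limits come from the Kinesis documentation; the function and variable names are illustrative):

```javascript
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

// true if every 5-minute datapoint over the last 24 hours is below 50% of what the
// stream's open shards could have absorbed in that 5-minute window
async function isUnderUtilised(streamName, openShardCount) {
  const period = 300;                                              // 5-minute datapoints
  const maxBytesPerPeriod = openShardCount * 1024 * 1024 * period; // 1 MB/s per shard
  const maxRecordsPerPeriod = openShardCount * 1000 * period;      // 1,000 records/s per shard

  const now = new Date();
  const dayAgo = new Date(now.getTime() - 24 * 60 * 60 * 1000);

  const getSums = async (metricName) => {
    const resp = await cloudwatch.getMetricStatistics({
      Namespace: 'AWS/Kinesis',
      MetricName: metricName,
      Dimensions: [{ Name: 'StreamName', Value: streamName }],
      StartTime: dayAgo,
      EndTime: now,
      Period: period,
      Statistics: ['Sum']
    }).promise();
    return resp.Datapoints.map(dp => dp.Sum);
  };

  const bytes = await getSums('IncomingBytes');
  const records = await getSums('IncomingRecords');

  return bytes.every(sum => sum < 0.5 * maxBytesPerPeriod)
      && records.every(sum => sum < 0.5 * maxRecordsPerPeriod);
}
```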

HOW to scale down

We have the same trade-offs between using UpdateShardCount and doing it yourself with MergeShards as we do when scaling up.

Wrapping Up

To set up the initial CloudWatch Alarms for a stream, we have a repo which hosts the configurations for all of our Kinesis streams, as well as a script for creating any missing streams and associated CloudWatch Alarms (using CloudFormation templates).

Additionally, the configuration file also specifies the min and max no. of shards for each Kinesis stream. When the create-streams script creates a new stream, it’ll be created with the specified desiredShards no. of shards.

 

I hope you enjoyed this post. Please let me know in the comments below if you are doing something similar to auto-scale your Kinesis streams, or if you have any experience you’d like to share.

 


AWS Lambda – 3 pro tips for working with Kinesis streams

At Yubl, we arrived at a non-trivial serverless architecture in which Lambda and Kinesis were prominent features.

Whilst our experience using Lambda with Kinesis was great in general, there were a couple of lessons that we had to learn along the way. Here are 3 useful tips to help you avoid some of the pitfalls we fell into and accelerate your own adoption of Lambda and Kinesis.

Consider partial failures

From the Lambda documentation:

AWS Lambda polls your stream and invokes your Lambda function. Therefore, if a Lambda function fails, AWS Lambda attempts to process the erring batch of records until the time the data expires…

Because of the way Lambda functions are retried, if you allow your function to err on partial failures, the default behavior is to retry the entire batch until it succeeds or the data expires from the stream.

To decide if this default behavior is right for you, you have to answer certain questions:

  • can events be processed more than once?
  • what if those partial failures are persistent? (perhaps due to a bug in the business logic that is not handling certain edge cases gracefully)
  • is it more important to process every event to success than to keep the overall system real-time?

In the case of Yubl (which was a social networking app with a timeline feature similar to Twitter) we found that for most of our use cases it’s more important to keep the system flowing than to halt processing for any failed events, even if for a minute.

For instance, when you create a new post, we would distribute it to all of your followers by processing the yubl-posted event. The 2 basic choices we’re presented with are:

  1. allow errors to bubble up and fail the invocation—we give every event every opportunity to be processed; but if some events fail persistently then no one will receive new posts in their feed and the system appears unavailable
  2. catch and swallow partial failures—failed events are discarded, some users will miss some posts but the system appears to be running normally to users (even affected users might not realize that they had missed some posts)

(of course, it doesn’t have to be a binary choice, there’s plenty of room to add smarter handling for partial failures, which we will discuss shortly)

We encapsulated these 2 choices as part of our tooling so that we get the benefit of reusability and the developers can make an explicit choice (and the code makes that choice obvious to anyone reading it later on) for every Kinesis processor they create.
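A simplified sketch of what such a wrapper could look like (the real tooling was more involved; the names here are illustrative):

```javascript
// 'propagate' fails the invocation so Lambda retries the whole batch;
// 'swallow' logs the error and moves on so the stream keeps flowing
function createKinesisProcessor(processRecord, onPartialFailure = 'propagate') {
  return async (event) => {
    for (const record of event.Records) {
      const payload = JSON.parse(Buffer.from(record.kinesis.data, 'base64').toString('utf8'));
      try {
        await processRecord(payload);
      } catch (err) {
        if (onPartialFailure === 'swallow') {
          console.error('swallowing failed event', err, payload);
        } else {
          throw err;   // let Lambda retry the batch
        }
      }
    }
  };
}

// usage: the choice is made explicitly, and visibly, for each Kinesis processor, e.g.
// module.exports.handler = createKinesisProcessor(distributeYublPost, 'swallow');
```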

You would probably apply different choices depending on the problem you’re solving; the important thing is to always consider how partial failures would affect your system as a whole.

Use dead letter queues (DLQ)

AWS announced support for Dead Letter Queues (DLQ) at the end of 2016; however, at the time of writing this support only extends to asynchronous invocations (SNS, S3, IoT, etc.) and not to poll-based invocations such as Kinesis and DynamoDB streams.

That said, there’s nothing stopping you from applying the DLQ concept yourself.

First, let’s roll back the clock to a time when we didn’t have Lambda. Back then, we’d use long running applications to poll Kinesis streams ourselves. Heck, I even wrote my own producer and consumer libraries because when AWS rolled out Kinesis they totally ignored anyone not running on the JVM!

Lambda has taken over a lot of the responsibilities—polling, tracking where you are in the stream, error handling, etc.—but as we have discussed above it doesn’t remove you from the need to think for yourself. Nor does it change what good looks like for a system that processes Kinesis events, which for me must have at least these 3 qualities:

  • it should be real-time (most domains consider real-time as “within a few seconds”)
  • it should retry failed events, but retries should not violate the realtime constraint on the system
  • it should be possible to retrieve events that could not be processed so someone can investigate root cause or provide manual intervention

Back then, my long running application would:

  1. poll Kinesis for events
  2. process the events by passing them to a delegate function (your code)
  3. failed events are retried 2 additional times
  4. after the 2 retries are exhausted, they are saved into a SQS queue
  5. record the last sequence number of the batch so that we don’t lose the current progress if the host VM dies or the application crashes
  6. another long running application (perhaps on another VM) would poll the SQS queue for events that couldn’t be processed in realtime
  7. process the failed events by passing them to the same delegate function as above (your code)
  8. after the max no. of retrievals the events are passed off to a DLQ
  9. this triggers CloudWatch alarms and someone can manually retrieve the event from the DLQ to investigate

A Lambda function that processes Kinesis events should also:

  • retry failed events X times depending on processing time
  • send failed events to a DLQ after exhausting X retries

Since SNS already comes with DLQ support, you can simplify your setup by sending the failed events to an SNS topic instead – Lambda would then process them a further 3 times before passing them off to the designated DLQ.
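In practice that means: when a record still fails after our own retries, publish it to a dedicated SNS topic and let the Lambda function subscribed to that topic (which has a proper DLQ configured) deal with it. A sketch, where the topic ARN environment variable is an assumption:

```javascript
const AWS = require('aws-sdk');
const sns = new AWS.SNS();

// hand a failed event over to the SNS topic whose subscriber has a DLQ configured
async function sendToFailedEventsTopic(payload, error) {
  await sns.publish({
    TopicArn: process.env.FAILED_EVENTS_TOPIC_ARN,   // hypothetical environment variable
    Message: JSON.stringify(payload),
    MessageAttributes: {
      error: { DataType: 'String', StringValue: error.message }
    }
  }).promise();
}
```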

Avoid “hot” streams

We found that when a Kinesis stream has 5 or more Lambda function subscribers we would start to see lots of ReadProvisionedThroughputExceeded errors in CloudWatch. Fortunately these errors are silent to us, as they happen to (and are handled by) the Lambda service polling the stream.

However, we occasionally see spikes in the GetRecords.IteratorAge metric, which tells us that a Lambda function will sometimes lag behind. This did not happen frequently enough to present a problem but the spikes were unpredictable and did not correlate to spikes in traffic or number of incoming Kinesis events.

Increasing the no. of shards in the stream made matters worse, and the no. of ReadProvisionedThroughputExceeded errors increased proportionally.


According to the Kinesis documentation:

Each shard can support up to 5 transactions per second for reads, up to a maximum total data reads of 2 MB per second.

and the Lambda documentation:

If your stream has 100 active shards, there will be 100 Lambda functions running concurrently. Then, each Lambda function processes events on a shard in the order that they arrive.

One would assume that each of the aforementioned Lambda functions would be polling its shard independently. Since the problem is having too many Lambda functions poll the same shard, it makes sense that adding new shards will only escalate the problem further.


“All problems in computer science can be solved by another level of indirection.”

—David Wheeler

After speaking to the AWS support team about this, the only advice we received (and one that we had already considered) was to apply the fan out pattern by adding another layer of Lambda functions that would distribute the Kinesis events to the others.

Whilst this is simple to implement, it has some downsides:

  • it vastly complicates the logic for handling partial failures (see above)
  • all functions now process events at the rate of the slowest function, potentially damaging the realtime-ness of the system

We also considered and discounted several other alternatives, including:

  • have one stream per subscriber—this has a significant cost implication, and more importantly it means publishers would need to publish the same event to multiple Kinesis streams in a “transaction” with no easy way to rollback (since you can’t unpublish an event in Kinesis) on partial failures
  • roll multiple subscribers’ logic into one—this erodes our service boundaries as different subsystems are bundled together to artificially reduce the no. of subscribers

In the end, we didn’t find a truly satisfying solution and decided to reconsider if Kinesis was the right choice for our Lambda functions on a case by case basis.

  • for subsystems that do not have to be realtime, use S3 as the source instead—all our Kinesis events are persisted to S3 via Kinesis Firehose; the resulting S3 files can then be processed by these subsystems, e.g. Lambda functions that stream events to Google BigQuery for BI
  • for work that is task-based (i.e. order is not important), use SNS/SQS as the source instead—SNS is natively supported by Lambda, and we implemented a proof-of-concept architecture for processing SQS events with recursive Lambda functions, with elastic scaling; now that SNS has DLQ support (it was not available at the time), it would definitely be the preferred option, provided that its degree of parallelism would not flood and overwhelm downstream systems such as databases, etc.
  • for everything else, continue to use Kinesis and apply the fan out pattern as an absolute last resort

Wrapping up…

So there you have it, 3 pro tips from a group of developers who have had the pleasure of working extensively with Lambda and Kinesis.

I hope you find this post useful. If you have any interesting observations or lessons from your own experience working with Lambda and Kinesis, please share them in the comments section below.

Links

Yubl’s road to serverless — Part 1, Overview

Yubl’s road to serverless — Part 2, Testing and CI/CD

Yubl’s road to serverless — Part 3, Ops

AWS Lambda — use recursive functions to process SQS messages, Part 1

AWS Lambda — use recursive functions to process SQS messages, Part 2

AWS Lambda – build yourself a URL shortener in 2 hours

An interesting requirement came up at work this week where we discussed potentially having to run our own URL Shortener because the Universal Links mechanism (in iOS 9 and above) requires a JSON manifest at

https://domain.com/apple-app-site-association

Since the OS doesn’t follow redirects, this manifest has to be hosted on the URL shortener’s root domain.

Owing to a limitation in AppsFlyer, it’s currently not able to shorten links when you have Universal Links configured for your app. Whilst we could switch to another vendor, that would mean more work for our (already stretched) client devs, and we really like AppsFlyer‘s support for attributions.

Which brings us back to the question

“should we build a URL shortener?”

swiftly followed by

“how hard can it be to build a scalable URL shortener in 2017?”

Well, turns out it wasn’t hard at all 

Lambda FTW

For this URL shortener we’ll need several things:

  1. a GET /{shortUrl} endpoint that will redirect you to the original URL
  2. a POST / endpoint that will accept an original URL and return the shortened URL
  3. an index.html page where someone can easily create short URLs
  4. a GET /apple-app-site-association endpoint that serves a static JSON response

all of which can be accomplished with API Gateway + Lambda.

Overall, this is the project structure I ended up with:

  • using the Serverless framework’s aws-nodejs template
  • each of the above endpoints has a corresponding handler function
  • the index.html file is in the static folder
  • the test cases are written in such a way that they can be used both as integration as well as acceptance tests
  • there’s a build.sh script which facilitates running
    • integration tests, eg ./build.sh int-test {env} {region} {aws_profile}
    • acceptance tests, eg ./build.sh acceptance-test {env} {region} {aws_profile}
    • deployment, eg ./build.sh deploy {env} {region} {aws_profile}

GET /apple-app-site-association endpoint

Seeing as this is a static JSON blob, it makes sense to precompute the HTTP response and return it every time.
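For example, the whole handler can be reduced to returning a module-level constant (the manifest content below is a placeholder, not our real one):

```javascript
// precompute the response once, at module load time
const appleAppSiteAssociation = {
  applinks: {
    apps: [],
    details: [{ appID: 'TEAMID.com.example.app', paths: ['*'] }]   // placeholder values
  }
};

const response = {
  statusCode: 200,
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(appleAppSiteAssociation)
};

module.exports.handler = async () => response;
```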

POST / endpoint

For an algorithm to shorten URLs, you can find a very simple and elegant solution on StackOverflow. All you need is an auto-incremented ID, like the ones you normally get with an RDBMS.

However, I find DynamoDB a more appropriate DB choice here because:

  • it’s a managed service, so no infrastructure for me to worry about
  • OPEX over CAPEX, man!
  • I can scale reads & writes throughput elastically to match utilization level and handle any spikes in traffic

But DynamoDB has no concept of an auto-incremented ID, which the algorithm needs. Instead, you can use an atomic counter to simulate an auto-incremented ID (at the expense of an extra write unit per request).
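A sketch of the atomic counter using the DocumentClient (table, key and attribute names are illustrative):

```javascript
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB.DocumentClient();

// bump an atomic counter and use the new value as the "auto-incremented" ID
async function nextId() {
  const resp = await dynamodb.update({
    TableName: process.env.COUNTER_TABLE,
    Key: { id: 'url-counter' },
    UpdateExpression: 'ADD #counter :incr',
    ExpressionAttributeNames: { '#counter': 'counter' },
    ExpressionAttributeValues: { ':incr': 1 },
    ReturnValues: 'UPDATED_NEW'
  }).promise();
  return resp.Attributes.counter;
}
```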

GET /{shortUrl} endpoint

Once we have the mapping in a DynamoDB table, the redirect endpoint is a simple matter of fetching the original URL and returning it as part of the Location header.

Oh, and don’t forget to return the appropriate HTTP status code, in this case a 308 Permanent Redirect.
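The handler is only a few lines; a sketch (table and attribute names are illustrative):

```javascript
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB.DocumentClient();

module.exports.handler = async (event) => {
  const { shortUrl } = event.pathParameters;

  const resp = await dynamodb.get({
    TableName: process.env.URLS_TABLE,
    Key: { shortUrl }
  }).promise();

  if (!resp.Item) {
    return { statusCode: 404, body: 'not found' };
  }

  // 308 preserves the request method, unlike 301/302
  return {
    statusCode: 308,
    headers: { Location: resp.Item.longUrl }
  };
};
```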

GET / index page

Finally, for the index page, we’ll need to return some HTML instead (and a different content-type to go with the HTML).

I decided to put the HTML file in a static folder, which is loaded and cached the first time the function is invoked.
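A sketch of that lazy-load-and-cache behaviour:

```javascript
const fs = require('fs');
const path = require('path');

let html;   // cached across invocations that land on the same warm container

module.exports.handler = async () => {
  if (!html) {
    // loaded on the first invocation, then reused
    html = fs.readFileSync(path.join(__dirname, 'static', 'index.html'), 'utf8');
  }

  return {
    statusCode: 200,
    headers: { 'Content-Type': 'text/html' },
    body: html
  };
};
```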

Getting ready for production

Fortunately I have had plenty of practice getting Lambda functions to production readiness, and for this URL shortener we will need to:

  • configure auto-scaling parameters for the DynamoDB table (we have an internal system for managing the auto-scaling side of things)
  • turn on caching in API Gateway for the production stage

Future Improvements

If you put in the same URL multiple times you’ll get back different short URLs; one optimization (for storage and caching) would be to return the same short URL instead.

To accomplish this, you can:

  1. add GSI to the DynamoDB table on the longUrl attribute to support efficient reverse lookup
  2. in the shortenUrl function, query the GSI to find any existing short URL(s) for the given long URL

I think it’s better to add a GSI than to create a new table here because it avoids having “transactions” that span across multiple tables.
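A sketch of the reverse lookup against the GSI (the index, table and attribute names are illustrative):

```javascript
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB.DocumentClient();

// returns an existing short URL for this long URL, or undefined if there isn't one yet
async function findExistingShortUrl(longUrl) {
  const resp = await dynamodb.query({
    TableName: process.env.URLS_TABLE,
    IndexName: 'longUrl-index',
    KeyConditionExpression: 'longUrl = :longUrl',
    ExpressionAttributeValues: { ':longUrl': longUrl },
    Limit: 1
  }).promise();
  return resp.Items.length > 0 ? resp.Items[0].shortUrl : undefined;
}
```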

Useful Links