Capture and forward correlation IDs through different Lambda event sources

Serverless architectures are microservices by default, so you need correlation IDs to help debug issues that span across multiple functions, and possibly across different event source types – asynchronous, synchronous and streams.

This is the last of a 3-part mini series on managing your AWS Lambda logs.

If you haven’t read part 1 yet, please give it a read now. We’ll be building on top of the basic infrastructure of shipping logs from CloudWatch Logs detailed in that post.

part 1 : centralise logging

part 2: tips and tricks

Why correlation IDs?

As your architecture becomes more complex, many services have to work together in order to deliver the features your users want.

Microservice death stars, circa 2015.

When everything works, it’s like watching an orchestra: lots of small pieces all acting independently whilst at the same time collaborating to form a whole that’s greater than the sum of its parts.

However, when things don’t work, it’s a pain in the ass to debug. Finding that one clue is like finding a needle in a haystack as there are so many moving parts, and they’re all constantly moving.

Imagine you’re an engineer at Twitter and trying to debug why a user’s tweet was not delivered to one of his followers’ timeline.

“Let me cross reference the logs from hundreds of services and find the logs that mention the author’s user ID, the tweet ID, or the recipient’s user ID, and put together a story of how the tweet flowed through our system and why it wasn’t delivered to the recipient’s timeline.”

“What about logs that don’t explicitly mention those fields?”

“mm… let me get back to you on that…”

Needle in the haystack.

This is the problem that correlation IDs solve in the microservice world – to tag every log message with the relevant context so that it’s easy to find them later on.

Aside from common IDs such as user ID, order ID, tweet ID, etc. you might also want to include the X-Ray trace ID in every log message. That way, if you’re using X-Ray with Lambda then you can use it to quickly load up the relevant trace in the X-Ray console.

By default, Lambda automatically generates a _X_AMZN_TRACE_ID value and exposes it as an environment variable.

Also, if you’re going to add a bunch of correlation IDs to every log message then you should consider switching to JSON as your log format. You’ll then need to update the ship-logs function we introduced in part 1 to handle log messages that are formatted as JSON.

Enable debug logging on the entire call chain

Another common problem people run into is that, by the time we realise there’s a problem in production, we find that the crucial piece of information we need to debug it was logged at DEBUG level – and we disable DEBUG logs in production because they’re too noisy.

“Darn it, now we have to enable debug logging and redeploy all these services! What a pain!”

“Don’t forget to disable debug logging and redeploy them, after you’ve found the problem ;-)”

Fortunately it doesn’t have to be a catch-22 situation. You can enable DEBUG logging on the entire call chain by:

  1. make the decision to enable DEBUG logging (for, say, 5% of all requests) at the edge service
  2. pass that decision along with the correlation IDs on all outward requests
  3. on receiving a request from the edge service, possibly through async event sources such as SNS, the intermediate services capture this decision and turn on DEBUG logging if asked to do so
  4. the intermediate services also pass that decision along with the correlation IDs on all of their outward requests

The edge service decides to turn DEBUG logging on for 5% of requests; that decision is captured and passed along throughout the entire call chain, through HTTP requests, SNS messages and Kinesis events.
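
For the first step, here’s a minimal sketch of how the edge service might make that sampling decision (the 5% rate and the Debug-Log-Enabled field name follow the conventions used later in this post):

```js
// a sketch of step 1: the edge service rolls the dice and records the
// decision alongside the correlation IDs it's about to forward; the 5%
// sample rate and the Debug-Log-Enabled field name follow the conventions
// used later in this post
const DEBUG_SAMPLE_RATE = 0.05; // enable DEBUG logs for ~5% of requests

function withDebugDecision(correlationIds) {
  if (Math.random() < DEBUG_SAMPLE_RATE) {
    // downstream services don't roll the dice again – they simply honour
    // this flag when it arrives alongside the correlation IDs
    correlationIds['Debug-Log-Enabled'] = 'true';
  }
  return correlationIds;
}

module.exports = { withDebugDecision };
```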

Capture and forward correlation IDs

With that out of the way, let’s dive into some code to see how you can actually make it work. If you want to follow along, then the code is available in this repo, and the architecture of the demo project looks like this:

The demo project consists of an edge API, api-a, which initialises a bunch of correlation IDs as well as the decision on whether or not to turn on debug logging. It’ll pass these along through HTTP requests to api-b, Kinesis events and SNS messages. Each of these downstream functions would in turn capture and pass them along to api-c.

We can take advantage of the fact that concurrency is now managed by the platform – each container handles only one invocation at a time – which means we can safely use global variables to store contextual information relevant to the current invocation.

In the handler function we can capture incoming correlation IDs in global variables, and then include them in log messages, as well as any outgoing messages/HTTP requests/events, etc.

To abstract away the implementation details, let’s create a requestContext module that makes it easy to fetch and update this contextual data:
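
Here’s a minimal sketch of what such a requestContext module might look like, assuming the x-correlation- naming convention used by the demo project:

```js
// requestContext.js – a minimal sketch of a module that stores contextual
// data in a global variable; this is safe because a Lambda container only
// handles one invocation at a time
let context = {};

// replace the entire context at the start of an invocation
function replaceAllWith(ctx) {
  context = ctx;
}

// add a single correlation ID; the x-correlation- prefix follows the
// naming convention used in this post
function set(key, value) {
  if (!key.startsWith('x-correlation-')) {
    key = 'x-correlation-' + key;
  }
  context[key] = value;
}

function get() {
  return context;
}

module.exports = { replaceAllWith, set, get };
```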

And then add a log module which:

  • disables DEBUG logging by default
  • enables DEBUG logging if explicitly overridden via environment variables, or if a Debug-Log-Enabled field was captured in the incoming request alongside other correlation IDs
  • logs messages as JSON
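
A minimal sketch of such a log module might look like this (the DEBUG_LOG environment variable name is an illustrative assumption):

```js
// log.js – a minimal sketch of a JSON logger that respects the captured
// debug decision; the DEBUG_LOG env var name is illustrative
const requestContext = require('./requestContext');

// DEBUG logging is disabled by default, unless explicitly overridden via
// an environment variable or the Debug-Log-Enabled flag that was captured
// alongside the correlation IDs
function isDebugEnabled() {
  const context = requestContext.get();
  return process.env.DEBUG_LOG === 'true' ||
    context['Debug-Log-Enabled'] === 'true';
}

function log(level, message, params) {
  if (level === 'DEBUG' && !isDebugEnabled()) {
    return;
  }

  // include the captured correlation IDs in every log message, as JSON
  const logMsg = Object.assign({}, requestContext.get(), params, {
    level,
    message
  });
  console.log(JSON.stringify(logMsg));
}

module.exports = {
  debug: (msg, params) => log('DEBUG', msg, params),
  info: (msg, params) => log('INFO', msg, params),
  warn: (msg, params) => log('WARN', msg, params),
  error: (msg, params) => log('ERROR', msg, params)
};
```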

Once we start capturing correlation IDs, our log messages would look something like this:

Notice that I have also captured the User-Agent from the incoming request, as well as the decision to not enable DEBUG logging.

Now let’s see how we can capture and forward correlation IDs through API Gateway and outgoing HTTP requests.

API Gateway

You can capture and pass along correlation IDs via HTTP headers. The trick is making sure that everyone in the team follows the same conventions.

To standardise these conventions (what to name headers that are correlation IDs, etc.) you can provide a factory function that your developers can use to create API handlers. Something like this perhaps:
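
Here’s a rough sketch of what that factory function might look like, building on the requestContext module above (the header names follow this post’s conventions):

```js
// apiHandler.js – a rough sketch of a factory function for API Gateway
// handlers; header names follow the conventions used in this post
const requestContext = require('./requestContext');

function captureCorrelationIds(headers, awsRequestId) {
  const context = {};

  for (const header in headers) {
    // capture anything that follows the x-correlation- naming convention
    if (header.toLowerCase().startsWith('x-correlation-')) {
      context[header] = headers[header];
    }
  }

  // also capture the User-Agent and the debug logging decision
  if (headers['User-Agent']) {
    context['User-Agent'] = headers['User-Agent'];
  }
  if (headers['Debug-Log-Enabled']) {
    context['Debug-Log-Enabled'] = headers['Debug-Log-Enabled'];
  }

  // if we're the first service in the call chain, initialise the
  // correlation ID with the AWS Request ID for this invocation
  if (!context['x-correlation-id']) {
    context['x-correlation-id'] = awsRequestId;
  }

  requestContext.replaceAllWith(context);
}

// pass your handler code to the factory and get back an API Gateway handler
module.exports = handlerCode => (event, context, callback) => {
  captureCorrelationIds(event.headers || {}, context.awsRequestId);
  return handlerCode(event, context, callback);
};
```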

When you need to implement another HTTP endpoint, pass your handler code to this factory function. Now, with minimal change, all your logs will have the captured correlation IDs (as well as User-Agent, whether to enable debug logging, etc.).

The api-a function in our earlier architecture looks something like this:
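
A simplified sketch (the API_B_URL environment variable and module paths are illustrative, and the SNS and Kinesis publishes are omitted for brevity):

```js
// api-a/handler.js – a simplified sketch of the edge API, built with the
// apiHandler factory; the API_B_URL env var and module paths are illustrative
const apiHandler = require('../lib/apiHandler');
const http = require('../lib/http');
const log = require('../lib/log');

module.exports.handler = apiHandler((event, context, callback) => {
  log.info('received request at api-a');

  // the custom http module (sketched below) forwards the captured
  // correlation IDs as HTTP headers; the SNS message and Kinesis event
  // mentioned later in the post are omitted here for brevity
  http.get(process.env.API_B_URL)
    .then(() => callback(null, { statusCode: 200, body: 'ok' }))
    .catch(err => {
      log.error('failed to call api-b', { error: err.message });
      callback(err);
    });
});
```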

Since this is the API at the edge, it initialises the x-correlation-id using the AWS Request ID for its invocation. This, along with several other pieces of contextual information, is recorded with every log message.

By adding a custom HTTP module like this one, you can also make it easy to include this contextual information in outgoing HTTP requests. Encapsulating these conventions in an easy-to-use library also helps you standardise the approach across your team.
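
A minimal sketch of such an http module, using Node’s core https client to keep it dependency-free:

```js
// http.js – a minimal sketch of an HTTP client wrapper that forwards the
// captured correlation IDs (plus User-Agent and the debug decision) as
// headers on every outgoing request; uses Node's core https module
const https = require('https');
const requestContext = require('./requestContext');

function get(url) {
  // copy the captured context into the outgoing headers
  const headers = Object.assign({}, requestContext.get());

  return new Promise((resolve, reject) => {
    https
      .get(url, { headers }, res => {
        let body = '';
        res.on('data', chunk => (body += chunk));
        res.on('end', () => resolve({ statusCode: res.statusCode, body }));
      })
      .on('error', reject);
  });
}

module.exports = { get };
```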

In the api-a function above, we made a HTTP request to the api-b endpoint. Looking in the logs, you can see the aforementioned contextual information has been passed along.

In this case, we also have the User-Agent from the original user-initiated request to api-a. This is useful because when I look at the logs for intermediate services, I often miss the context of what platform the user is using, which makes it harder to correlate the information I gather from the logs with the symptoms the user describes in their bug reports.

When the api-b function (see here) makes its own outbound HTTP request to api-c it’ll pass along all of this contextual information, plus anything we add in the api-b function itself.

Log message for when api-b calls api-c with the custom HTTP module. Notice it includes the “x-correlation-character-b” header which is set by the api-b function.

When you see the corresponding log message in api-c’s logs, you’ll see all the context from both api-a and api-b.

SNS

To capture and forward correlation IDs through SNS messages, you can use message attributes.

In the api-a function above, we also published a message to SNS (omitted from the code snippet above) with a custom sns module which includes the captured correlation IDs as message attributes, see below.
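
A minimal sketch of what such an sns module might look like:

```js
// sns.js – a minimal sketch of a wrapper around SNS publish that attaches
// the captured correlation IDs as message attributes
const AWS = require('aws-sdk');
const sns = new AWS.SNS();
const requestContext = require('./requestContext');

function correlationIdsAsAttributes() {
  const context = requestContext.get();
  const attributes = {};
  for (const key in context) {
    attributes[key] = { DataType: 'String', StringValue: context[key] };
  }
  return attributes;
}

function publish(topicArn, message) {
  return sns
    .publish({
      TopicArn: topicArn,
      Message: message,
      MessageAttributes: correlationIdsAsAttributes()
    })
    .promise();
}

module.exports = { publish };
```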

When this SNS message is delivered to a Lambda function, you can see the correlation IDs in the MessageAttributes field of the SNS event.

Let’s create a snsHandler factory function to standardise the process of capturing incoming correlation IDs via SNS message attributes.
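
A rough sketch of what that snsHandler factory might look like:

```js
// snsHandler.js – a rough sketch of a factory function that restores the
// request context from SNS message attributes before calling your handler
const requestContext = require('./requestContext');

function captureCorrelationIds(messageAttributes) {
  const context = {};
  for (const key in messageAttributes) {
    if (
      key.toLowerCase().startsWith('x-correlation-') ||
      key === 'Debug-Log-Enabled' ||
      key === 'User-Agent'
    ) {
      // SNS events expose attributes as { Type, Value } pairs
      context[key] = messageAttributes[key].Value;
    }
  }
  requestContext.replaceAllWith(context);
}

module.exports = handlerCode => (event, context, callback) => {
  // an SNS-triggered invocation delivers a single record
  const attributes = event.Records[0].Sns.MessageAttributes || {};
  captureCorrelationIds(attributes);
  return handlerCode(event, context, callback);
};
```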

We can use this factory function to quickly create SNS handler functions. The log messages from these handler functions will have access to the captured correlation IDs. If you use the aforementioned custom http module to make outgoing HTTP requests then they’ll be included as HTTP headers automatically.

For instance, the following SNS handler function would capture incoming correlation IDs, include them in log messages, and pass them on when making a HTTP request to api-c (see architecture diagram).

Those correlation IDs (including the one added by the SNS handler function) are included as HTTP headers.

Kinesis Streams

Unfortunately, with Kinesis and DynamoDB Streams, there’s no way to tag additional information with the payload. Instead, in order to pass correlation IDs along, we’d have to modify the actual payload itself.

Let’s create a kinesis module for sending events to a Kinesis stream, so that we can insert a __context field to the payload to carry the correlation IDs.
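
A minimal sketch of such a kinesis module:

```js
// kinesis.js – a minimal sketch of a wrapper around Kinesis putRecord that
// smuggles the captured correlation IDs into the payload as __context
const AWS = require('aws-sdk');
const kinesis = new AWS.Kinesis();
const requestContext = require('./requestContext');

function putRecord(streamName, partitionKey, event) {
  // Kinesis has no equivalent of SNS message attributes, so the correlation
  // IDs have to travel inside the record data itself
  const payload = Object.assign({}, event, {
    __context: requestContext.get()
  });

  return kinesis
    .putRecord({
      StreamName: streamName,
      PartitionKey: partitionKey,
      Data: JSON.stringify(payload)
    })
    .promise();
}

module.exports = { putRecord };
```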

On the receiving end, we can take it out, use it to set the current requestContext, and delete this __context field before passing it on to the Kinesis handler function for processing. The sender and receiver functions won’t even notice we modified the payload.

Wait, there’s one more problem – our Lambda function will receive a batch of Kinesis records, each with its own context. How will we consolidate that?

The simplest way is to force the handler function to process records one at a time. That’s what we’ve done in the kinesisHandler factory function here.
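
A rough sketch of that kinesisHandler factory (this version assumes a Node.js runtime with async/await support):

```js
// kinesisHandler.js – a rough sketch of a factory function that processes
// Kinesis records one at a time, restoring the request context per record;
// assumes a Node.js runtime with async/await support
const requestContext = require('./requestContext');

module.exports = recordHandler => async (event, context) => {
  for (const record of event.Records) {
    // Kinesis record data is base64-encoded JSON
    const json = Buffer.from(record.kinesis.data, 'base64').toString('utf8');
    const payload = JSON.parse(json);

    // extract __context, use it to set the current request context, then
    // remove it so the handler only sees the original payload
    const correlationIds = payload.__context || {};
    delete payload.__context;
    requestContext.replaceAllWith(correlationIds);

    await recordHandler(payload, context);
  }
};
```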

The handler function (created with the kinesisHandler factory function) would process one record at a time, and won’t have to worry about managing the request context. All of its log messages would have the right correlation IDs, and outgoing HTTP requests, SNS messages and Kinesis events would also pass those correlation IDs along.

When api-c receives the invocation event, you can see the correlation IDs have been passed along via HTTP headers.

This approach is simple: developers working on Kinesis handler functions won’t have to worry about the implementation details of how correlation IDs are captured and passed along, and things “just work”.

However, it also removes the opportunity to optimize by processing all the records in a batch. Perhaps your handler function has to persist the events to a persistence store that’s better suited for storing large payloads rather than lots of small ones.

This simple approach is not the right fit for every situation; an alternative would be to leave the __context field on the Kinesis records and let the handler function deal with them as it sees fit. In that case you would also need to update the shared libraries – the log, http, sns and kinesis modules we have talked about so far – to give the caller the option to pass in a requestContext as an override.

This way, the handler function can process the Kinesis records in a batch. Where it needs to log or make a network call in the context of a specific record, it can extract and pass the request context along as need be.

The End

That’s it, folks. A blueprint for how to capture and forward correlation IDs through 3 of the most commonly used event sources for Lambda.

Here’s an annotated version of the architecture diagram earlier, showing the flow of data as they’re captured and forwarded from one invocation to another, through HTTP headers, message attributes and Kinesis record data.

You can find a deployable version of the code you have seen in this post in this repo. It’s intended for demo sessions in my O’Reilly course detailed below, so documentation is seriously lacking at the moment, but hopefully this post gives you a decent idea of how the project is held together.

Other event sources

There are plenty of event sources that we didn’t cover in this post.

It’s not possible to pass correlation IDs through every event source, as some do not originate from your system – eg. CloudWatch Events that are triggered by API calls made by an AWS service.

And it might be hard to pass correlation IDs through, say, DynamoDB Streams – the only way (that I can think of) for it to work is to include the correlation IDs as fields in the row (which might not be such a bad idea, but it does have cost implications).

Tips and tricks for logging and monitoring AWS Lambda functions

The common practice of using agents/daemons to buffer and batch send logs and metrics is no longer applicable in the world of serverless. Here are some tips to help you get the most out of your logging and monitoring infrastructure for your functions.

This is part 2 of a 3-part mini series on managing your AWS Lambda logs.

If you haven’t read part 1 yet, please give it a read now. We’ll be building on top of the basic infrastructure of shipping logs from CloudWatch Logs detailed in that post.

part 1 : centralise logging

part 3 : tracking correlation IDs

New paradigm, new problems

Much has changed with the serverless paradigm – it solves many of the old problems we faced and replaces them with some new problems that (I think) are easier to deal with.

Consequently, many of the old practices are no longer applicable – eg. using agents/daemons to buffer and batch send metrics and logs to monitoring and log aggregation services. However, even as we throw away these old practices for the new world of serverless, we are still after the same qualities that made our old tools “good”:

  1. able to collect a rich set of system and application metrics and logs
  2. publishing metrics and logs should not add user-facing latency (ie. they should be performed in the background)
  3. metrics and logs should appear in realtime (ie. within a few seconds)
  4. metrics should be granular

Unfortunately, the current tooling for Lambda – CloudWatch metrics & CloudWatch Logs – is failing on a few of these, some more so than others:

  • publishing custom metrics requires additional network calls that need to be made during the function’s execution, adding to user-facing latency
  • CloudWatch metrics for AWS services are only granular down to 1 minute interval (custom metrics can be granular down to 1 second)
  • CloudWatch metrics are often a few minutes behind (though custom metrics might have less lag now that they can be recorded at 1 second interval)
  • CloudWatch Logs are usually more than 10s behind (not precise measurement, but based on personal observation)

With Lambda, we have to rely on AWS to improve CloudWatch in order to bring us parity with existing “server-ful” services.

Many vendors have announced support for Lambda, such as Datadog and Wavefront. However, as they are using the same metrics from CloudWatch they will have the same lag.

IOPipe is a popular alternative for monitoring Lambda functions and they do things slightly differently – by giving you a wrapper function around your code so they can inject monitoring code (it’s a familiar pattern to those who have used AOP frameworks in the past).

For their 1.0 release they also announced support for tracing (see the demo video below), which I think is interesting, as AWS already offers X-Ray and it’s a more complete tracing solution (despite its own shortcomings, as I mentioned in this post).

IOPipe seems like a viable alternative to CloudWatch, especially if you’re new to AWS Lambda and just want to get started quickly. I can totally see the value of that simplicity.

However, I have some serious reservations with IOPipe’s approach:

  • A wrapper around every one of my functions? This level of pervasive access to my entire application requires a serious amount of trust that has to be earned, especially in times like this.
  • CloudWatch collects logs and metrics asynchronously without adding to my function’s execution time. But with IOPipe they have to send the metrics to their own system, and they have to do so during my function’s execution time and hence adding to user-facing latency (for APIs).
  • Further to the above points, it’s another thing that can cause my function to error or time out even after my code has successfully executed. Perhaps they’re doing something smart to minimise that risk but it’s hard for me to know for sure and I have to anticipate failures.

Of all the above, the latency overhead is the biggest concern for me. With API Gateway and Lambda I already have to deal with cold starts and the latency of the API Gateway to Lambda hop. As your microservice architecture expands and the no. of inter-service communications grows, these latencies will compound further.

For background tasks this is less of a concern, but a sizeable portion of the Lambda functions I have written have to handle HTTP requests and I need to keep the execution time as low as possible for these functions.

Sending custom metrics asynchronously

I find Datadog’s approach for sending custom metrics very interesting. Essentially you write custom metrics as specially-formatted log messages that Datadog will process (you have to set up IAM permissions for CloudWatch to call their function) and track them as metrics.

Datadog allows you to send custom metrics using log messages in their DogStatsD format.

It’s a simple and elegant approach, and one that we can adopt for ourselves even if we decide to use another monitoring service.

In part 1 we established an infrastructure to ship logs from CloudWatch Logs to a log aggregation service of our choice. We can extend the log shipping function to look for log messages that look like these:

Log custom metrics as specially formatted log messages

For these log messages, we will interpret them as:

MONITORING|metric_value|metric_unit|metric_name|metric_namespace

And instead of sending them to the log aggregation service, we’ll send them as metrics to our monitoring service instead. In this particular case, I’m using CloudWatch in my demo (see link below), so the format of the log message reflects the fields I need to pass along in the PutMetricData call.
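
Here’s a rough sketch of how the log-shipping function might handle these messages (the helper names are illustrative):

```js
// a sketch of how the log-shipping function might pick out these special
// log messages and publish them to CloudWatch instead of forwarding them
// to the log aggregation service; helper names are illustrative
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

// MONITORING|metric_value|metric_unit|metric_name|metric_namespace
function tryParseMetric(logMessage) {
  if (!logMessage.startsWith('MONITORING|')) {
    return null; // not a custom metric, ship it as a normal log message
  }

  const [, value, unit, name, namespace] = logMessage.trim().split('|');
  return {
    Namespace: namespace,
    MetricData: [
      {
        MetricName: name,
        Unit: unit, // must be a unit CloudWatch accepts, eg. Count
        Value: parseFloat(value),
        Timestamp: new Date()
      }
    ]
  };
}

function publishMetric(logMessage) {
  const metric = tryParseMetric(logMessage);
  return metric
    ? cloudwatch.putMetricData(metric).promise()
    : Promise.resolve();
}

module.exports = { tryParseMetric, publishMetric };
```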

To send custom metrics, we write them as log messages. Again, there is no latency overhead as the Lambda service collects these for us and sends them to CloudWatch in the background.
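
For example, recording a count metric is just one log statement (the metric name and namespace here are made up):

```js
// inside any Lambda function – no extra network call, just a log message;
// the metric name and namespace below are made up for illustration
console.log('MONITORING|1|Count|request-processed|my-demo-namespace');
```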

And moments later they’re available in CloudWatch metrics.

Custom metrics are recorded in CloudWatch as expected.

Take a look at the custom-metrics function in this repo.

Tracking the memory usage and billed duration of your AWS Lambda functions in CloudWatch

Lambda reports the amount of memory used, and the billed duration at the end of every invocation. Whilst these are not published as metrics in CloudWatch, you can find them as log messages in CloudWatch Logs.

At the end of every invocation, Lambda publishes a REPORT log message detailing the max amount of memory used by your function during this invocation, and how much time is billed (Lambda charges at 100ms blocks).

I rarely find memory usage to be an issue as Node.js functions have such a small footprint. My choice of memory allocation is primarily based on getting the right balance between cost and performance. In fact, Alex Casalboni of CloudAcademy wrote a very nice blog post on using Step Functions to help you find that sweet spot.

The Billed Duration, on the other hand, is a useful metric when viewed side by side with Invocation Duration. It gives me a rough idea of the amount of wastage I have. For example, if the average Invocation Duration of a function is 42ms but the average Billed Duration is 100ms, then there is 58% wastage and maybe I should consider running the function on a lower memory allocation.

Interestingly, IOPipe records these in their dashboard out of the box.

IOPipe records a number of additional metrics that are not available in CloudWatch, such as Memory Usage and CPU Usage over time, as well as cold starts.

However, we don’t need to add IOPipe just to get these metrics. We can apply a similar technique to the previous section and publish them as custom metrics to our monitoring service.

To do that, we have to look out for these REPORT log messages and parse the relevant information out of them. Each message contains 3 pieces of information we want to extract:

  • Billed Duration (Milliseconds)
  • Memory Size (MB)
  • Memory Used (MB)

We will parse these log messages and return an array of CloudWatch metric data for each, so we can flat map over them afterwards.

This is a function in the “parse” module, which maps a log message to an array of CloudWatch metric data.
Flat map over the CloudWatch metric data returned by the above parse.usageMetrics function and publish them.
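
Here’s a rough sketch of what that parse function might look like (the regular expressions follow the standard REPORT format; the metric and dimension names are illustrative):

```js
// parse.js (sketch) – maps a REPORT log message to CloudWatch metric data;
// the regexes follow the standard REPORT format, metric names are illustrative
function usageMetrics(functionName, logMessage) {
  if (!logMessage.startsWith('REPORT RequestId:')) {
    return [];
  }

  const extract = regex => {
    const match = logMessage.match(regex);
    return match ? parseFloat(match[1]) : null;
  };

  const billedDuration = extract(/Billed Duration: ([0-9.]+) ms/);
  const memorySize = extract(/Memory Size: ([0-9.]+) MB/);
  const memoryUsed = extract(/Max Memory Used: ([0-9.]+) MB/);

  const dimensions = [{ Name: 'FunctionName', Value: functionName }];
  return [
    { MetricName: 'BilledDuration', Unit: 'Milliseconds', Value: billedDuration, Dimensions: dimensions },
    { MetricName: 'MemorySize', Unit: 'Megabytes', Value: memorySize, Dimensions: dimensions },
    { MetricName: 'MemoryUsed', Unit: 'Megabytes', Value: memoryUsed, Dimensions: dimensions }
  ];
}

module.exports = { usageMetrics };
```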

And sure enough, after subscribing the log group for an API (created in the same demo project to test this) and invoking the API, I’m able to see these new metrics show up in CloudWatch metrics.

Looking at the graph, maybe I can reduce my cost by running it on a much smaller memory size.

Take a look at the usage-metrics function in this repo.

Mind the concurrency!

When processing CloudWatch Logs with Lambda functions, you need to be mindful of the no. of concurrent executions it creates so as not to run afoul of the concurrent execution limit.

Since this is an account-wide limit, it means your log-shipping function can cause cascade failures throughout your entire application. Critical functions can be throttled because too many executions are used to push logs out of CloudWatch Logs – not a good way to go down ;-)

What we need is a more fine-grained throttling mechanism for Lambda. It’s fine to have an account-wide limit, but we should be able to create pools of functions that each get a slice of that limit. For example, tier-1 functions (those serving the core business needs) get 90% of the available concurrent executions, whilst tier-2 functions (BI, monitoring, etc.) get the other 10%.

As things stand, we don’t have that, and the best you can do is to keep the execution of your log-shipping function brief. Maybe that means fire-and-forget when sending logs and metrics; or send the decoded log messages into a Kinesis stream where you have more control over parallelism.

Or, maybe you’ll monitor the execution count of these tier-2 functions and when the no. of executions/minute breaches some threshold you’ll temporarily unsubscribe log groups from the log-shipping function to alleviate the problem.

Or, maybe you’ll install some bulkheads by moving these tier-2 functions into a separate AWS account and use cross-account invocation to trigger them. But this seems a really heavy-handed way to work around the problem!

Point is, it’s not a solved problem and I haven’t come across a satisfying workaround yet. AWS is aware of this gap and hopefully they’ll add support for better control over concurrent executions.

Centralised logging for AWS Lambda

CloudWatch Logs is hardly the ideal fit for all your logging needs; fortunately, you can easily stream the logs to your preferred log aggregation service with AWS Lambda functions.

This is the first of a 3-part mini series on managing your AWS Lambda logs. In part 1 we will look at how you can get all of your logs off CloudWatch.

Part 2 will help you better understand the tradeoffs with different approaches to logging & monitoring, with some helpful tips and tricks that I have come across.

Part 3 will demonstrate how to capture and forward correlation IDs through various event sources – eg. API Gateway, SNS and Kinesis.

part 2 : tips and tricks

part 3 : tracking correlation IDs

AWS Lambda logging basics

During the execution of a Lambda function, whatever you write to stdout (eg. using console.log in Node.js) will be captured by Lambda and sent to CloudWatch Logs asynchronously in the background, without adding any overhead to your function execution time.

You can find all the logs for your Lambda functions in CloudWatch Logs, organised into log groups (one log group per function) and then log streams (one log stream per container instance).

You could, of course, send these logs to CloudWatch Logs yourself via the PutLogEvents operation, or send them to your preferred log aggregation service such as Splunk or Elasticsearch. But remember that everything has to be done during a function’s invocation. If you’re making additional network calls during the invocation then you’ll pay for that additional execution time, and your users will have to wait that much longer for the API to respond.

So, don’t do that!

Instead, process the logs from CloudWatch Logs after the fact.

Streaming CloudWatch Logs

In the CloudWatch Logs console, you can select a log group (one for each Lambda function) and choose to stream the data directly to Amazon’s hosted Elasticsearch service.

This is very useful if you’re using the hosted Elasticsearch service already. But if you’re still evaluating your options, then give this post a read before you decide on the AWS-hosted Elasticsearch.

As you can see from the screenshot above, you can also choose to stream the logs to a Lambda function instead. In fact, when you create a new function from the Lambda console, there are a number of blueprints for pushing CloudWatch Logs to other log aggregation services already.

Clearly this is something a lot of AWS’s customers have asked for.

You can find blueprints for shipping CloudWatch Logs to Sumologic, Splunk and Loggly out of the box.

So that’s great, now you can use these blueprints to help you write a Lambda function that’ll ship CloudWatch Logs to your preferred log aggregation service. But here are a few things to keep in mind.

Auto-subscribe new log groups

Whenever you create a new Lambda function, it’ll create a new log group in CloudWatch logs. You want to avoid a manual process for subscribing log groups to your ship-logs function above.

Instead, enable CloudTrail, and then set up an event pattern in CloudWatch Events to invoke another Lambda function whenever a log group is created.

You can do this one-off setup in the CloudWatch console manually.

Match the CreateLogGroup API call in CloudWatch Logs and trigger a subscribe-log-group Lambda function to subscribe the newly created log group to the ship-logs function you created earlier.
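
A rough sketch of what that subscribe-log-group function might look like (the environment variable names are illustrative assumptions):

```js
// subscribe-log-group (sketch) – reacts to the CreateLogGroup event and
// subscribes the new log group to the ship-logs function; the two
// environment variable names are illustrative
const AWS = require('aws-sdk');
const cloudWatchLogs = new AWS.CloudWatchLogs();

module.exports.handler = async event => {
  // the log group name comes from the CloudTrail event delivered via
  // CloudWatch Events
  const logGroupName = event.detail.requestParameters.logGroupName;

  // never subscribe the ship-logs function's own log group to itself,
  // or you'll create an infinite invocation loop
  if (logGroupName === process.env.SHIP_LOGS_LOG_GROUP) {
    return;
  }

  await cloudWatchLogs
    .putSubscriptionFilter({
      logGroupName,
      filterName: 'ship-logs',
      filterPattern: '', // an empty pattern matches every log event
      destinationArn: process.env.SHIP_LOGS_FUNCTION_ARN
    })
    .promise();
};
```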

If you’re working with multiple AWS accounts, then you should avoid making the setup a manual process. With the Serverless framework, you can set up the event source for this subscribe-log-group function in the serverless.yml file.

Another thing to keep in mind is that you need to avoid subscribing the log group for the ship-logs function to itself – it’ll create an infinite invocation loop and that’s a painful lesson that you want to avoid.

Auto-setting the log retention policy

By default, when Lambda creates a new log group for your function the retention policy is to keep them forever. Understandably this is overkill and the cost of storing all these logs can add up over time. 

By default, logs for your Lambda functions are kept forever

Fortunately, using the same technique above we can add another Lambda function to automatically update the retention policy to something more reasonable.

Here’s a Lambda function for auto-updating the log retention policy to 30 days.
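
A minimal sketch, triggered by the same CreateLogGroup event pattern as before:

```js
// set-retention (sketch) – triggered by the same CreateLogGroup event
// pattern, sets the retention policy of the new log group to 30 days
const AWS = require('aws-sdk');
const cloudWatchLogs = new AWS.CloudWatchLogs();

module.exports.handler = async event => {
  const logGroupName = event.detail.requestParameters.logGroupName;

  await cloudWatchLogs
    .putRetentionPolicy({
      logGroupName,
      retentionInDays: 30 // must be one of the values CloudWatch Logs accepts
    })
    .promise();
};
```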

Taking care of existing log groups

If you already have lots of existing log groups, then consider wrapping the demo code (below) for auto-subscribing log groups and auto-updating log retention policy into a one-off script to update them all.

You can do this by recursing through all log groups with the DescribeLogGroups API call, and then invoke the corresponding functions for each log group.

You can find example code in this repo.

Write recursive AWS Lambda functions the right way

You may not realise that you can write AWS Lambda functions in a recursive manner to perform long-running tasks. Here’s two tips to help you do it right.

AWS Lambda limits the maximum execution time of a single invocation to 5 minutes. Whilst this limit might be raised in the future, it’s likely that you’ll still have to consider timeouts for any long-running tasks. For this reason, I personally think it’s a good thing that the current limit is too low for many long-running tasks – it forces you to consider edge cases early and avoid the trap of thinking “it should be long enough to do X” without considering possible failure modes.

Instead, you should write Lambda functions that perform long-running tasks as recursive functions – eg. processing a large S3 file.

Here’s 2 tips to help you do it right.

use context.getRemainingTimeInMillis()

When your function is invoked, the context object allows you to find out how much time is left in the current invocation.

Suppose you have an expensive task that can be broken into small tasks that can be processed in batches. At the end of each batch, use context.getRemainingTimeInMillis() to check if there’s still enough time to keep processing. Otherwise, recurse and pass along the current position so the next invocation can continue from where it left off.
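
A minimal sketch of this pattern, where the function invokes itself asynchronously and passes the current position along when time runs low (the threshold and the batch-processing stub are illustrative):

```js
// a sketch of a recursive Lambda function; the 60-second threshold, the
// batch count and the processBatch stub are all illustrative
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

const TOTAL_BATCHES = 1000; // hypothetical amount of work
const processBatch = async position => position + 1; // stand-in for real work

module.exports.handler = async (event, context) => {
  let position = event.position || 0;

  while (position < TOTAL_BATCHES) {
    // if we're running low on time, recurse instead of risking a timeout
    if (context.getRemainingTimeInMillis() < 60000) {
      await lambda
        .invoke({
          FunctionName: context.functionName,
          InvocationType: 'Event', // async – don't wait for the next leg
          Payload: JSON.stringify({ position })
        })
        .promise();
      return;
    }

    position = await processBatch(position);
  }
};
```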

use local state for optimization

Whilst Lambda functions are ephemeral by design, containers are still reused for optimization, which means you can leverage in-memory state that persists between invocations.

You should use this opportunity to avoid loading the same data on each recursion – eg. you could be processing a large S3 file and it’s more efficient (and cheaper) to cache the content of the S3 file.

I notice that AWS has also updated their Lambda best practices page to advise you to take advantage of container reuse.

However, as Lambda can recycle the container between recursions, it’s possible for you to lose the cached state from one invocation to another. Therefore, you shouldn’t assume the cached state to always be available during a recursion, and always check if there’s cached state first.

Also, when dealing with S3 objects, you need to protect yourself against content changes – ie. the S3 object is replaced, but the container instance is still reused so the cached data is still available. When you call S3’s GetObject operation, you should set the optional If-None-Match parameter with the ETag of the cached data.

Here’s how you can apply this technique.
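
A rough sketch of what that caching might look like:

```js
// a sketch of caching an S3 object across recursions; the module-level
// variable survives container reuse, and If-None-Match lets S3 tell us
// whether our cached copy is still current
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

let cached = null; // { etag, body } – may or may not survive a recursion

async function getObject(bucket, key) {
  const params = { Bucket: bucket, Key: key };
  if (cached) {
    params.IfNoneMatch = cached.etag; // only download if the object changed
  }

  try {
    const resp = await s3.getObject(params).promise();
    cached = { etag: resp.ETag, body: resp.Body };
  } catch (err) {
    // a 304 Not Modified surfaces as an error in the SDK – it simply means
    // the cached copy is still valid
    if (err.statusCode !== 304) {
      throw err;
    }
  }

  return cached.body;
}

module.exports = { getObject };
```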

Have a look at this example Lambda function that recursively processes a S3 file, using the approach outlined in this post.

Many-faced threats to Serverless security

Threats to the security of our serverless applications take many forms, some are the same old foes we have faced before; some are new; and some have taken on new forms in the serverless world.

As we adopt the serverless paradigm for building cloud-hosted applications, we delegate even more of the operational responsibilities to our cloud providers. When you build a serverless architecture around AWS Lambda you no longer have to configure AMIs, patch the OS, and install daemons to collect and distribute logs and metrics for your application. AWS takes care of all of that for you.

What does this mean for the Shared Responsibility Model that has long been the cornerstone of security in the AWS cloud?

Protection from attacks against the OS

AWS takes over the responsibility for maintaining the host OS as part of their core competency, alleviating you from the rigorous task of applying all the latest security patches – something most of us just don’t do a good enough job of, as it’s not our primary focus.

In doing so, it protects us from attacks against known vulnerabilities in the OS and prevents attacks such as WannaCry.

Also, by removing long lived servers from the picture we also removed the threats posed by compromised servers that live in our environment for a long time.

WannaCry happened because the MS17-010 security patch was not applied to the affected hosts.

However, it is still our responsibility to patch our application and address vulnerabilities that exist in our code and our dependencies.

OWASP top 10 is still as relevant as ever

Aside from a few reclassifications the OWASP top 10 list has largely stayed the same in 7 years.

A glance at the OWASP top 10 list for 2017 shows us some familiar threats – Injection, Cross-Site Scripting, Sensitive Data Exposure, and so on.

A9 – Components with Known Vulnerabilities

When the folks at Snyk looked at a dataset of 1792 data breaches in 2016 they found that 12 of the top 50 data breaches were caused by applications using components with known vulnerabilities.

Furthermore, 77% of the top 5000 URLs from Alexa include at least one vulnerable library. This is less surprising than it first sounds when you consider that some of the most popular front-end js frameworks – eg. jquery, Angular and React – all had known vulnerabilities. It highlights the need to continuously update and patch your dependencies.

However, unlike OS patches which are standalone, trusted and easy to apply, security updates to these 3rd party dependencies are usually bundled with feature and API changes that need to be integrated and tested. It makes our life as developers difficult and it’s yet another thing we gotta do when we’re working overtime to ship new features.

And then there’s the matter of transitive dependencies, and boy there are so many of them… If these transitive dependencies have vulnerabilities then you too are vulnerable through your direct dependencies.

https://david-dm.org/request/request?view=tree

Finding vulnerabilities in our dependencies is hard work and requires constant diligence, which is why a service such as Snyk is so useful. It even comes with a built-in integration with Lambda too!

Attacks against NPM publishers

What if the author/publisher of your 3rd party dependency is not who you think they are?

Just a few weeks ago, a security bounty hunter posted this amazing thread on how he managed to gain direct push rights to 14% of NPM packages. The list of affected packages includes some big names too: debug, request, react, co, express, moment, gulp, mongoose, mysql, bower, browserify, electron, jasmine, cheerio, modernizr, redux and many more. In total, these packages account for 20% of the total number of monthly downloads from NPM.

Let that sink in for a moment.

Did he use highly sophisticated methods to circumvent NPM’s security?

Nope, it was a combination of brute force and using known account & credential leaks from a number of sources including Github. In other words, anyone could have pulled these off with very little research.

It’s hard not to feel let down by these package authors when so many display such a cavalier attitude towards securing access to their NPM accounts. I feel my trust in these 3rd party dependencies has been betrayed.

662 users had password «123456», 174 – «123», 124 – «password».

1409 users (1%) used their username as their password, in its original form, without any modifications.

11% of users reused their leaked passwords: 10.6% – directly, and 0.7% – with very minor modifications.

As I demonstrated in my recent talk on Serverless security, one can easily steal temporary AWS credentials from affected Lambda functions (or EC2-hosted Node.js applications) with a few lines of code.

Imagine then, a scenario where an attacker had managed to gain push rights to 14% of all NPM packages. He could publish a patch update to all these packages and steal AWS credentials at a massive scale.

The stakes are high and it’s quite possibly the biggest security threat we face in the serverless world; and it’s equally threatening to applications hosted in containers or VMs.

The problems and risks with package management are not specific to the Node.js ecosystem. I have spent most of my career working with .Net technologies and am now working with Scala at Space Ape Games, and package management has been a challenge everywhere. Whether you’re talking about Nuget, Maven, or whatever package repository, you’re always at risk if the authors of your dependencies do not exercise the same due diligence to secure their accounts as they would their own applications.

Or, perhaps they do…

A1 – Injection & A3 – XSS

SQL injection and other forms of injection attacks are still possible in the serverless world, as are cross-site scripting attacks.

Even if you’re using NoSQL databases you might not be exempt from injection attacks either. MongoDB for instance, exposes a number of attack vectors through its query APIs.

Arguably DynamoDB’s API makes it hard (at least I haven’t heard of a way yet) for an attacker to orchestrate an injection attack, but you’re still open to other forms of exploit – eg. XSS, and leaked credentials which grant the attacker access to DynamoDB tables.

Nonetheless, you should always sanitize user inputs, as well as the output from your Lambda functions.

A6 – Sensitive Data Exposure

Along with servers, web frameworks also disappeared when one migrates to the serverless paradigm. These web frameworks have served us well for many years, but they also handed us a loaded gun we can shoot ourselves in the foot with.

As Troy Hunt demonstrated at a recent talk at the LDNUG, we can accidentally expose all kinds of sensitive data by leaving directory listing options ON – from web.config files containing credentials (at 35:28) to SQL backup files (at 1:17:28)!

With API Gateway and Lambda, accidental exposures like this become very unlikely – directory listing is a “feature” you’d have to implement yourself. It forces you to make very conscious decisions about when to support directory listing and the answer is probably never.

IAM

If your Lambda functions are compromised, then the next line of defence is to restrict what these compromised functions can do.

This is why you need to apply the Least Privilege Principle when configuring Lambda permissions.

In the Serverless framework, the default behaviour is to use the same IAM role for all functions in the service.

However, the serverless.yml spec allows you to specify a different IAM role per function. Although as you can see from the examples it involves a lot more development effort and (from my experience) adds enough friction that almost no one does this…

Apply per-function IAM policies.

IAM policy not versioned with Lambda

A shortcoming with the current Lambda + IAM configuration is that IAM policies are not versioned along with the Lambda function.

In the scenario where you have multiple versions of the same function in active use (perhaps with different aliases), then it becomes problematic to add or remove permissions:

  • adding a new permission to a new version of the function allows old versions of the function additional access that they don’t require (and poses a vulnerability)
  • removing an existing permission from a new version of the function can break old versions of the function that still require that permission

Since the 1.0 release of the Serverless framework this has become less of a problem, as it no longer uses aliases for stages – instead, each stage is deployed as a separate function, eg.

  • service-function-dev
  • service-function-staging
  • service-function-prod

which means it’s far less likely that you’ll need to have multiple versions of the same function in active use.

I also found (from personal experience) account level isolation can help mitigate the problems of adding/removing permissions, and crucially, the isolation also helps compartmentalise security breaches – eg. a compromised function running in a non-production account cannot be used to cause harm in the production account and impact your users.

We can apply the same idea of bulkheads (which has been popularised in the microservices world by Michael Nygard’s “Release It”) and compartmentalise security breaches at an account level.

Delete unused functions

One of the benefits of the serverless paradigm is that you don’t pay for functions when they’re not used.

The flip side of this property is that you have less need to remove unused functions since they don’t show up on your bill. However, these functions still exist as attack surface, even more so than actively used functions because they’re less likely to be updated and patched. Over time, these unused functions can become a hotbed for components with known vulnerabilities that attackers can exploit.

Lambda’s documentation also cites this as one of the best practices.

Delete old Lambda functions that you are no longer using.

The changing face of DoS attacks

With AWS Lambda you are far more likely to scale your way out of a DoS attack. However, scaling your serverless architecture aggressively to fight a DoS attack with brute force has a significant cost implication.

No wonder people started calling DoS attacks against serverless applications Denial of Wallet (DoW) attacks!

“But you can just throttle the no. of concurrent invocations, right?”

Sure, and you end up with a DoS problem instead… it’s a lose-lose situation.

AWS recently introduced AWS Shield but at the time of writing the payment protection (only if you pay a monthly flat fee for AWS Shield Advanced) does not cover Lambda costs incurred during a DoS attack.

For a monthly flat fee, AWS Shield Advanced gives you cost protection in the event of a DoS attack, but that protection does not cover Lambda yet.

Also, Lambda has an at-least-once invocation policy. According to the folks at SunGard, this can result in up to 3 (successful) invocations. From the article, the reported rate of multiple invocations is extremely low – 0.02% – but one wonders if the rate is tied to the load and might manifest itself at a much higher rate during a DoS attack.

Taken from the “Run, Lambda, Run” article mentioned above.

Furthermore, you need to consider how Lambda retries failed invocations by an asynchronous source – eg. S3, SNS, SES, CloudWatch Events, etc. Officially, these invocations are retried twice before they’re sent to the assigned DLQ (if any).

However, an analysis by the OpsGenie guys showed that the no. of retries is not carved in stone and can go up to as many as 6 before the invocation is sent to the DLQ.

If the DoS attacker is able to trigger failed async invocations (perhaps by uploading files to S3 that will cause your function to throw an exception when attempting to process them) then they can significantly magnify the impact of their attack.

All these add up to the potential for the actual no. of Lambda invocations to explode during a DoS attack. As we discussed earlier, whilst your infrastructure might be able to handle the attack, can your wallet stretch to the same extent? Should you allow it to?

Securing external data

Just a handful of the places you could be storing state outside of your stateless Lambda function.

Due to the ephemeral nature of Lambda functions, chances are all of your functions are stateless. More than ever, state is stored in external systems and we need to secure it both at rest and in transit.

Communication with AWS services is via HTTPS and every request needs to be signed and authenticated. A handful of AWS services also offer server-side encryption for your data at rest – S3, RDS and Kinesis streams spring to mind – and Lambda has built-in integration with KMS to encrypt your functions’ environment variables.

The same diligence needs to be applied when storing sensitive data in services/DBs that do not offer built-in encryption – eg. DynamoDB, Elasticsearch, etc. – and ensure they’re protected at rest. In the case of a data breach, it provides another layer of protection for our users’ data.

We owe our users that much.

Use secure transport when transmitting data to and from services (both external and internal ones). If you’re building APIs with API Gateway and Lambda then you’re forced to use HTTPS by default, which is a good thing. However, API Gateway is always publicly accessible and you need to take the necessary precautions to secure access to internal APIs.

You can use API keys but I think it’s better to use IAM roles. It gives you fine grained control over who can invoke which actions on which resources. Also, using IAM roles spares you from awkward conversations like this:

“It’s X’s last day, he probably has our API keys on his laptop somewhere, should we rotate the API keys just in case?”

“mm.. that’d be a lot of work, X is trustworthy, he’s not gonna do anything.”

“ok… if you say so… (secretly pray X doesn’t lose his laptop or develop a belated grudge against the company)”

Fortunately, both can be easily configured using the Serverless framework.

Leaked credentials

Don’t become an unwilling bitcoin miner.

The internet is full of horror stories of developers racking up a massive AWS bill after their leaked credentials are used by cyber-criminals to mine bitcoins. For every such story many more would have been affected but chose to stay silent (for the same reason many security breaches are not announced publicly: big companies do not want to appear incompetent).

Even within my small social circle (*sobs) I have heard 2 such incidents, neither were made public and both resulted in over $100k worth of damages. Fortunately, in both cases AWS agreed to cover the cost.

I know for a fact that AWS actively scans public Github repos for active AWS credentials and tries to alert you as soon as possible. But as some of the above stories mentioned, even if your credentials were leaked only for a brief window of time they will not escape the watchful gaze of attackers (plus, they still exist in the git commit history unless you rewrite that too, so it’s best to deactivate the credentials if possible).

A good approach to prevent AWS credential leaks is to use git pre-commit hooks as outlined by this post.

Conclusions

We looked at a number of security threats to our serverless applications in this post; many of them are the same threats that have plagued the software industry for years. All of the OWASP top 10 still apply to us, including SQL, NoSQL and other forms of injection attacks.

Leaked AWS credentials remain a major issue and can potentially impact any organisation that uses AWS. Whilst there are quite a few publicly reported incidents, I have a strong feeling that the actual no. of incidents is much, much higher.

We are still responsible for securing our users’ data both at rest and in-transit. API Gateway is always publicly accessible, so we need to take the necessary precautions to secure access to our internal APIs, preferably with IAM roles. IAM offers fine grained control over who can invoke which actions on your API resources, and makes it easy to manage access when employees come and go.

On a positive note, having AWS take over the responsibility for the security of the host OS gives us a no. of security benefits:

  • protection against OS attacks because AWS can do a much better job of patching known vulnerabilities in the OS
  • the host OS is ephemeral, which means no long-lived compromised servers
  • it’s much harder to accidentally leave sensitive data exposed by forgetting to turn off directory listing options in a web framework – because you no longer need such web frameworks!

DoS attacks have taken a new form in the serverless world. Whilst you’re probably able to scale your way out of an attack, it’ll still hurt you in the wallet, hence why DoS attacks have become known in the serverless world as Denial of Wallet attacks. Lambda costs incurred during a DoS attack are not covered by AWS Shield Advanced at the time of writing, but hopefully they will be in the near future.

Meanwhile, some new attack surfaces have emerged with AWS Lambda:

  • functions are often given too much permission because they’re not given individual IAM policies tailored to their needs, a compromised function can therefore do more harm than it might otherwise
  • unused functions are often left around for a long time as there is no cost associated with them, but attackers might be able to exploit them especially if they’re not actively maintained and therefore are likely to contain known vulnerabilities

Above all, the most worrisome threat for me is attacks against the package authors themselves. It has been shown that many authors do not take the security of their accounts seriously, and in doing so they endanger themselves as well as the rest of the community that depends on them. It’s impossible to guard against such attacks, and they erode one of the strongest aspects of any software ecosystem – the community behind it.

Once again, people have proven to be the weakest link in the security chain.