With AWS Lambda and the Serverless framework, deploying your code has become simple and frictionless.
As you move more and more of your architecture to run on Lambda, you might find that, in addition to getting things done faster, you are also deploying your code more frequently.
But as you rejoice in this newfound superpower to make your users and stakeholders happy, you need to keep an eye on the regional limit of 75 GB for all uploaded deployment packages.
At Yubl, a small team of six server engineers and I managed to rack up nearly 20 GB of deployment packages in three months.
We wrote all of our Lambda functions in Node.js, and the deployment packages were typically less than 2 MB each. But the frequency of our deployments made sure that the overall size of the deployment packages crept up steadily.
Now that I’m writing most of my Lambda functions in Scala (it’s the weapon of choice for the Space Ape Games server team), I’m dealing with deployment packages that are significantly bigger!
Serverless framework: disable versionFunctions
By default, the Serverless framework creates a new version of your function every time you deploy.
In Serverless 0.X this was (kinda) needed because it used function aliases to implement deployment stages. For example, I could have multiple deployment stages for the same function – dev, staging and production – but in the Lambda console there was only one function, and each stage was simply an alias pointing to a different version of it.
Unfortunately, this behaviour also made it difficult to manage IAM permissions, because multiple versions of the same function shared the same IAM role. Since you can't version the IAM role with the function, it was hard to add or remove permissions without breaking older versions.
Fortunately, the developers listened to the community, and since the 1.0 release each stage is deployed as a separate function.
Essentially, this allows you to "version" IAM roles with deployment stages, since each stage gets a separate IAM role. So there's technically no need to create a new version for every deployment anymore. But that is still the default behaviour, unless you explicitly disable it in your serverless.yml by setting versionFunctions to false.
You might argue that keeping old versions of the function in production makes it quicker to roll back.
In that case, enable versioning for the production stage only. Here's a handy trick: define a default value in your serverless.yml that can be overridden per deployment stage.
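One way to do it (a sketch, assuming the Serverless 1.x variable syntax and example stage names) is to keep the per-stage values under custom and reference them from provider:

```yml
custom:
  # per-stage settings: only production keeps function versioning on
  versionFunctions:
    dev: false
    staging: false
    production: true

provider:
  name: aws
  versionFunctions: ${self:custom.versionFunctions.${opt:stage}}
```

With this in place, sls deploy --stage production keeps the old versions around, whilst every other stage skips them.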
In my personal experience though, unless you have taken great care to use aliases to tag the production releases, it's actually quite hard to know which version correlates to what. Assuming you have reproducible builds, I would have much more confidence rolling back by deploying from a hotfix or support branch of our code.
Clean up old versions with janitor-lambda
If disabling versionFunctions in the serverless.yml for all of your projects is hard to enforce, another approach would be to retroactively delete old versions of functions that are no longer referenced by an alias.
To do that, you can create a cron job (ie. scheduled CloudWatch event + Lambda) that will scan through your functions and look for versions that are not referenced and delete them.
I took some inspiration from Netflix’s Janitor Monkey and created a Janitor Lambda function that you can deploy to your AWS environment to clean unused versions of your functions.
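The gist of it is simple: for each function, list its versions and aliases, then delete any version that isn't $LATEST and isn't referenced by an alias. Here's a minimal sketch of the idea (assumed code, not the actual janitor-lambda implementation; pagination and error handling are omitted for brevity):

```js
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

async function cleanUpOldVersions(functionName) {
  const { Versions } = await lambda
    .listVersionsByFunction({ FunctionName: functionName })
    .promise();
  const { Aliases } = await lambda
    .listAliases({ FunctionName: functionName })
    .promise();

  // keep $LATEST and any version an alias still points to
  const referenced = new Set(Aliases.map(alias => alias.FunctionVersion));

  for (const version of Versions) {
    if (version.Version !== '$LATEST' && !referenced.has(version.Version)) {
      await lambda
        .deleteFunction({ FunctionName: functionName, Qualifier: version.Version })
        .promise();
    }
  }
}

module.exports.handler = async () => {
  const { Functions } = await lambda.listFunctions().promise();
  for (const fn of Functions) {
    await cleanUpOldVersions(fn.FunctionName);
  }
};
```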
After we employed this Janitor Lambda function, our total deployment package size went from 20 GB to ~1 GB (we had a lot of functions…).
As our architecture expanded, we found several drawbacks with managing configurations with environment variables.
Hard to share configs across projects
The biggest problem for us was the inability to share configurations across projects since environment variables are function specific at runtime.
The Serverless framework has the notion of services, which is just a way of grouping related functions together. You can specify service-wide environment variables as well as function-specific ones.
However, we often found that configurations needed to be shared across multiple services. When these configurations changed, we had to update and redeploy all the functions that depended on them – and just tracking these dependencies across the many GitHub repos maintained by different members of the team was becoming a challenge in itself.
For example, as we were migrating from a monolithic system piece by piece whilst delivering new features, we weren’t able to move away from the monolithic MongoDB database in one go. It meant that lots of functions shared MongoDB connection strings. When one of these connection strings changed – and it did several times – pain and suffering followed.
Another configurable value we often shared was the root URL of intermediate services. Being a social network, many of our user-initiated operations depend on relationship data, so many of our microservices depend on the Relationship API. Instead of hardcoding the URL to the Relationship API in every service (one of the deadly microservice anti-patterns), it should be stored in a central configuration service.
Hard to implement fine-grained access control
When you need to configure sensitive data such as credentials, API keys or DB connection strings, the rules of thumb are:
data should be encrypted at rest (includes not checking them into source control in plain text)
data should be encrypted in-transit
apply the principle of least privilege to function’s and personnel’s access to data
If you're operating in a heavily regulated environment, then point 3 might be not just good practice but a regulatory requirement. I know of many fintech companies and financial juggernauts where access to production credentials is tightly controlled and available only to a handful of people in the company.
Whilst efforts such as the serverless-secrets-plugin deliver on point 1, they couple your ability to deploy Lambda functions with your access to sensitive data – i.e. whoever deploys the function must have access to the sensitive data too. This might be OK for many startups, where everyone has access to everything, but ideally your process for managing access to data should be able to evolve with the company's needs as it grows.
SSM Parameter Store
My team outgrew environment variables, and I started looking at other popular solutions in this space – etcd, consul, etc. But I really didn’t fancy these solutions because:
they're costly to run: you need to run several EC2 instances in a multi-AZ setting for HA
you have to manage these servers
they each have a learning curve with regards to both configuring the service as well as the CLI tools
we needed a fraction of the features they offer
This was 5 months before Amazon announced SSM Parameter Store at re:invent 2016, so at the time we built our own Configuration API with API Gateway and Lambda.
Nowadays, you should just use the SSM Parameter Store because:
it’s a fully managed service
sharing configurations is easy, as it’s a centralised service
it integrates with KMS out-of-the-box
it offers fine-grained control via IAM
it records a history of changes
you can use it via the console, AWS CLI as well as via its HTTPS API
Having a centralised place to store parameters is just one side of the coin. You should still invest effort into making a robust client library that is easy to use, and supports:
caching & cache expiration
hot-swapping configurations when source config value has changed
Here is one such client library that I put together for a demo:
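Here's a minimal sketch of what such a client might look like – I'm making up the module and parameter names, so treat it as an illustration rather than the exact demo code:

```js
// configClient.js - a sketch of an SSM-backed config client with caching and hot-swapping
const AWS = require('aws-sdk');
const ssm = new AWS.SSM();

const DEFAULT_EXPIRY_MS = 3 * 60 * 1000; // re-fetch from SSM every 3 minutes by default

function loadConfigs(keys, expiryMs = DEFAULT_EXPIRY_MS) {
  let cache = { expiration: 0, items: {} };

  const reload = async () => {
    const { Parameters } = await ssm
      .getParameters({ Names: keys, WithDecryption: true })
      .promise();

    cache = {
      expiration: Date.now() + expiryMs,
      items: Parameters.reduce(
        (acc, p) => Object.assign(acc, { [p.Name]: p.Value }),
        {}
      )
    };
  };

  // expose each key as a property whose getter returns a Promise of the latest value;
  // when the cache expires, the next access re-fetches and hot-swaps the values
  const config = {};
  for (const key of keys) {
    Object.defineProperty(config, key, {
      get: async () => {
        if (Date.now() > cache.expiration) {
          await reload();
        }
        return cache.items[key];
      }
    });
  }
  return config;
}

module.exports = { loadConfigs };
```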
To use it, you can create config objects with the loadConfigs function. These objects will expose properties that return the config values as Promises (hence the yield, which is the magic power we get with co).
You can have different config values with different cache expiration too.
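Usage might look something like this (again, with made-up parameter names):

```js
const co = require('co');
const { loadConfigs } = require('./configClient');

// cache the MongoDB connection string for 3 minutes, the feature flags for 30 seconds
const configs = loadConfigs(['/my-app/mongodb-connstring'], 3 * 60 * 1000);
const flags = loadConfigs(['/my-app/feature-flags'], 30 * 1000);

module.exports.handler = co.wrap(function* () {
  const connStr = yield configs['/my-app/mongodb-connstring'];
  const features = yield flags['/my-app/feature-flags'];

  // ... use the config values ...
  return {
    statusCode: 200,
    body: JSON.stringify({ features, hasConnStr: Boolean(connStr) })
  };
});
```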
If you want to play around with using SSM Parameter Store from Lambda (or to see this cache client in action), then check out this repo and deploy it to your AWS environment. I haven’t included any HTTP events, so you’d have to invoke the functions from the console.
Update 15/09/2017: the Serverless framework released 1.22.0, which introduces support for SSM parameters out of the box.
With this latest version of the Serverless framework, you can specify the value of environment variables to come from SSM Parameter Store directly.
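For example, something like this in your serverless.yml (the parameter name here is made up):

```yml
provider:
  environment:
    MONGODB_CONNSTRING: ${ssm:/my-app/mongodb-connstring}
```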
Compared to many of the existing approaches, it has some benefits:
avoid checking in sensitive data in plain text in source control
avoid duplicating the same config values in multiple services
However, it still falls short on many fronts (based on my own requirements):
since it’s fetching the SSM parameter values at deployment time, it still couples your ability to deploy your function with access to sensitive configuration data
the configuration values are stored in plain text as Lambda environment variables, which means you don't need KMS permissions to access them – you can see them in plain sight in the Lambda console
further to the above, if the function is compromised by an attacker (who would then have access to process.env), then they'll be able to easily find the decrypted values during their initial probe (go to the 13:05 mark in this video where I gave a demo of how easily this can be done)
because the values are baked in at deployment time, it doesn't allow you to easily propagate config value changes. To make a config value change, you will need to a) identify all dependent functions, and b) re-deploy all of them
Of course, your requirement might be very different from mine, and I certainly think it’s an improvement over many of the approaches I have seen. But, personally I still think you should:
fetch SSM parameter values at runtime
cache these values, and hot-swap when source values change
Compared to JSON – which is the bread and butter for APIs built with API Gateway and Lambda – binary formats such as Protocol Buffers can produce significantly smaller payloads.
At scale, they can make a big difference to your bandwidth cost.
In restricted environments such as low-end devices or in countries with poor mobile connections, sending smaller payloads can also improve your user experience by improving the end-to-end network latency, and possibly processing time on the device too.
Follow these 3 simple steps (assuming you’re using Serverless framework):
As you can see, it's just a bunch of randomly generated names, GUIDs and integers. The same response in Protocol Buffers is nearly 40% smaller.
Problem with the protobufjs package
Before we move on, there is one important detail about using the protobufjs package in a Lambda function – you need to npm install the package on a Linux system.
This is because it has a dependency that is distributed as native binaries, so if you installed the package on OSX then the binaries that are packaged and deployed to Lambda will not run in the Lambda execution environment.
I had similar problems with other Google libraries in the past. I find the best way to deal with this is to take a leaf out of aws-serverless-go-shim’s approach and deploy your code inside a Docker container.
This way, you would locally install a compatible version of the native binaries for your OS so you can continue to run and debug your function with sls invoke local (see this post for details).
But, during deployment, a script would run npm install --force in a Docker container running a compatible Linux distribution. This would then install a version of the native binaries that can be executed in the Lambda execution environment. The script would then use sls deploy to deploy the function.
The deployment script can be something simple like this:
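Something along these lines, assuming the script runs inside the Linux container (this is a sketch, not the exact script from the demo):

```bash
#!/bin/bash
# reinstall dependencies so that any native binaries match the Lambda execution environment
npm install --force

# then deploy with the Serverless framework
sls deploy --stage "${1:-dev}"
```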
In the demo project, I also have a docker-compose.yml file:
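Roughly like this, with assumed image and paths (not the exact file from the demo):

```yml
version: "2"
services:
  deploy:
    image: node:6.10            # roughly matches the Lambda Node.js runtime
    working_dir: /app
    volumes:
      - .:/app
      - $HOME/.aws:/root/.aws   # so the AWS SDK can find my credentials inside the container
    command: ./deploy.sh
```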
The Serverless framework requires my AWS credentials, hence I've mounted the $HOME/.aws directory into the container for the AWS SDK to find at runtime.
To deploy, run docker-compose up.
Use HTTP content negotiation
Whilst binary formats are more efficient when it comes to payload size, they do have one major problem: they’re really hard to debug.
Imagine the scenario – you have observed a bug, but you're not sure if the problem is in the client app or the server. But hey, let's just observe the HTTP conversation with an HTTP proxy such as Charles or Fiddler.
This workflow works great for JSON but breaks down when it comes to binary formats such as Protocol Buffers as the payloads are not human readable.
As we have discussed in this post, the human readability of JSON comes with the cost of heavier bandwidth usage. For most network communications, be it service-to-service, or service-to-client, unless a human is actively “reading” the payloads it’s not worth paying the cost. But when a human is trying to read it, that human readability is very valuable.
Fortunately, HTTP’s content negotiation mechanism means we can have the best of both worlds.
In the demo project, there is a contentNegotiated function which returns either JSON or Protocol Buffers payloads based on the Accept header.
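Here's a minimal sketch of what such a handler might look like – the schema file, message type and payload are all made up, and it assumes application/x-protobuf has been registered as a binary media type in API Gateway:

```js
const protobuf = require('protobufjs');

// load the (hypothetical) schema once, outside the handler
const schemaReady = protobuf.load('result.proto');

module.exports.handler = async (event) => {
  const payload = { id: 42, name: 'example' }; // stand-in for the real response object
  const headers = event.headers || {};
  const accept = headers['Accept'] || headers['accept'] || 'application/json';

  if (accept === 'application/x-protobuf') {
    const root = await schemaReady;
    const Result = root.lookupType('demo.Result'); // hypothetical message type
    const bytes = Result.encode(Result.create(payload)).finish();

    return {
      statusCode: 200,
      headers: { 'Content-Type': 'application/x-protobuf' },
      body: Buffer.from(bytes).toString('base64'),
      isBase64Encoded: true // API Gateway converts this back to binary for the caller
    };
  }

  return {
    statusCode: 200,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload)
  };
};
```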
By default, you should use Protocol Buffers for all your network communications to minimise bandwidth use.
But, you should build in a mechanism for toggling the communication to JSON when you need to observe the communications. This might mean:
for debug builds of your mobile app, allow super users (devs, QA, etc.) to turn on a debug mode which switches the networking layer to send the Accept header as application/json
for services, include a configuration option to turn on debug mode (see this post on configuring functions with SSM parameters and a cache client for hot-swapping) so that service-to-service calls use JSON too, letting you capture and analyse the requests and responses more easily
As usual, you can try out the demo code yourself, the repo is available here.
Serverless architectures are microservices by default, so you need correlation IDs to help debug issues that span multiple functions, and possibly different event source types – asynchronous, synchronous and streams.
This is the last of a 3-part mini series on managing your AWS Lambda logs.
If you haven’t read part 1 yet, please give it a read now. We’ll be building on top of the basic infrastructure of shipping logs from CloudWatch Logs detailed in that post.
As your architecture becomes more complex, many services have to work together in order to deliver the features your users want.
When everything works, it’s like watching an orchestra, lots of small pieces all acting independently whilst at the same time collaborating to form a whole that’s greater than the sum of its parts.
However, when things don't work, it's a pain in the ass to debug. Finding that one clue is like finding a needle in a haystack, as there are so many moving parts, and they're all constantly moving.
Imagine you're an engineer at Twitter trying to debug why a user's tweet was not delivered to one of their followers' timelines.
“Let me cross reference the logs from hundreds of services and find the logs that mention the author’s user ID, the tweet ID, or the recipient’s user ID, and put together a story of how the tweet flowed through our system and why it wasn’t delivered to the recipient’s timeline.”
“What about logs that don’t explicitly mention those fields?”
“mm… let me get back to you on that…”
Needle in the haystack.
This is the problem that correlation IDs solve in the microservice world – to tag every log message with the relevant context so that it’s easy to find them later on.
Aside from common IDs such as user ID, order ID, tweet ID, etc. you might also want to include the X-Ray trace ID in every log message. That way, if you’re using X-Ray with Lambda then you can use it to quickly load up the relevant trace in the X-Ray console.
Also, if you're going to add a bunch of correlation IDs to every log message, then you should consider switching to JSON as your log format. You'll then need to update the ship-logs function we introduced in part 1 to handle log messages that are formatted as JSON.
Enable debug logging on entire call chain
Another common problem people run into is that, by the time we realise there's a problem in production, the crucial piece of information we need to debug it is logged as DEBUG – and we disable DEBUG logs in production because they're too noisy.
“Darn it, now we have to enable debug logging and redeploy all these services! What a pain!”
“Don’t forget to disable debug logging and redeploy them, after you’ve found the problem ;-)”
Fortunately it doesn't have to be a catch-22 situation. You can enable DEBUG logging on the entire call chain by:
making the decision to enable DEBUG logging (for, say, 5% of all requests) at the edge service (see the sketch after this list)
passing that decision on to all outward requests alongside the correlation IDs
having the intermediate services capture this decision when they receive the request (possibly through async event sources such as SNS) and turn on DEBUG logging if asked to do so
having the intermediate services pass that decision on to all of their outward requests alongside the correlation IDs
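A tiny sketch of that sampling decision at the edge (an assumed approach with a made-up helper name):

```js
// decide, per request, whether DEBUG logging should be enabled along the whole call chain;
// the flag then travels with the other correlation IDs on every outward request
const DEBUG_SAMPLE_RATE = 0.05; // e.g. 5% of requests

function decideDebugLogging(correlationIds) {
  correlationIds['Debug-Log-Enabled'] =
    Math.random() < DEBUG_SAMPLE_RATE ? 'true' : 'false';
  return correlationIds;
}

module.exports = { decideDebugLogging };
```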
Capture and forward correlation IDs
With that out of the way, let’s dive into some code to see how you can actually make it work. If you want to follow along, then the code is available in this repo, and the architecture of the demo project looks like this:
We can take advantage of the fact that concurrency is now managed by the platform, which means we can safely use global variables to store contextual information relevant for the current invocation.
In the handler function we can capture incoming correlation IDs in global variables, and then include them in log messages, as well as any outgoing messages/HTTP requests/events, etc.
To abstract away the implementation details, let's create a requestContext module that makes it easy to fetch and update this contextual data:
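A minimal sketch of such a module might look like this (assumed code, not the exact demo implementation):

```js
// requestContext.js - Lambda manages concurrency for us, so module-level state is safe
// to use as per-invocation storage, provided we replace it at the start of every invocation
let context = {};

const replaceAllWith = (ctx) => { context = ctx; };

const set = (key, value) => {
  // prefix plain keys so they're easy to spot as correlation IDs
  if (!key.startsWith('x-correlation-')) {
    key = `x-correlation-${key}`;
  }
  context[key] = value;
};

const get = () => context;

module.exports = { replaceAllWith, set, get };
```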
And then add a log module which:
disables DEBUG logging by default
enables DEBUG logging if it's explicitly overridden via environment variables, or if a Debug-Log-Enabled field was captured in the incoming request alongside the other correlation IDs
logs messages as JSON
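A minimal sketch of such a log module, building on the requestContext module above (again, assumed code rather than the exact demo implementation):

```js
// log.js
const requestContext = require('./requestContext');

const LogLevels = { DEBUG: 0, INFO: 1, WARN: 2, ERROR: 3 };

// DEBUG is disabled by default; it can be turned on via the LOG_LEVEL environment
// variable, or per-request via the Debug-Log-Enabled correlation field
const isDebugEnabled = () => {
  if (process.env.LOG_LEVEL === 'DEBUG') return true;
  return requestContext.get()['Debug-Log-Enabled'] === 'true';
};

const threshold = () => (isDebugEnabled() ? LogLevels.DEBUG : LogLevels.INFO);

function log(levelName, message, params = {}) {
  if (LogLevels[levelName] < threshold()) return;

  // include the captured correlation IDs in every log message, as JSON
  const logMsg = Object.assign({}, requestContext.get(), params, {
    level: levelName,
    message
  });
  console.log(JSON.stringify(logMsg));
}

module.exports = {
  debug: (msg, params) => log('DEBUG', msg, params),
  info: (msg, params) => log('INFO', msg, params),
  warn: (msg, params) => log('WARN', msg, params),
  error: (msg, params) => log('ERROR', msg, params)
};
```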
Once we start capturing correlation IDs, our log messages would look something like this:
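Something along these lines (an illustrative example rather than actual output from the demo project):

```json
{
  "level": "INFO",
  "message": "querying user by id",
  "x-correlation-id": "e8ab4e56-1a2b-3c4d-5e6f-example",
  "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) ...",
  "Debug-Log-Enabled": "false"
}
```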
Notice that I have also captured the User-Agent from the incoming request, as well as the decision to not enable DEBUG logging.
Now let’s see how we can capture and forward correlation IDs through API Gateway and outgoing HTTP requests.
You can capture and pass along correlation IDs via HTTP headers. The trick is making sure that everyone in the team follows the same conventions.
To standardise these conventions (what to name headers that are correlation IDs, etc.) you can provide a factory function that your developers can use to create API handlers. Something like this perhaps:
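Here's a minimal sketch of what such a factory might look like (the module and field names are assumptions, not the exact demo code):

```js
// factory.js
const requestContext = require('./requestContext');
const log = require('./log');

function apiHandler(f) {
  return async (event, context) => {
    const headers = event.headers || {};

    // capture incoming correlation IDs, plus a few other useful bits of context
    const correlationIds = {};
    for (const header of Object.keys(headers)) {
      if (header.toLowerCase().startsWith('x-correlation-')) {
        correlationIds[header] = headers[header];
      }
    }
    correlationIds['x-correlation-id'] =
      correlationIds['x-correlation-id'] || context.awsRequestId;
    if (headers['User-Agent']) {
      correlationIds['User-Agent'] = headers['User-Agent'];
    }
    if (headers['Debug-Log-Enabled']) {
      correlationIds['Debug-Log-Enabled'] = headers['Debug-Log-Enabled'];
    }

    requestContext.replaceAllWith(correlationIds);

    try {
      return await f(event, context);
    } catch (err) {
      log.error('unhandled error', { errorMessage: err.message });
      return { statusCode: 500 };
    }
  };
}

module.exports = { apiHandler };
```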
When you need to implement another HTTP endpoint, pass your handler code to this factory function. Now, with minimal change, all your logs will have the captured correlation IDs (as well as User-Agent, whether to enable debug logging, etc.).
The api-a function in our earlier architecture looks something like this:
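Roughly like this – a sketch with made-up module paths and a made-up environment variable for api-b's URL:

```js
const { apiHandler } = require('./factory');
const http = require('./http'); // the custom HTTP module discussed below
const log = require('./log');

module.exports.handler = apiHandler(async () => {
  log.debug('handling request to api-a');

  // the correlation IDs captured by the factory are forwarded as HTTP headers
  await http.get(process.env.API_B_URL); // hypothetical env var holding api-b's URL

  return { statusCode: 200, body: JSON.stringify({ message: 'hello from api-a' }) };
});
```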
Since this is the API on the edge, it initialises the x-correlation-id using the AWS Request ID for its invocation. This, along with several other pieces of contextual information, is recorded with every log message.
By adding a custom HTTP module like this one, you can also make it easy to include this contextual information in outgoing HTTP requests. Encapsulating these conventions in an easy-to-use library also helps you standardise the approach across your team.
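A minimal sketch of such an HTTP module (assumed code; it only supports GET requests to keep the example short):

```js
// http.js
const https = require('https');
const { URL } = require('url');
const requestContext = require('./requestContext');

// makes a GET request and forwards the captured correlation IDs as HTTP headers
const get = (url) =>
  new Promise((resolve, reject) => {
    const { hostname, pathname, search } = new URL(url);
    const options = {
      hostname,
      path: `${pathname}${search}`,
      headers: Object.assign({}, requestContext.get())
    };

    const req = https.get(options, (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve({ statusCode: res.statusCode, body }));
    });
    req.on('error', reject);
  });

module.exports = { get };
```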
In the api-a function above, we made an HTTP request to the api-b endpoint. Looking in the logs, you can see that the aforementioned contextual information has been passed along.
In this case, we also have the User-Agent from the original user-initiated request to api-a. This is useful because when I look at the logs for intermediate services, I often miss the context of what platform the user is using which makes it harder to correlate the information I gather from the logs to the symptoms the user describes in their bug reports.
When the api-b function (see here) makes its own outbound HTTP request to api-c, it'll pass along all of this contextual information plus anything we add in the api-b function itself.
When you see the corresponding log message in api-c’s logs, you’ll see all the context from both api-a and api-b.
To capture and forward correlation IDs through SNS messages, you can use message attributes.
In the api-a function above, we also published a message to SNS (omitted from the code snippet above) with a custom sns module which includes the captured correlation IDs as message attributes, as shown below.
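A minimal sketch of such an sns module (assumed code, not the exact demo implementation):

```js
// sns.js
const AWS = require('aws-sdk');
const sns = new AWS.SNS();
const requestContext = require('./requestContext');

// publishes a message and attaches the captured correlation IDs as message attributes
const publish = (topicArn, message) => {
  const correlationIds = requestContext.get();

  const messageAttributes = {};
  for (const key of Object.keys(correlationIds)) {
    messageAttributes[key] = {
      DataType: 'String',
      StringValue: `${correlationIds[key]}`
    };
  }

  return sns
    .publish({
      TopicArn: topicArn,
      Message: JSON.stringify(message),
      MessageAttributes: messageAttributes
    })
    .promise();
};

module.exports = { publish };
```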
When this SNS message is delivered to a Lambda function, you can see the correlation IDs in the MessageAttributes field of the SNS event.
Let’s create a snsHandler factory function to standardise the process of capturing incoming correlation IDs via SNS message attributes.
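A minimal sketch of that factory (assumed names and structure, not the exact demo code):

```js
// in factory.js, alongside apiHandler
const requestContext = require('./requestContext');
const log = require('./log');

function snsHandler(f) {
  return async (event) => {
    // SNS invokes Lambda with a single record per event
    const record = event.Records[0].Sns;

    // restore the correlation IDs from the SNS message attributes
    const attributes = record.MessageAttributes || {};
    const correlationIds = {};
    for (const key of Object.keys(attributes)) {
      correlationIds[key] = attributes[key].Value;
    }
    requestContext.replaceAllWith(correlationIds);

    try {
      await f(JSON.parse(record.Message));
    } catch (err) {
      log.error('unhandled error', { errorMessage: err.message });
      throw err;
    }
  };
}

module.exports = { snsHandler };
```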
We can use this factory function to quickly create SNS handler functions. The log messages from these handler functions will have access to the captured correlation IDs. If you use the aforementioned custom http module to make outgoing HTTP requests then they'll be included as HTTP headers automatically.
For instance, the following SNS handler function would capture incoming correlation IDs, include them in log messages, and pass them on when making a HTTP request to api-c (see architecture diagram).
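For example, something like this (made-up module paths and a made-up environment variable for api-c's URL):

```js
const { snsHandler } = require('./factory');
const http = require('./http');
const log = require('./log');

module.exports.handler = snsHandler(async (message) => {
  log.debug('handling SNS message', { message });

  // correlation IDs captured from the SNS message attributes travel on as HTTP headers
  await http.get(process.env.API_C_URL); // hypothetical env var holding api-c's URL
});
```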
Unfortunately, with Kinesis and DynamoDB Streams, there’s no way to tag additional information with the payload. Instead, in order to pass correlation IDs along, we’d have to modify the actual payload itself.
Let’s create a kinesis module for sending events to a Kinesis stream, so that we can insert a __context field to the payload to carry the correlation IDs.
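A minimal sketch of such a kinesis module (assumed code, not the exact demo implementation):

```js
// kinesis.js
const AWS = require('aws-sdk');
const kinesis = new AWS.Kinesis();
const requestContext = require('./requestContext');

// embeds the captured correlation IDs into the payload itself, under a __context field
const putRecord = (streamName, partitionKey, event) => {
  const payload = Object.assign({}, event, { __context: requestContext.get() });

  return kinesis
    .putRecord({
      StreamName: streamName,
      PartitionKey: partitionKey,
      Data: JSON.stringify(payload)
    })
    .promise();
};

module.exports = { putRecord };
```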
On the receiving end, we can take it out, use it to set the current requestContext, and delete this __context field before passing it on to the Kinesis handler function for processing. The sender and receiver functions won’t even notice we modified the payload.
Wait, there’s one more problem – our Lambda function will receive a batch of Kinesis records, each with its own context. How will we consolidate that?
The simplest way is to force the handler function to process records one at a time. That’s what we’ve done in the kinesisHandler factory function here.
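A minimal sketch of that factory (assumed code, not the exact demo implementation):

```js
// in factory.js, alongside apiHandler and snsHandler
const requestContext = require('./requestContext');

function kinesisHandler(f) {
  return async (event) => {
    // process the records one at a time so each runs with its own request context
    for (const record of event.Records) {
      const json = Buffer.from(record.kinesis.data, 'base64').toString('utf8');
      const payload = JSON.parse(json);

      // take the __context out of the payload and use it as the current request context
      const context = payload.__context || {};
      delete payload.__context;
      requestContext.replaceAllWith(context);

      await f(payload);
    }
  };
}

module.exports = { kinesisHandler };
```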
The handler function (created with the kinesisHandler factory function) would process one record at a time, and won't have to worry about managing the request context. All of its log messages would have the right correlation IDs, and outgoing HTTP requests, SNS messages and Kinesis events would also pass those correlation IDs along.
This approach is simple: developers working on Kinesis handler functions won't have to worry about the implementation details of how correlation IDs are captured and passed along, and things "just work".
However, it also removes the opportunity to optimize by processing all the records in a batch. Perhaps your handler function has to persist the events to a persistence store that’s better suited for storing large payloads rather than lots of small ones.
This simple approach is not the right fit for every situation. An alternative would be to leave the __context field on the Kinesis records and let the handler function deal with them as it sees fit, in which case you would also need to update the shared libraries – the log, http, sns and kinesis modules we have talked about so far – to give the caller the option to pass in a requestContext as an override.
This way, the handler function can process the Kinesis records in a batch. Where it needs to log or make a network call in the context of a specific record, it can extract and pass the request context along as need be.
That’s it, folks. A blueprint for how to capture and forward correlation IDs through 3 of the most commonly used event sources for Lambda.
Here's an annotated version of the architecture diagram earlier, showing the flow of data as they're captured and forwarded from one invocation to another, through HTTP headers, message attributes and Kinesis record data.
You can find a deployable version of the code you have seen in this post in this repo. It’s intended for demo sessions in my O’Reilly course detailed below, so documentation is seriously lacking at the moment, but hopefully this post gives you a decent idea of how the project is held together.
Other event sources
There are plenty of event sources that we didn’t cover in this post.
It's not possible to pass correlation IDs through every event source, as some do not originate from your system – e.g. CloudWatch Events that are triggered by API calls made by an AWS service.
And it might be hard to pass correlation IDs through, say, DynamoDB Streams – the only way (that I can think of) for it to work is to include the correlation IDs as fields in the row (which might not be such a bad idea, but it does have cost implications).
The common practice of using agents/daemons to buffer and batch send logs and metrics is no longer applicable in the world of serverless. Here are some tips to help you get the most out of your logging and monitoring infrastructure for your functions.
This is part 2 of a 3-part mini series on managing your AWS Lambda logs.
If you haven’t read part 1 yet, please give it a read now. We’ll be building on top of the basic infrastructure of shipping logs from CloudWatch Logs detailed in that post.
Much has changed with the serverless paradigm. It solves many of the old problems we faced and replaces them with some new problems that (I think) are easier to deal with.
Consequently, many of the old practices are no longer applicable – eg. using agents/daemons to buffer and batch send metrics and logs to monitoring and log aggregation services. However, even as we throw away these old practices for the new world of serverless, we are still after the same qualities that made our old tools “good”:
able to collect a rich set of system and application metrics and logs
publishing metrics and logs should not add user-facing latency (ie. they should be performed in the background)
metrics and logs should appear in realtime (ie. within a few seconds)
metrics should be granular
Unfortunately, the current tooling for Lambda – CloudWatch metrics & CloudWatch Logs – is failing on a few of these, some more so than others:
publishing custom metrics requires additional network calls that need to be made during the function’s execution, adding to user-facing latency
CloudWatch metrics for AWS services are only granular down to 1 minute interval (custom metrics can be granular down to 1 second)
CloudWatch metrics are often a few minutes behind (though custom metrics might have less lag now that they can be recorded at 1 second interval)
CloudWatch Logs are usually more than 10s behind (not a precise measurement, just based on personal observation)
With Lambda, we have to rely on AWS to improve CloudWatch in order to bring us parity with existing “server-ful” services.
Many vendors have announced support for Lambda, such as Datadog and Wavefront. However, as they are using the same metrics from CloudWatch they will have the same lag.
IOPipe is a popular alternative for monitoring Lambda functions and they do things slightly differently – by giving you a wrapper function around your code so they can inject monitoring code (it’s a familiar pattern to those who have used AOP frameworks in the past).
For their 1.0 release they also announced support for tracing (see the demo video below), which I find interesting as AWS already offers X-Ray, which is a more complete tracing solution (despite its own shortcomings, as I mentioned in this post).
IOPipe seems like a viable alternative to CloudWatch, especially if you’re new to AWS Lambda and just want to get started quickly. I can totally see the value of that simplicity.
However, I have some serious reservations with IOPipe’s approach:
A wrapper around every one of my functions? This level of pervasive access to my entire application requires a serious amount of trust that has to be earned, especially in times like this.
CloudWatch collects logs and metrics asynchronously without adding to my function’s execution time. But with IOPipe they have to send the metrics to their own system, and they have to do so during my function’s execution time and hence adding to user-facing latency (for APIs).
Further to the above points, it’s another thing that can cause my function to error or time out even after my code has successfully executed. Perhaps they’re doing something smart to minimise that risk but it’s hard for me to know for sure and I have to anticipate failures.
Of all the above, the latency overhead is the biggest concern for me. Between API Gateway and Lambda I already have to deal with cold start and the latency between API Gateway and Lambda. As your microservice architecture expands and the no. of inter-service communications grows, these latencies will compound further.
For background tasks this is less a concern, but a sizeable portion of Lambda functions I have written have to handle HTTP requests and I need to keep the execution time as low as possible for these functions.
Sending custom metrics asynchronously
I find Datadog’s approach for sending custom metrics very interesting. Essentially you write custom metrics as specially-formatted log messages that Datadog will process (you have to set up IAM permissions for CloudWatch to call their function) and track them as metrics.
It’s a simple and elegant approach, and one that we can adopt for ourselves even if we decide to use another monitoring service.
In part 1 we established an infrastructure to ship logs from CloudWatch Logs to a log aggregation service of our choice. We can extend the log shipping function to look for log messages that look like these:
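As a sketch of the idea, here's a made-up pipe-delimited format (the demo may use a different one), along with how the log-shipping function might turn matching lines into PutMetricData parameters:

```js
// in your function: write the metric as a specially-formatted log line (no extra network call)
const recordMetric = (name, value, unit = 'Count', namespace = 'my-app') =>
  console.log(`MONITORING|${value}|${unit}|${name}|${namespace}`);

recordMetric('api-latency', 42, 'Milliseconds');

// in the log-shipping function: turn matching log lines into PutMetricData parameters
function parseCustomMetric(logMessage) {
  if (!logMessage.startsWith('MONITORING|')) {
    return null; // not a custom metric, ship it to the log aggregation service as usual
  }

  const [, value, unit, name, namespace] = logMessage.trim().split('|');
  return {
    Namespace: namespace,
    MetricData: [{ MetricName: name, Unit: unit, Value: parseFloat(value) }]
  };
}

module.exports = { recordMetric, parseCustomMetric };
```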
Instead of forwarding these messages to the log aggregation service, we'll interpret them as custom metrics and send them to our monitoring service instead. In this particular case, I'm using CloudWatch in my demo (see link below), so the format of the log message reflects the fields I need to pass along in the PutMetricData call.
To send custom metrics, we write them as log messages. Again, no latency overhead as Lambda service collects these for us and sends them to CloudWatch in the background.
And moments later they’re available in CloudWatch metrics.
Take a look at the custom-metrics function in this repo.
Tracking the memory usage and billed duration of your AWS Lambda functions in CloudWatch
Lambda reports the amount of memory used, and the billed duration at the end of every invocation. Whilst these are not published as metrics in CloudWatch, you can find them as log messages in CloudWatch Logs.
I rarely find memory usage to be an issue as Node.js functions have such a small footprint. My choice of memory allocation is primarily based on getting the right balance between cost and performance. In fact, Alex Casalboni of CloudAcademy wrote a very nice blog post on using Step Functions to help you find that sweet spot.
The Billed Duration, on the other hand, is a useful metric when viewed side by side with Invocation Duration. It gives me a rough idea of the amount of wastage I have. For example, if the average Invocation Duration of a function is 42ms but the average Billed Duration is 100ms, then there is 58% wastage and maybe I should consider running the function on a lower memory allocation.
Interestingly, IOPipe records these in their dashboard out of the box.
However, we don’t need to add IOPipe just to get these metrics. We can apply a similar technique to the previous section and publish them as custom metrics to our monitoring service.
To do that, we have to look out for these REPORT log messages and parse the relevant information out of them. Each message contains 3 pieces of information we want to extract:
Billed Duration (Milliseconds)
Memory Size (MB)
Memory Used (MB)
We will parse these log messages and return an array of CloudWatch metric data for each, so we can flat map over them afterwards.
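A sketch of what that parsing might look like (assumed code; the regexes target the standard REPORT line format):

```js
// turn a Lambda REPORT log line into an array of CloudWatch metric data, e.g.
// "REPORT RequestId: ... Duration: 42.51 ms Billed Duration: 100 ms Memory Size: 128 MB Max Memory Used: 35 MB"
function parseReportLine(functionName, line) {
  const billedDuration = parseFloat(line.match(/Billed Duration: ([0-9.]+) ms/)[1]);
  const memorySize = parseFloat(line.match(/Memory Size: ([0-9.]+) MB/)[1]);
  const memoryUsed = parseFloat(line.match(/Max Memory Used: ([0-9.]+) MB/)[1]);

  const dimensions = [{ Name: 'FunctionName', Value: functionName }];
  return [
    { MetricName: 'BilledDuration', Unit: 'Milliseconds', Value: billedDuration, Dimensions: dimensions },
    { MetricName: 'MemorySize', Unit: 'Megabytes', Value: memorySize, Dimensions: dimensions },
    { MetricName: 'MemoryUsed', Unit: 'Megabytes', Value: memoryUsed, Dimensions: dimensions }
  ];
}

module.exports = { parseReportLine };
```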
And sure enough, after subscribing the log group for an API (created in the same demo project to test this) and invoking the API, I’m able to see these new metrics show up in CloudWatch metrics.
Looking at the graph, maybe I can reduce my cost by running it on a much smaller memory size.
Take a look at the usage-metrics function in this repo.
Mind the concurrency!
When processing CloudWatch Logs with Lambda functions, you need to be mindful of the number of concurrent executions it creates, so as not to run afoul of the concurrent execution limit.
Since this is an account-wide limit, it means your log-shipping function can cause cascade failures throughout your entire application. Critical functions can be throttled because too many executions are used to push logs out of CloudWatch Logs – not a good way to go down ;-)
What we need is a more fine-grained throttling mechanism for Lambda. It's fine to have an account-wide limit, but we should be able to create pools of functions that each get a slice of that limit. For example, tier-1 functions (those serving the core business needs) get 90% of the available concurrent executions, whilst tier-2 functions (BI, monitoring, etc.) get the other 10%.
As things stand, we don’t have that, and the best you can do is to keep the execution of your log-shipping function brief. Maybe that means fire-and-forget when sending logs and metrics; or send the decoded log messages into a Kinesis stream where you have more control over parallelism.
Or, maybe you’ll monitor the execution count of these tier-2 functions and when the no. of executions/minute breaches some threshold you’ll temporarily unsubscribe log groups from the log-shipping function to alleviate the problem.
Or, maybe you'll install some bulkheads by moving these tier-2 functions into a separate AWS account and use cross-account invocation to trigger them. But this seems a really heavy-handed way to work around the problem!
Point is, it’s not a solved problem and I haven’t come across a satisfying workaround yet. AWS is aware of this gap and hopefully they’ll add support for better control over concurrent executions.