Yubl’s road to Serverless architecture – Part 2 – Testing and CI/CD

Note: see here for the rest of the series.

 

Having spoken to quite a few people about using AWS Lambda in production, testing and CI/CD are always high up the list of questions, so I’d like to use this post to discuss the approaches that we took at Yubl.

Please keep in mind that this is a recollection of what we did, and why we chose to do things that way. I have heard others advocate very different approaches, and I’m sure they too have their reasons and their approaches no doubt work well for them. I hope to give you as much context (or, the “why”) as I can so you can judge whether or not our approach would likely work for you, and feel free to ask questions in the comments section.

 

Testing

In Growing Object-Oriented Software, Guided by Tests, Nat Pryce and Steve Freeman talked about the 3 levels of testing [Chapter 1]:

  1. Acceptance – does the whole system work?
  2. Integration – does our code work against code we can’t change?
  3. Unit – do our objects do the right thing, are they easy to work with?

As you move up the level (acceptance -> unit) the speed of the feedback loop becomes faster, but you also have less confidence that your system will work correctly when deployed.

Favour Acceptance and Integration Tests

With the FAAS paradigm, there are more “code we can’t change” than ever (AWS even describes Lambda as the “glue for your cloud infrastructure”) so the value of integration and acceptance tests are also higher than ever. Also, as the “code we can’t change” are easily accessible as service, it also makes these tests far easier to orchestrate and write than before.

The functions we tend to write were fairly simple and didn’t have complicated logic (most of the time), but there were a lot of them, and they were loosely connected through messaging systems (Kinesis, SNS, etc.) and APIs. The ROI for acceptance and integration tests are therefore far greater than unit tests.

It’s for these reason that we decided (early on in our journey) to focus our efforts on writing acceptance and integration tests, and only write unit tests where the internal workings of a Lambda function is sufficiently complex.

No Mocks

In Growing Object-Oriented Software, Guided by TestsNat Pryce and Steve Freeman also talked about why you shouldn’t mock types that you can’t change [Chapter 8], because…

…We find that tests that mock external libraries often need to be complex to get the code into the right state for the functionality we need to exercise.

The mess in such tests is telling us that the design isn’t right but, instead of fixing the problem by improving the code, we have to carry the extra complexity in both code and test…

…The second risk is that we have to be sure that the behaviour we stub or mock matches what the external library will actually do…

Even if we get it right once, we have to make sure that the tests remain valid when we upgrade the libraries…

I believe the same principles apply here, and that you shouldn’t mock services that you can’t change.

Integration Tests

Lambda function is ultimately a piece of code that AWS invokes on your behalf when some input event occurs. To test that it integrates correctly with downstream systems you can invoke the function from your chosen test framework (we used Mocha).

Since the purpose is to test the integration points, so it’s important to configure the function to use the same downstream systems as the real, deployed code. If your function needs to read from/write to a DynamoDB table then your integration test should be using the real table as opposed to something like dynamodb-local.

It does mean that your tests can leave artefacts in your integration environment and can cause problems when running multiple tests in parallel (eg. the artefacts from one test affect results of other tests). Which is why, as a rule-of-thumb, I advocate:

  • avoid hard-coded IDs, they often cause unintentional coupling between tests
  • always clean up artefacts at the end of each test

The same applies to acceptance tests.

Acceptance Tests

(this picture is slightly misleading in that the Mocha tests are not invoking the Lambda function programmatically, but rather invoking it indirectly via whatever input event the Lambda function is configured with – API Gateway, SNS, Kinesis, etc. More on this later.)

…Wherever possible, an acceptance test should exercise the system end-to-end without directly calling its internal code.

An end-to-end test interacts with the system only from the outside: through its interface…

…We prefer to have the end-to-end tests exercise both the system and the process by which it’s built and deployed

This sounds like a lot of effort (it is), but has to be done anyway repeatedly during the software’s lifetime…

– Growing Object-Oriented Software, Guided by Tests [Chapter 1]

Once the integration tests complete successfully, we have good confidence that our code will work correctly when it’s deployed. The code is deployed, and the acceptance tests are run against the deployed system end-to-end.

Take our Search API for instance, one of the acceptance criteria is “when a new user joins, he should be searchable by first name/last name/username”.

The acceptance test first sets up the test condition – a new user joins – by interacting with the system from the outside and calling the legacy API like the client app would. From here, a new-user-joined event will be fired into Kinesis; a Lambda function would process the event and add a new document in the User index in CloudSearch; the test would validate that the user is searchable via the Search API.

Avoid Brittle Tests

Because a new user is added to CloudSearch asynchronously via a background process, it introduces eventual consistency to the system. This is a common challenge when you decouple features through events/messages. When testing these eventually consistent systems, you should avoid waiting fixed time periods (see protip 5 below) as it makes your tests brittle.

In the “new user joins” test case, this means you shouldn’t write tests that:

  1. create new user
  2. wait 3 seconds
  3. validate user is searchable

and instead, write something along the lines of:

  1. create new user
  2. validate user is searchable with retries
    1. if expectation fails, then wait X seconds before retrying
    2. repeat
    3. allow Y retries before failing the test case

Sharing test cases for Integration and Acceptance Testing

We also found that, most of the time the only difference between our integration and acceptance tests is how our function code is invoked. Instead of duplicating a lot of code and effort, we used a simple technique to allow us to share the test cases.

Suppose you have a test case such as the one below.

The interesting bit is on line 22:

let res = yield when.we_invoke_get_all_keys(region);

In the when module, the function we_invoke_get_all_keys will either

  • invoke the function code directly with a stubbed context object, or
  • perform a HTTP GET request against the deployed API

depending on the value of process.env.TEST_MODE, which is an environment variable that is passed into the test via package.json (see below) or the bash script we use for deployment (more on this shortly).

 

Continuous Integration + Continuous Delivery

Whilst we had around 170 Lambda functions running production, many of them work together to provide different features to the app. Our approach was to group these functions such that:

  • functions that form the endpoints of an API are grouped in a project
  • background processing functions for a feature are grouped in a project
  • each project has its own repo
  • functions in a project are tested and deployed together

The rationale for this grouping strategy is to:

  • achieve high cohesion for related functions
  • improve code sharing where it makes sense (endpoints of an API are likely to share some logic since they operate within the same domain)

Although functions are grouped into projects, they can still be deployed individually. We chose to deploy them as a unit because:

  • it’s simple, and all related functions (in a project) have the same version no.
  • it’s difficult to detect if a change to shared code will impact which functions
  • deployment is fast, it makes little difference speed-wise whether we’re deploy one function or five functions

 

For example, in the Yubl app, you have a feed of posts from people you follow (similar to your Twitter timeline).

To implement this feature there was an API (with multiple endpoints) as well as a bunch of background processing functions (connected to Kinesis streams and SNS topics).

The API has two endpoints, but they also share a common custom auth function, which is included as part of this project (and deployed together with the get and get-yubl functions).

The background processing (initially only Kinesis but later expanded to include SNS as well, though the repo wasn’t renamed) functions have many shared code, such as the distribute module you see below, as well as a number of modules in the lib folder.

All of these functions are deployed together as a unit.

Deployment Automation

We used the Serverless framework to do all of our deployments, and it took care of packaging, uploading and versioning our Lambda functions and APIs. It’s super useful and took care of most of the problem for us, but we still needed a thin layer around it to allow AWS profile to be passed in and to include testing as part of the deployment process.

We could have scripted these steps on the CI server, but I have been burnt a few times by magic scripts that only exist on the CI server (and not in source control). To that end, every project has a simple build.sh script (like the one below) which gives you a common vocabulary to:

  • run unit/integration/acceptance tests
  • deploy your code

Our Jenkins build configs do very little and just invoke this script with different params.

Continuous Delivery

To this day I’m still confused by Continuous “Delivery” vs Continuous “Deployment”. There seems to be several interpretations, but this is the one that I have heard the most often:

Regardless of which definition is correct, what was most important to us was the ability to deploy our changes to production quickly and frequently.

Whilst there were no technical reasons why we couldn’t deploy to production automatically, we didn’t do that because:

  • it gives QA team opportunity to do thorough tests using actual client apps
  • it gives the management team a sense of control over what is being released and when (I’m not saying if this is a good or bad thing, but merely what we wanted)

In our setup, there were two AWS accounts:

  • production
  • non-prod, which has 4 environments – dev, test, staging, demo

(dev for development, test for QA team, staging is a production-like, and demo for private beta builds for investors, etc.)

In most cases, when a change is pushed to Bitbucket, all the Lambda functions in that project are automatically tested, deployed and promoted all the way through to the staging environment. The deployment to production is a manual process that can happen at our convenience and we generally avoid deploying to production on Friday afternoon (for obvious reasons ).

 

Conclusions

The approaches we have talked about worked pretty well for our team, but it was not without drawbacks.

In terms of development flow, the focus on integration and acceptance tests meant slower feedback loops and the tests take longer to execute. Also, because we don’t mock downstream services it means we couldn’t run tests without internet connection, which is an occasional annoyance when you want to work during commute.

These were explicit tradeoffs we made, and I stand by them even now and AFAIK everyone in the team feels the same way.

 

In terms of deployment, I really missed the ability to do canary releases. Although this is offset by the fact that our user base was still relatively small and the speed with which one can deploy and rollback changes with Lambda functions was sufficient to limit the impact of a bad change.

Whilst AWS Lambda and API Gateway doesn’t support canary releases out-of-the-box it is possible to do a DIY solution for APIs using weighted routing in Route53. Essentially you’ll have:

  • a canary stage for API Gateway and associated Lambda function
  • deploy production builds to the canary stage first
  • use weighted routing in Route53 to direct X% traffic to the canary stage
  • monitor metrics, and when you’re happy with the canary build, promote it to production

Again, this would only work for APIs and not for background processing (SNS, Kinesis, S3, etc.).

 

So that’s it folks, hope you’ve enjoyed this post, feel free to leave a comment if you have any follow up questions or tell me what else you’d like to hear about in part 3.

Ciao!

 

Links

Yubl’s road to Serverless architecture – Part 1

Note: see here for the rest of the series.

 

Since Yubl’s closure quite a few people have asked about the serverless architecture we ended up with and some of the things we have learnt along the way.

As such, this is the first of a series of posts where I’d share some of the lessons we learnt. However, bear in mind the pace of change in this particular space so some of the challenges/problems we encountered might have been solved by the time you read this.

ps. many aspects of this series is already covered in a talk I gave on Amazon Lambda at Leetspeak this year, you can find the slides and recording of the talk here.

 

From A Monolithic Beginning

Back when I joined Yubl in April I inherited a monolithic Node.js backend running on EC2 instances, with MongoLab (hosted MongoDB) and CloudAMQP (hosted RabbitMQ) thrown into the mix.

yubl-monolith

There were numerous problems with the legacy system, some could be rectified with incremental changes (eg. blue-green deployment) but others required a rethink at an architectural level. Although things look really simple on paper (at the architecture diagram level), all the complexities are hidden inside each of these 3 services and boy, there were complexities!

My first tasks were to work with the ops team to improve the existing deployment pipeline and to draw up a list of characteristics we’d want from our architecture:

  • able to do small, incremental deployments
  • deployments should be fast, and requires no downtime
  • no lock-step deployments
  • features can be deployed independently
  • features are loosely coupled through messages
  • minimise cost for unused resources
  • minimise ops effort

From here we decided on a service-oriented architecture, and Amazon Lambda seemed the perfect tool for the job given the workloads we had:

  • lots of APIs, all HTTPS, no ultra-low latency requirement
  • lots of background tasks, many of which has soft-realtime requirement (eg. distributing post to follower’s timeline)

 

To a Serverless End

It’s suffice to say that we knew the migration was going to be a long road with many challenges along the way, and we wanted to do it incrementally and gradually increase the speed of delivery as we go.

“The lead time to someone saying thank you is the only reputation metric that matters”

– Dan North

The first step of the migration was to make the legacy systems publish state changes in the system (eg. user joined, user A followed user B, etc.) so that we can start building new features on top of the legacy systems.

To do this, we updated the legacy systems to publish events to Kinesis streams.

 

Our general strategy is:

  • build new features on top of these events, which usually have their own data stores (eg. DynamoDB, CloudSearch, S3, BigQuery, etc.) together with background processing pipelines and APIs
  • extract existing features/concepts from the legacy system into services that will run side-by-side
    • these new services will initially be backed by the same shared MongoLab database
    • other services (including the legacy ones) are updated to use hand-crafted API clients to access the encapsulated resources via the new APIs rather than hitting the shared MongoLab database directly
    • once all access to these resources are done via the new APIs, data migration (usually to DynamoDB tables) will commence behind the scenes
  • wherever possible, requests to existing API endpoints are forwarded to the new APIs so that we don’t have to wait for the iOS and Android apps to be updated (which can take weeks) and can start reaping the benefits earlier

 

After 6 months of hard work, my team of 6 backend engineers (including myself) have drastically transformed our backend infrastructure. Amazon was very impressed by the work we were doing with Lambda and in the process of writing up a case study of our work when Yubl was shut down at the whim of our major shareholder.

Here’s an almost complete picture of the architecture we ended up with (some details are omitted for brevity and clarity).

overall

Some interesting stats:

  • 170 Lambda functions running in production
  • roughly 1GB of total deployment package size (after Janitor Lambda cleans up unreferenced versions)
  • Lambda cost was around 5% of what we pay for EC2 for a comparable amount of compute
  • the no. of production deployments increased from 9/month in April to 155 in September

 

For the rest of the series I’ll drill down into specific features, how we utilised various AWS services, and how we tackled the challenges of:

  • centralised logging
  • centralised configuration management
  • distributed tracing with correlation IDs for Lambda functions
  • keeping Lambda functions warm to avoid coldstart penalty
  • auto-scaling AWS resources that do not scale dynamically
  • automatically clean up old Lambda function versions
  • securing sensitive data (eg. mongodb connection string, service credentials, etc.)

I can also explain our strategy for testing, and running/debugging functions locally, and so on. If there’s anything you’d like me to cover in particular, please leave a comment and let me know.

 

Links

Slides and recording of my Lambda talk at LeetSpeak 2016

AWS Lambda – use recursive function to process SQS messages (Part 2)

First of all, apologies for taking months to write this up since part 1, I have been extremely busy since joining Yubl. We have done a lot of work with AWS Lambda and I hope to share more of the lessons we have learnt soon, but for now take a look at this slidedeck to get a flavour.

 

At the end of part 1 we have a recursive Lambda function that will:

  1. fetch messages from SQS using long-polling
  2. process any received messages
  3. recurse by invoking itself asynchronously

however, on its own there are a number of problems:

  • any unhandled exceptions will kill the loop
  • there’s no way to scale the no. of processing loops elastically

 

We can address these concerns with a simple mechanism where:

  • a DynamoDB table stores a row for each loop that should be running
  • when scaling up, a new row is added to the table, and a new loop is started with the Hash Key (let’s call it a token) as input;
  • on each recursion, the looping function will check if its token is still valid
    • if yes, then the function will upsert a last_used timestamp against its token
    • if not (ie, the token is deleted) then terminates the loop
  • when scaling down, one of the tokens is chosen at random and deleted from the table, which causes the corresponding loop to terminate at the end of its current recursion (see above)
  • another Lambda function is triggered every X mins to look for tokens whose last_used timestamp is older than some threshold, and starts a new loop in its place

Depending on how long it takes the processing function to process all its SQS messages – limited by the max 5 min execution time – you can either adjust the threshold accordingly. Alternatively, you can provide another GUID when starting each loop (say, a loop_id?) and extend the loop’s token check to make sure the loop_id associated with its token matches its own.

 

High Level Overview

To implement this mechanism, we will need a few things:

  • a DynamoDB table to store the tokens
  • a function to look for dead loops restart them
  • a CloudWatch event to trigger the restart function every X mins
  • a function to scale_up the no. of processing loops
  • a function to scale_down the no. of processing loops
  • CloudWatch alarm(s) to trigger scaling activity based on no. of messages in the queue
  • SNS topics for the alarms to publish to, which in turn triggers the scale_up and scale_down functions accordingly*

* the SNS topics are required purely because there’s no support for CloudWatch alarms to trigger Lambda functions directly yet.

architecture

A working code sample is available on github which includes:

  • handler code for each of the aforementioned functions
  • s-resources-cf.json that will create the DynamoDB table when deployed
  • _meta/variables/s-variables-*.json files which also specifies the name of the DynamoDB table (and referenced by s-resources-cf.json above)

I’ll leave you to explore the code sample at your own leisure, and instead touch on a few of the intricacies.

 

Setting up Alarms

Assuming that your queue is mostly empty as your processing is keeping pace with the rate of new items, then a simple staggered set up should suffice here, for instance:

cw-alarm-2

Each of the alarms would send notification to the relevant SNS topic on ALARM and OK.

cw-alarm-1

 

Verifying Tokens

Since there’s no way to message a running Lambda function directly, we need a way to signal it to stop recursing somehow, and this is where the DynamoDB table comes in.

At the start of each recursion, the processing-msg function will check against the table to see if its token still exists. And since the table is also used to record a heartbeat from the function (for the restart function to identify failed recursions), it also needs to upsert the last_used timestamp against the token.

We can combine the two ops in one conditional write, which will fail if the token doesn’t exist, and that error will terminate the loop.

ps. I made the assumption in the sample code that processing the SQS messages is quick, hence why the timeout setting for the process-msg function is 30s (20s long polling + 10s processing) and the restart function’s threshold is a very conservative 2 mins (it can be much shorter even after you take into account the effect of eventual consistency).

If your processing logic can take some time to execute – bear in mind that the max timeout allowed for a Lambda function is 5 mins – then here’s a few options that spring to mind:

  1. adjust the restart function’s threshold to be more than 5 mins : which might not be great as it lengthens your time to recovery when things do go wrong;
  2. periodically update the last_used timestamp during processing : which also needs to be conditional writes, whilst swallowing any errors;
  3. add an loop_id in to the DynamoDB table and include it in the ConditionExpression : that way, you keep the restart function’s threshold low and allow the process-msg function to occasionally overrun; when it does, a new loop is started in its place and takes over its token with a new loop_id so that when the overrunning instance finishes it’ll be stopped when it recurses (and fails to verify its token because the loop_id no longer match)

Both option 2 & 3 strike me as reasonable approaches, depending on whether your processing logic are expected to always run for some time (eg, involves some batch processing) or only in unlikely scenarios (occasionally slow third-party API calls).

 

Scanning for Failed Loops

The restart function performs a table scan against the DynamoDB table to look for tokens whose last_used timestamp is either:

  • not set : the process-msg function never managed to set it during the first recursion, perhaps DynamoDB throttling issue or temporary network issue?
  • older than threshold : the process-msg function has stopped for whatever reason

By default, a Scan operation in DynamoDB uses eventually consistent read, which can fetch data that are a few seconds old. You can set the ConsistentRead parameter to true; or, be more conservative with your thresholds.

Also, there’s a size limit of 1mb of scanned data per Scan request. So you’ll need to perform the Scan operation recursively, see here.

 

Find out QueueURL from SNS Message

In the s-resources-cf.json file, you might have noticed that the DynamoDB table has the queue URL of the SQS queue as hash key. This is so that we could use the same mechanism to scale up/down & restart processing functions for many queues (see the section below).

But this brings up a new question: “when the scale-up/scale-down functions are called (by CloudWatch Alarms, via SNS), how do we work out which queue we’re dealing with?”

When SNS calls our function, the payload looks something like this:

which contains all the information we need to workout the queue URL for the SQS queue in question.

We’re making an assumption here that the triggered CloudWatch Alarm exist in the same account & region as the scale-up and scale-down functions (which is a pretty safe bet I’d say).

 

Knowing when NOT to Scale Down

Finally – and these only apply to the scale-down function – there are two things to keep in mind when scaling down:

  1. leave at least 1 processing loop per queue
  2. ignore OK messages whose old state is not ALARM

The first point is easy, just check if the query result contains more than 1 token.

The second point is necessary because CloudWatch Alarms have 3 states – OK, INSUFFICIENT_DATA and ALARM – and we only want to scale down when transitioned from ALARM => OK.

 

Taking It Further

As you can see, there is a fair bit of set up involved. It’d be a waste if we have to do the same for every SQS queue we want to process.

Fortunately, given the current set up there’s nothing stopping you from using the same infrastructure to manage multiple queues:

  • the DynamoDB table is keyed to the queue URL already, and
  • the scale-up and scale-down functions can already work with CloudWatch Alarms for any SQS queues

Firstly, you’d need to implement additional configuration so that given a queue URL you can work out which Lambda function to invoke to process messages from that queue.

Secondly, you need to decide between:

  1. use the same recursive function to poll SQS, but forward received messages to relevant Lambda functions based on the queue URL, or
  2. duplicate the recursive polling logic in each Lambda function (maybe put them in a npm package so to avoid code duplication)

Depending on the volume of messages you’re dealing with, option 1 has a cost consideration given there’s a Lambda invocation per message (in addition to the polling function which is running non-stop).

 

So that’s it folks, it’s a pretty long post and I hope you find the ideas here useful. If you’ve got any feedbacks or suggestions do feel free to leave them in the comments section below.

Takeaways from “Simplifying the Future” by Adrian Cockcroft

Simplifying things in our daily lives

“Life is complicated… but we use simple abstractions to deal with it.”

– Adrian Cockcroft

When people say “it’s too complicated”, what they usually mean is “there are too many moving parts and I can’t figure out what it’s going to do next, that I haven’t figured out an internal model for how it works and what it does”.

Which bags the question: “what’s the most complicated thing that you can deal with intuitively?”

Driving, for instance, is one of the most complicated things that we have to do on a regular basis. It combines hand-eye-feet coordination, navigation skills, and ability to react to unforeseeable scenarios that can be life-or-death.

 

A good example of a simple abstraction is the touch-based interface you find on smart phones and pads. Kids can dissimulate the working of an iPad by experimenting with it, without needing any formal training because they can interact with them and get instant feedback which helps them build the mental model of how things work.

As engineers, we should inspire to build things that can be given to 2 year olds and they can intuitively understand how they operate. This last point reminds me of what Brett Victor has been saying for years, with inspirational talks such as Inventing on Principle and Stop Drawing Dead Fish.

Netflix for instance, has invested much effort in intuition engineering and are building tools to help people get a better intuitive understanding of how their complex, distributed systems are operating at any moment in time.

Another example of how you can take complex things and give them simple descriptions is XKCD’s Thing Explainer, which uses simple words to explain otherwise complex things such as the International Space Station, Nuclear Reactor and Data Centre.

sidebar: wrt to complexities in code, here are two talks that you might also find interesting

 

Simplifying work

Adrian mentioned Netflix’s slide deck on their culture and values:

Intentional culture is becoming an important thing, and other companies have followed suit, eg.

It conditions people joining the company on what they would expect to see once they’re onboarded, and helps frame & standardise the recruitment process so that everyone knows what a ‘good’ hire looks like.

If you’re creating a startup you can set the culture from the start, don’t wait until you have accidental culture, be intentional and early about what you want to have.

 

This creates a purpose-driven culture.

Be clear and explicit about the purpose and let people work out how best to implement that purpose.

Purposes are simple statements, whereas setting out all the individual processes you need to ensure people build the right things are much harder, it’s simpler to have a purpose-driven culture and let people self-organise around those purposes.

Netflix also found that if you impose processes on people then you drive talents away, which is a big problem. Time and again, Netflix found that people produce a fraction of what they’re capable of producing at other places because they were held back by processes, rules and other things that slow them down.

On Reverse Conway’s Law, Adrian said that you should start with an organisational structure that’s cellular in nature, with clear responsibilities and ownership for a no. of small, co-located teams – high trust & high cohesion within the team, and low trust across the teams.

The morale here is that, if you build a company around a purpose-driven, systems-thinking approach then you are building organisations that are flexible and can evolve as the technology moves on.

The more rules you put in, and the more complex and rigid it gets, then you end up with the opposite.

“You build it, you run it”

– Werner Vogel, Amazon CTO

 

Simplifying the things we build

First, you should shift your thinking from projects to products, the key difference is that whereas a project has a start and end, a product will continue to evolve for as long as it still serves a purpose. On this point, see also:

“I’m sorry, but there really isn’t any project managers in this new model”

– Adrian Cockcroft

As a result, the overhead & ratio of developers to people doing management, releases & ops has to change.

 

Second, the most important metric to optimise for is time to value. (see also “beyond features” by Dan North)

“The lead time to someone saying thank you is the only reputation metric that matters”

– Dan North

Looking at the customer values and working out how to improve the time-to-value is an interesting challenge. (see Simon Wardley’s value-chain-mapping)

And lastly, and this is a subtle point – optimise for the customers that you want to have rather than the customers you have now. Which is an interesting twist on how we often think about retention and monetisation.

For Netflix, their optimisation is almost always around converting free trials to paying customers, which means they’re always optimising for people who haven’t seen the product before. Interestingly, this feedback loop also has the side-effect of forcing the product to be simple.

On the other hand, if you optimise for power users, then you’re likely to introduce more and more features that contribute towards the product being too complicated for new users. You can potentially build yourself into a corner where you struggle to attract new users and become vulnerable to a new comers into the market with simpler products that new users can understand.

 

Monolithic apps only look simple from the outside (at the architect diagram level), but if you look under the cover to see your object dependencies then the true scale of their complexities start to become apparent. And they often are complicated because it requires discipline to enforce clear separations.

“If you require constant diligence, then you’re setting everyone up for failure and hurt.”

– Brian Hunter

Microservices enforce separation that makes them less complicated, and make those connectivities between components explicit. They are also better for on-boarding as new joiners don’t have to understand all the interdependencies (inside a monolith) that encompass your entire system to make even small changes.

Each micro-service should have a clear, well-defined set of responsibilities and there’s a cap on the level of complexities they can reach.

sidebar: the best answers I have heard for “how small should a microservice be?” are:

  • “one that can be completely rewritten in 2 weeks”
  • “what can fit inside an engineer’s head” – which Psychology tells us, isn’t a lot ;-)

 

Monitoring used to be one of the things that made microservices complicated, but the tooling has caught up in this space and nowadays many vendors (such as NewRelic) offer tools that support this style of architecture out of the box.

 

Simplifying microservices architecture

If your system is deployed globally, then having the same, automated deployment for every region gives you symmetry. Having this commonality (same AMI, auto-scaling settings, deployment pipeline, etc.) is important, as is automation, because they give you known states in your system that allows you to make assertions.

It’s also important to have systems thinking, try to come up with feedback loops that drive people and machines to do the right thing.

Adrian then referenced Simon Wardley’s post on ecosystems in which he talked about the ILC model, or, a cycle of Innovate, Leverage, and Commoditize.

He touched on Serverless technologies such as AWS Lambda (which we’re using heavily at Yubl). At the moment it’s at the Innovate stage where it’s still a poorly defined concept and even those involved are still working out how best to utilise it.

If AWS Lambda functions are your nano-services, then on the other end of the scale both AWS and Azure are going to release VMs with terabytes of memory to the general public soon – which will have a massive impact on systems such as in-memory graph databases (eg. Neo4j).

When we move to the Leverage stage, the concepts have been clearly defined and terminologies are widely understood. However, the implementations are not yet standardised, and the challenge at this stage is that you can end up with too many choices as new vendors and solutions compete for market share as mainstream adoption gathers pace.

This is where we’re at with container schedulers – Docker Swarm, Kubernetes, Nomad, Mesos, CloudFoundry and whatever else pops up tomorrow.

As the technology matures and people work out the core set of features that matter to them, it’ll start to become a Commodity – this is where we’re at with running containers – where there are multiple compatible implementations that are offered as services.

This new form of commodity then becomes the base for the next wave of innovations by providing a platform that you can build on top of.

Simon Wardley also talked about this as the cycle of War, Wonder and Peace.