CodeMotion 15–Takeaways from “Measuring micro-services”

This talk by Richard Rodger (of nearForm) was my favourite at this year’s CodeMotion conference in Rome, where he talked about why we need to change the way we think about monitoring when it comes to measuring micro-services.

 

TL;DR

Identify the invariants in your system and use them to measure its health.

 

When it comes to measuring the health of a human being, we don't focus on the minute details; instead we monitor emergent properties such as pulse, temperature and blood pressure.

Similarly, for micro-services, we need to go beyond the usual metrics of CPU and network flow and focus on the emergent properties of our system. When you have a system with tens or hundreds of moving parts, those rudimentary metrics alone can no longer tell you whether it is in good health.

 

Message Flow Rate

Messages are fundamental to micro-services style architectures and message behaviour has emergent properties that can be measured.

For instance, you can monitor the rate of user logins, or the number of messages sent to downstream systems. When it comes time to do a rolling deployment (or a canary or blue/green deployment), a noticeable change in these message rates is an indicator of bugs in your new code.

 

Message Patterns

Here are some simplified message patterns that have emerged from Richard's experience of building micro-services.

image

Interestingly, when your micro-services architecture grows organically, it tends to end up looking like the Tree pattern over time, even if you hadn't designed it that way.

 

Why Micro-services?

As many others have talked about, micro-services architectures are more complex, so why go down this road at all?

Because they help us deal with deployment risk and move away from the big-bang deployments associated with monolithic systems (a case in point being the Knight Capital Group tragedy).

 

Risk is inevitable – there's no way of completely removing the risk associated with a project – but we can reduce it. The current best practices of unit tests and code reviews aren't good enough, because they only cover the problems that we can anticipate, and they don't really tell us by how much our risk has been reduced.

(sidebar: this is also why property-based testing is so great. It moves us away from thinking about specific test cases (i.e. problems that we can anticipate) to thinking about the properties and invariants of our system. Scott Wlaschin wrote two great posts on adopting property-based testing in F#.)

 

Instead, we have to accept that the system can fail in ways we can't anticipate, so it's more valuable to measure key behaviours in production than to try to catch every possible failure case in development.

(sidebar: this echoes the message that I have heard time and again from the Netflix guys. They have also built great tooling to make it easy for them to do canary deployments; measure and confirm system behaviour before rolling out changes; and quickly roll back if necessary.)

 

It's possible to do this with monoliths too, with Facebook and GitHub being prime examples. The way they do it is to use feature flags. The downside of this approach is that you end up with messier code because it's littered with if-else statements.

With micro-services, however, your code doesn't have to be messy anymore.

(sidebar: on a related note, Sam Newman has also pointed out a number of other benefits of micro-services style architectures.)

 

Formal Methods

Richard proposed the use of Formal Methods to decide what to measure in a system. He then gave a shout out to TLA+ by Leslie Lamport (of Paxos and logical clock fame), which was used by AWS to verify DynamoDB and S3.

The important thing is to identify invariants in your system – things that should always be true about your system – and measure those.

For example, given two services – shopping cart & sales tax – as below, the ratio of message rate (i.e. requests) to these services should be 1:1.

Even as the message rates themselves change throughout the day (as dictated by user activity), this 1:1 ratio should always hold. And that is an invariant of our system that we can measure!

image

image
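
To make this concrete, here is a minimal sketch (my own, not from the talk) of what checking such a ratio invariant might look like. The service names and the 10% tolerance are made-up assumptions.

```csharp
using System;
using System.Collections.Concurrent;

// Minimal sketch: count messages per service and check the 1:1 ratio invariant.
// The service names ("cart", "sales-tax") and the 10% tolerance are assumptions
// for illustration only.
public class RatioInvariantMonitor
{
    private readonly ConcurrentDictionary<string, long> counts =
        new ConcurrentDictionary<string, long>();

    // Call this every time a service receives a message/request.
    public void Record(string service)
    {
        counts.AddOrUpdate(service, 1, (_, n) => n + 1);
    }

    // Returns true if the observed ratio is within tolerance of 1:1.
    public bool RatioHolds(string serviceA, string serviceB, double tolerance = 0.1)
    {
        long a = counts.GetOrAdd(serviceA, 0);
        long b = counts.GetOrAdd(serviceB, 0);
        if (a == 0 && b == 0) return true; // nothing observed yet

        double ratio = (double)a / Math.Max(b, 1);
        return Math.Abs(ratio - 1.0) <= tolerance;
    }
}

// Usage, e.g. evaluated over a rolling time window after a partial deployment:
//   var monitor = new RatioInvariantMonitor();
//   monitor.Record("cart");
//   monitor.Record("sales-tax");
//   if (!monitor.RatioHolds("cart", "sales-tax"))
//       Console.WriteLine("Invariant broken, consider rolling back");
```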

The same invariant applies to any system that follows the Chain pattern we mentioned earlier. In fact, each of the patterns we saw earlier gives rise to a natural invariant:

image

 

All your unit tests are checking what can go wrong. By looking for cause/effect relationships and measuring invariants, we turn that on its head and instead validate what must go right in production.

“ask not what can go wrong, ask what must go right…”

– Chris Newcombe, AWS

It becomes an enabler for doing rapid deployment.

When you do a partial deployment and see that it has broken your invariants, then you know it's time to roll back.

Richard then talked about Gilt's use of micro-services; incidentally, Gilt gave a talk at CraftConf on just that, which you can read all about here.

 

In the oil industry, there is a rule of three that says if three safety measures are close to their critical levels, then it counts as a problem – even if none of the measures has gone over its critical level individually.

We can apply the same idea to our own systems, and use measures that are approaching failure as indicators of the following (a rough sketch follows the list):

  • risk of doing a deployment
  • risk of a system failure in the near future
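
As a rough illustration (my own, not from the talk), a rule-of-three style check could look like the sketch below; the measure names, the 90% margin and the threshold of three are all invented for the example.

```csharp
using System.Collections.Generic;
using System.Linq;

// Sketch of the "rule of three": flag a problem when three or more measures are
// close to their critical levels, even if none of them has actually crossed its
// level. The margin and threshold values are arbitrary.
public static class RuleOfThree
{
    public static bool AtRisk(
        IDictionary<string, (double Current, double Critical)> measures,
        double margin = 0.9,   // "close" = at or above 90% of the critical level
        int threshold = 3)
    {
        int nearCritical = measures.Count(m => m.Value.Current >= margin * m.Value.Critical);
        return nearCritical >= threshold;
    }
}

// Usage with some invented measures:
//   var measures = new Dictionary<string, (double, double)>
//   {
//       ["cpu"]         = (0.85, 0.95),
//       ["error-rate"]  = (0.04, 0.05),
//       ["queue-depth"] = (900, 1000),
//   };
//   if (RuleOfThree.AtRisk(measures))
//       Console.WriteLine("Hold off on that deployment...");
```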

 

(sidebar: one aspect of measurement that Richard didn't touch on in his talk is the granularity of your measurements – minutes, seconds or milliseconds. It determines your time to discovery of a problem, and therefore the subsequent time to correction.

This has been a hot topic at the Monitorama conferences and ex-Netflix cloud architect Adrian Cockcroft did a great keynote on it last year.)

 

 

Links

A consistent approach to track correlation IDs through microservices

One of my key takeaways from Tammer Saleh’s microservices anti-patterns talk at CraftConf is that we need to record correlation IDs to help us debug microservices.

 

Why do we need Correlation IDs?

Suppose a user request comes in, and after various aspects of it have been handled by several services, something goes wrong.

Even though every service would have been logging along the way, it's no easy task to find all the relevant log messages for this request amongst millions of other log messages!

image

This is the problem that correlation IDs help us solve.

The idea is simple. When a user-facing service receives a request, it'll create a correlation ID, and:

  • pass it along in the HTTP header to every other service
  • include it in every log message

The correlation ID can then be used to quickly find all the relevant log messages for this request (very easy to do in Elasticsearch).

image

Additionally, you might wish to pass along other useful contextual information about the request (such as payment ID, tournament ID, etc.) to downstream services. Otherwise it would only appear in log messages from the services where it originated.

It's helpful to have this contextual information in the error log message, so that you don't have to scan through all the relevant messages (by correlation ID) to piece it together.

 

Whilst it’s a very simple pattern, it’s still one that has to be applied in every log message and every HTTP request and HTTP handler. That’s a lot of unwanted cognitive and development overhead.

In his talk, Tammer suggested that it's just a fact of life that you have to live with, and that every team has to remember to implement this pattern in every service they create.

To me that would be setting my fellow developers up to fail – someone is bound to forget at one time or another and the whole thing falls down like a house of cards.

I wanted a more systematic approach for our team.

 

The Approach

When it comes to implementation patterns like this, my natural tendency is to automate them with PostSharp. However, in this case it doesn't seem to be a suitable solution because we need to control too many different components:

  • HTTP handler to parse and capture correlation ID passed in via HTTP headers
  • logger to inject the captured correlation IDs
  • HTTP client to include captured correlation IDs as HTTP headers

Instead, this looks like a battle that needs to be fought on multiple fronts.

image

Fortunately we already have abstractions in the right places!

Logger

Our custom log4net appender can be augmented to look for any captured correlation IDs whenever it logs a message to Elasticsearch so they each appear as a separate, searchable field.
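
The appender itself isn't shown in this post, but the relevant part boils down to something like the sketch below. The appender name is invented, and the CorrelationIds helper is the class described further down.

```csharp
using log4net.Appender;
using log4net.Core;

// Sketch only: stamp every LoggingEvent with the captured correlation IDs so
// they end up as separate, searchable fields in Elasticsearch. The real
// appender (not shown in the post) also does the actual indexing.
public class CorrelationAwareElasticsearchAppender : AppenderSkeleton
{
    protected override void Append(LoggingEvent loggingEvent)
    {
        foreach (var kvp in CorrelationIds.GetAll())
        {
            // e.g. loggingEvent.Properties["CorrelationId"] = "abc-123"
            loggingEvent.Properties[kvp.Key] = kvp.Value;
        }

        // ...then ship the enriched event to Elasticsearch as usual (omitted,
        // since that part is specific to our setup).
    }
}
```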

HTTP Client

We also have a custom HttpClient that wraps the BCL HttpClient so that:

  • on timeouts, the client throws a TimeoutException instead of the confusing TaskCanceledException
  • by default, there is a short timeout of 3 seconds
  • by default, the client allows 10 consecutive timeouts before tripping the circuit breaker for 30 seconds (via Polly)
  • the client enables HTTP caching (which is disabled by default, for some reason)
  • the client has built-in support for HEAD, OPTIONS and PATCH

This HttpClient is also augmented to look for any captured correlation IDs and send them along as headers in every HTTP call.
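
The wrapper itself isn't reproduced here either; the correlation-ID part of it amounts to something like the following sketch (the class name is invented, and the real client also does the timeout, circuit breaker and caching work listed above).

```csharp
using System.Net.Http;
using System.Threading.Tasks;

// Sketch: before sending a request, copy any captured correlation IDs into the
// outgoing HTTP headers so downstream services can pick them up.
public class CorrelationAwareHttpClient
{
    private readonly HttpClient inner = new HttpClient();

    public Task<HttpResponseMessage> SendAsync(HttpRequestMessage request)
    {
        foreach (var kvp in CorrelationIds.GetAll())
        {
            // e.g. adds a header such as "X-Correlation-Id: abc-123"
            if (!request.Headers.Contains(kvp.Key))
                request.Headers.Add(kvp.Key, kvp.Value);
        }

        return inner.SendAsync(request);
    }
}
```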

Providing the Correlation IDs

The logger and HTTP client need a consistent way to get the captured correlation IDs for the current request. That's the job of the CorrelationIds class below.

The tricky thing is that depending on the hosting framework for your web application – we use both Nancy and Asp.Net WebApi – you might need a different way to track the correlation IDs, hence the ICorrelationIdsProvider abstraction.

Rather than forcing the consumer to call CorrelationIds.ProvidedBy, I could have tied in with an IoC container instead. I chose not to because I don’t want to be tied down by our current choice of IoC container.
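
The original code isn't reproduced in this post, but based on the description above, a minimal sketch of the CorrelationIds facade and the ICorrelationIdsProvider abstraction could look like this (the method names are assumptions):

```csharp
using System.Collections.Generic;

// Sketch: the hosting framework supplies an ICorrelationIdsProvider, and the
// logger/HTTP client only ever talk to the static CorrelationIds facade.
public interface ICorrelationIdsProvider
{
    IDictionary<string, string> GetAll();   // captured IDs for the current request
    void Set(string key, string value);     // capture a new or updated ID
}

public static class CorrelationIds
{
    private static ICorrelationIdsProvider provider;

    // Called once at start-up by the hosting framework (Nancy, Asp.Net WebApi, ...).
    public static void ProvidedBy(ICorrelationIdsProvider p) => provider = p;

    public static IDictionary<string, string> GetAll() =>
        provider?.GetAll() ?? new Dictionary<string, string>();

    public static void Set(string key, string value) => provider?.Set(key, value);
}
```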

HTTP Handler (Nancy)

Both Nancy and Asp.Net have mechanisms for injecting behaviour at various key points in the HTTP request processing pipeline.

In Nancy’s case, you can have a custom bootstrapper:
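
The bootstrapper code isn't included here; a rough sketch of the idea, hooking Nancy's BeforeRequest pipeline, might look like the following (the X-Correlation-Id header name and the helper methods are assumptions):

```csharp
using System;
using System.Linq;
using Nancy;
using Nancy.Bootstrapper;
using Nancy.TinyIoc;

// Sketch: capture the incoming correlation ID (or create one) at the start of
// every request, using the CorrelationIds helper shown earlier.
public class Bootstrapper : DefaultNancyBootstrapper
{
    protected override void ApplicationStartup(TinyIoCContainer container, IPipelines pipelines)
    {
        base.ApplicationStartup(container, pipelines);

        CorrelationIds.ProvidedBy(new NancyCorrelationIdsProvider());

        pipelines.BeforeRequest += ctx =>
        {
            var incoming = ctx.Request.Headers["X-Correlation-Id"].FirstOrDefault();
            CorrelationIds.Set("X-Correlation-Id", incoming ?? Guid.NewGuid().ToString());
            return null; // continue processing the request as normal
        };
    }
}
```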

But we still need to implement the aforementioned ICorrelationIdsProvider interface.

Unfortunately, I can’t use HttpContext.Current in Nancy, so we need to find an alternative way to ensure the captured correlation IDs flow through async-await operations. For this I used a modified version of Stephen Cleary’s AsyncLocal to make sure they flow through the ExecutionContext.

A simplified version of the NancyCorrelationIdsProvider implementation looks like the following:
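
The original snippet isn't reproduced here; the sketch below conveys the idea using CallContext.LogicalSetData (the mechanism Stephen Cleary's AsyncLocal wraps) together with immutable data, so the captured IDs flow across async-await via the ExecutionContext. The key name is an arbitrary choice.

```csharp
using System.Collections.Generic;
using System.Collections.Immutable;
using System.Runtime.Remoting.Messaging;

// Sketch: store the captured IDs in the logical CallContext so they flow with
// the ExecutionContext across async-await. Immutable data is used because
// CallContext values are shallow-copied between async contexts.
public class NancyCorrelationIdsProvider : ICorrelationIdsProvider
{
    private const string Key = "CorrelationIds";

    public IDictionary<string, string> GetAll()
    {
        var ids = CallContext.LogicalGetData(Key) as ImmutableDictionary<string, string>;
        return ids ?? ImmutableDictionary<string, string>.Empty;
    }

    public void Set(string key, string value)
    {
        var ids = CallContext.LogicalGetData(Key) as ImmutableDictionary<string, string>
                  ?? ImmutableDictionary<string, string>.Empty;
        CallContext.LogicalSetData(Key, ids.SetItem(key, value));
    }
}
```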

HTTP Handler (Asp.Net)

For Asp.Net WebApi projects, we can rely on HttpContext.Current to do the dirty work, so the implementation of ICorrelationIdsProvider becomes a trivial exercise.

What wasn't immediately obvious to me was how to tap into the request processing pipeline in a way that can be reused. After some research, custom HTTP modules seemed to be the way to go.

However, custom HTTP modules still need to be registered in the web application's web.config.
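
Neither the module nor the registration is shown in this post, so here is a hedged sketch of both; the module name, header name and assembly name are all assumptions.

```csharp
using System;
using System.Web;

// Sketch: an IHttpModule that captures the incoming correlation ID (or creates
// one) and stashes it in HttpContext.Current.Items, where a trivial
// ICorrelationIdsProvider implementation can pick it up.
public class CorrelationIdHttpModule : IHttpModule
{
    public void Init(HttpApplication context)
    {
        context.BeginRequest += (sender, args) =>
        {
            var incoming = HttpContext.Current.Request.Headers["X-Correlation-Id"];
            HttpContext.Current.Items["X-Correlation-Id"] = incoming ?? Guid.NewGuid().ToString();
        };
    }

    public void Dispose() { }
}

// web.config registration (IIS integrated pipeline); names are placeholders:
//   <system.webServer>
//     <modules>
//       <add name="CorrelationIdHttpModule" type="MyApp.CorrelationIdHttpModule, MyApp" />
//     </modules>
//   </system.webServer>
```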

 

So, that’s it, what do you think of our approach?

This is the best I have come up with so far, and it has a number of obvious limitations:

  • it's usable only from .Net; our Erlang-based services would need something similar
  • it only works with web services, and doesn't extend to queue workers (for Amazon SQS) or stream processors (for Amazon Kinesis)
  • it still requires some initial wiring up (which can be mitigated with scaffolding)

If you have some ideas for improvement I would very much like to hear them!

 

Links

CraftConf 15–Takeaways from “Microservice AntiPatterns”

This is another good micro-services talk from CraftConf, in which Tammer Saleh talks about common anti-patterns and some of the ways you can avoid them.

Personally I think it's great that both Tammer and Adrian have spent a lot of time talking about the challenges with micro-services at CraftConf. There has been so much hype around micro-services that many people forget that, like most things in technology, micro-services are an architectural style with its own set of trade-offs – not a free lunch.

 

Why Micro-Services?

Monolithic applications are hard to scale, and it's impossible to scale the engineering team around them, because everybody has to work on the same codebase and the whole application is deployed and updated as one big unit.

Then came SOA, which had the right intentions but was too focused on horizontal middleware instead of functional verticals. It was also beset by greedy consultants and middleware vendors.

And finally we have arrived at the age of micro-services, which embraces openness, modularity and maintainability. Most importantly, it's about the teams – it gives you a way to scale your engineering organization.

Within a micro-service architecture you can have many small, focused teams (e.g. Amazon's 2-pizza rule) that have autonomy over technologies, languages and deployment processes, and are able to practise devops and own their services in production.

 

Anti-Patterns

But, there are plenty of ways for things to go wrong. And here are some common mistakes.

 

Overzealous Services

The most common mistake is to go straight into micro-services (following the hype perhaps?) rather than starting with the simplest thing possible.

Remember, boring is good.

Micro-services are complex and add a constant tax to your development teams. They allow you to scale your development organization very large, but they slow each of the teams down and increase communication overhead.

Instead, you should start by doing the simplest thing possible, identify the hotspots as they become apparent and then extract them into micro-services (this is also how Google does it).

 

Schemas Everywhere

Another common mistake is to have a shared database between different services. This creates tight coupling between the services and means you can't deploy a micro-service independently if its changes break the shared schema. Any schema update forces you to update other micro-services too.

Instead, every service should have its own database, and when you want data from another service you go through its API rather than reaching into its database and helping yourself.

But then, services still have to talk to one another and there’s still a schema in play here at the communication layer. If you introduce a new version of your API with breaking changes then you’ll break all the services that depend on you.

A solution here would be to use semantic versioning and deploy breaking changes side by side. You can then give your consumers a deprecation period, and get rid of the old version when no one is using it.

BUT, as anyone who's ever had to deprecate anything will tell you, it doesn't matter how much notice you give people – they'll always wait till the last possible moment.

Facebook has (or had – I'm not sure if this is still the case) an ingenious solution to this problem. As the deadline for deprecation nears, they'll occasionally route some percentage of your calls to the new API. This raises a sufficient number of errors in your application to get your attention without severely compromising your business. It could start off with 10% of calls in a 5-minute window once every other day, then once a day, then a 10-minute window, and so on.

The idea is simple: give you the push you need, and a glimpse of what will happen if you miss the deadline!

 

Spiky Load between Services

You often get spiky traffic between services, and a common solution is to amortise the load by using queues between the services.

 

Hardcoded IPs and Ports

A common sin that most of us have committed. It gets us going quickly but then makes our lives hell when we need to configure/manage multiple environments and deployments.

One solution would be to use a discovery service such as Consul or etcd; there's also Netflix's Eureka.

In this solution, each service would first contact the discovery service to find out where other services are:

consul-demo

Service A will use that address until Service B can no longer be reached at it, or some timeout has occurred, in which case it'll ask the discovery service for Service B's location again.
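
As a rough illustration of the lookup step (my own, not from the talk), the sketch below asks Consul's HTTP catalog API where a service lives; the service name and agent address are placeholders.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Sketch: ask a local Consul agent where "service-b" is registered. The catalog
// API returns a JSON array of nodes, each with a ServiceAddress and ServicePort;
// here we just print the raw JSON rather than pull in a JSON library.
public static class DiscoveryDemo
{
    private static readonly HttpClient http = new HttpClient();

    public static async Task Main()
    {
        var json = await http.GetStringAsync(
            "http://localhost:8500/v1/catalog/service/service-b");

        Console.WriteLine(json);
        // A real client would parse out ServiceAddress/ServicePort, cache them,
        // and only re-query when calls to service-b start failing.
    }
}
```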

On a side note, the Red-White Push mechanism I worked on a while back works along similar lines. It solves a different problem (routing client apps to the corresponding service cluster based on version) but the idea behind it is the same.

red-white-push

 

Another solution is to use a centralised router.

centralized_router

In this solution, services access each other through the router via known URLs, and you can easily change the routes programmatically.

 

Both solutions require registration and deregistration, and both require high availability and scalability.

A discovery service is simpler to build and scale, since it doesn’t need to route all traffic and so doesn’t need to scale as much. But an application needs to know about the discovery service in order to talk to it, so it requires more work to integrate with.

A router, on the other hand, needs to handle all the traffic in your system, and it incurs an additional round-trip for every call. But it works transparently and is therefore easier to integrate and work with, and it can also be exposed externally.

 

Dogpiles

If one of your services is under load or malfunctioning, and all your other services keep retrying their failed calls, then the problem is compounded and magnified by the additional load from these retries.

Eventually, the faltering service will suffer a total outage, and potentially set off a cascading failure throughout your system.

The solution here is to have exponential backoff and implement the circuit breaker pattern.

Both patterns come high up the list of good patterns in Michael Nygard's Release It! and have been adopted by many libraries, such as Netflix's Hystrix and Polly for .Net.
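
For example, with Polly the two patterns can be combined along these lines. This is only a sketch: the thresholds are arbitrary, and the exact Polly types and API differ between versions, so check the documentation for the version you're using.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;

// Sketch: exponential backoff plus a circuit breaker, in the spirit of the
// settings described earlier (10 consecutive failures trips the breaker for
// 30 seconds). Thresholds are arbitrary.
public static class ResilientCalls
{
    private static readonly HttpClient http = new HttpClient();

    // Retry up to 3 times, waiting 2, 4, then 8 seconds between attempts.
    private static readonly Policy retry = Policy
        .Handle<HttpRequestException>()
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    // Allow 10 consecutive failures before breaking the circuit for 30 seconds.
    // The breaker is shared (static) so its state persists across calls.
    private static readonly Policy breaker = Policy
        .Handle<HttpRequestException>()
        .CircuitBreakerAsync(10, TimeSpan.FromSeconds(30));

    public static Task<string> GetAsync(string url) =>
        breaker.ExecuteAsync(() => retry.ExecuteAsync(() => http.GetStringAsync(url)));
}
```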

image

You can even use the service discovery layer to help propagate the message to all your services that a circuit has been tripped (i.e. a service is struggling).

This introduces an element of eventual consistency (the local view of the service's health vs what the discovery service is saying). Personally, I find it sufficient to let each service manage its own circuit-breaker state, so long as you have sensible settings for:

  • timeout
  • no. of timed out requests required to trip the circuit
  • amount of time the circuit is broken for before allowing one attempt through

 

Using a discovery service does have the advantage of allowing you to manually trip the circuit, e.g. in anticipation of a release where you need some downtime.

But hopefully you won't need that, because you have already invested in automating your deployment and can deploy new versions of your service without downtime.

 

Debugging Hell

Debugging is always a huge issue in micro-service architectures. For example, a nested service call fails and you can't correlate it back to the request coming from the user.

One solution is to use correlation IDs.

When a request comes in, you assign it a correlation ID and pass it on to other services in an HTTP request header. Every service does the same, passing on the correlation ID it receives in the incoming HTTP header with any outgoing requests. Whenever you log a message, be sure to include this correlation ID in the log line.

It's a simple pattern, but it's one that you'll have to apply in every one of your services, and one that is difficult to implement transparently across the entire dev team.

I don't know how it'd look, but I imagine you might be able to automate this using an AOP framework like PostSharp. Alternatively, if you're using some form of scaffolding when you start work on a new service, maybe you can include some mechanism to capture and forward correlation IDs.

 

Missing Mock Servers

When you have a service that other teams depend on, each of these teams would have to mock and stub your service in order to test their own services.

A good step in the right direction is for you to own a mock service and provide it to consumers. To make integration easier, you can also provide the client to your service, and support using the mock service via configuration. This way, you make it easy for other teams to consume your service, and to test their own services they just need to configure the client to hit the mock service instead.

You can take it a step further, by building the mock service into the client so you don’t even have to maintain a mock service anymore. And since it’s all transparently done in the client, your consumers would be none the wiser.
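
As a rough sketch of that idea (all names and the endpoint are invented), the client might expose a configuration switch that swaps the real HTTP calls for canned in-memory responses:

```csharp
using System.Net.Http;
using System.Threading.Tasks;

// Sketch: a service client with the mock "built in". Consumers flip a flag to
// get canned responses instead of real HTTP calls; names are invented.
public class ProductServiceClient
{
    private readonly bool useMock;
    private readonly HttpClient http = new HttpClient();

    public ProductServiceClient(bool useMock = false)
    {
        this.useMock = useMock;
    }

    public Task<string> GetProductAsync(string productId)
    {
        if (useMock)
        {
            // Canned response, so consumer teams can test without a running service.
            return Task.FromResult($"{{ \"id\": \"{productId}\", \"name\": \"mock product\" }}");
        }

        return http.GetStringAsync($"https://products.example.internal/api/products/{productId}");
    }
}

// In a consumer's tests:
//   var client = new ProductServiceClient(useMock: true);
```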

This is a good pattern to use if there is a small number of languages you need to support within your organization.

 

To help you get started quickly, I recommend looking at Mulesoft's Anypoint Platform. It comes with an API designer which allows you to design your API using RAML, and there's built-in mock service support that updates automatically as you save your RAML file.

image

It’s also a good way to document your API. There’s also a live API portal which is:

  • interactive – you can try out the endpoints and make requests against the mock service
  • live – it’s automatically updated as you update your API in RAML
  • good for documenting your API and sharing it with other teams

gladius_portal

We have been using it for some time and it has proven to be a useful tool that allows us to collaborate and iteratively design our APIs before we put our heads down and code. We have been able to identify unmet information needs and impedance mismatches early, reducing the amount of back-and-forth required.

 

Flying Blind

With a micro-service architecture the need for ongoing operational metrics is greater than ever before; without them you are just flying blind.

graphs_graphs_and_more_graphs

There are more than a handful of tools available in this space. There are commercial tools (some with free tiers) such as New Relic and Stackdriver (now integrated into Google App Engine), and AWS also offers CloudWatch as part of its ecosystem. In the open-source space, Netflix has been leading the way with Hystrix and something even more exciting.

The granularity of the metrics is also important. When your metrics are measured in 1-minute intervals (e.g. CloudWatch), your time to discovery of problems could be in the 10-15 minute range, which pushes the time to recovery even further out. This is why Netflix built its own monitoring system, called Atlas.

 

Also, you get an explosion of technologies – languages, runtimes and databases – which can be very painful from an operations point of view.

When you introduce containers, this problem is compounded even further, because containers are so cheap that each VM can run hundreds or even thousands of them, each potentially running a different technology stack…

The organizational solution here would be to form a team to build tools that enable developers to manage the system in an entirely automated way. This is exactly what Netflix has done, and a lot of amazing tools have come out of there and been open sourced.

 

Summary

  • Start boring and extract to services
  • Understand the hidden schemas
  • Amortize traffic with queues
  • Decouple through discovery tools
  • Contain failures with circuit breakers
  • Use mockable clients
  • Build in operational metrics from the beginning

 

 

Links

CraftConf 15–Takeaways from “Scaling micro-services at Gilt”

There were a couple of micro-services related talks at this year’s edition of CraftConf. The first was by Adrian Trenaman of Gilt, who talked about their journey from a monolithic architecture to micro-services, and from self-managed datacentres to the cloud.

 

From Monolith to Micro-Services

They started off with a monolithic Ruby on Rails architecture in 2007 and quickly grew to a $400M business within 4 years.

image

With that success came the scalability challenge, one that they couldn’t meet with their existing architecture.

It was at this point that they started to split up their monolith.

2011

A number of smaller, but still somewhat monolithic services were created – product service, user service, etc. But most of the services were still sharing the same Postgres database.

The cart service was the only one that had its own dedicated datastore, and this proved to be a valuable lesson for them. In order to evolve services independently you need to sever the hidden coupling that occurs when services share the same database.

image

Adrian also pointed out that, whilst some of the backend services had been split out, the UI applications (JSP pages) were still monoliths. Lots of business logic, such as promos and offers, was hardcoded and hard/slow to change.

Another pervasive issue was that the services were all loosely typed – everything was a map – which introduced plenty of confusion around what's in the map. This was because lots of their developers hadn't yet made the mental transition to working in a statically typed language.

2015

Fast forward to 2015, and they're now creating micro-services at a much faster rate, primarily using Scala and the Play framework.

image

They have left the original services alone as they’re well understood and haven’t had to change frequently.

The front end application has been broken up into a set of granular, page-specific applications.

There have also been significant cultural changes:

  • emergent architecture design rather than top-down
  • technology decisions are driven by KPIs where appropriate, or by simple goals where there are no measurable KPIs
  • fail fast, and be open about it so that lessons can be shared across the organization; they even hold regular meetings to allow people to openly discuss their failures

To the Cloud

Alongside the transition to micro-services, Gilt also migrated to AWS using a hybrid approach via Amazon VPC.

Every team has its own AWS account as well as a budget. However, the organization of teams can change from time to time, and with this setup it's difficult to move services between AWS accounts.

Incremental Approach

One thing to note about Gilt’s migration to micro-services is that it took a long time and they have taken an incremental approach.

Adrian explained this as being down to them prioritizing getting to market and extracting real business value over technical evolution.

Why Micro-Services

  • lessens inter-dependencies between teams
  • faster code-to-production
  • allows lots of initiatives in parallel
  • allows a different language/framework to be used by each team
  • allows graceful degradation of services
  • allows code to be easily disposable – easy to innovate, fail and move on; Greg Young also touched on this in his recent talk on optimizing for deletability

Challenges with Micro-Services

Adrian listed 7 challenges his team came across in their journey.

Staging

They found it hard to maintain staging environments across multiple teams and services. Instead, they have come to believe that testing in production is the best way to go, and it’s therefore necessary to invest in automation tools to aid with doing canary releases.

I once heard Ben Christensen of Netflix talk about the same thing, that Netflix too has come to realize that the only way to test a complex micro-services architecture is to test it in production.

That said, I’m sure both Netflix and Gilt still have basic tests to catch the obvious bugs before they release anything into the wild. But these tests would not sufficiently test the interaction between the services (something Dan North and Jessica Kerr covered in their opening keynote Complexity is Outside the Code).

To reduce the risk involved with testing in production, you should at least have:

  • a canary release mechanism to limit the impact of a bad release;
  • minimized time to roll back;
  • minimized time to discovery of problems, by having granular metrics (see ex-Netflix architect Adrian Cockcroft's talk at Monitorama 14 on this)

Ownership

Who owns the service, and what happens if the person who created the service moves on to something else?

Gilt’s decision is to have teams and departments own the service, rather than individual staff.

Deployment

Gilt is building tools over Docker to give them elastic and fast provisioning. This is kinda interesting, because there are already a number of such tools/services available.

It's not clear what is missing from these that is driving Gilt to build their own tools.

Lightweight API

They have settled on a REST-style API, although I don’t get the sense that Adrian was talking about REST in the sense that Roy Fielding described – i.e. driven by hypermedia.

They also make sure that the clients are totally decoupled from the service, and are dumb and have zero dependencies.

Adrian also gave a shout out to apidoc. Personally, we have been using Mulesoft's Anypoint Platform and it has been a very useful tool for us to document our APIs.

Audit & Alerting

In order to give your engineers full autonomy in production and still stay compliant, you need good auditing and alerting capabilities.

Gilt built a smart alerting tool called CAVE, which alerts you when system invariants – e.g. the number of orders shipped to the US in a 5-minute interval should be greater than 0 – have been violated.

Monitoring of micro-services is an interesting topic because, once again, the traditional approach to monitoring – alert me when CPU/network usage or latencies go above a threshold – is no longer sufficient, because of the complex effects of causality that run through your inter-dependent services.

As Richard Rodger pointed out in his Measuring Micro-Services talk at CodeMotion Rome this year, it's far more effective to identify and monitor emergent properties instead.

If you live in Portland then it might be worth going to Monitorama in June – it's a conference that focuses on monitoring. Last year's conference had some pretty awesome talks.

IO Explosions

I didn't get a clear sense of how it works, but Adrian mentioned that Gilt has looked to the lambda architecture for their critical-path code, and is doing pre-computation and real-time push updates to reduce the number of IO calls that occur between services.

Reporting

Since they have many databases, their data analysts would need access to all of them in order to run their reports. To simplify things, they put their analytics data onto a queue, which is then written to S3.

Whilst Adrian didn't divulge what technology Gilt is using for analytics, there are quite a number of tools/services available to you on the market.

 

Based on our experience at Gamesys Social, where we have been using Google BigQuery for a number of years, I strongly recommend that you take it into consideration. It's a managed service that allows you to run ad-hoc, SQL-like queries on exabyte-size datasets and get results back in seconds.

At this point we have around 100TB of analytics data at Gamesys Social, and our data analysts are able to run their queries every day and analyse tens of TBs of data without any help from the developers.

BigQuery is just one of many interesting technologies that we use at Gamesys Social, so if you’re interested in working with us, we’re hiring for a functional programmer right now.

 

 

Links

QCon London 2015–Takeaways from “Service Architectures at Scale, Lessons from Google and eBay”

Day three of QCon London was a treat. With full-day tracks on architecture and microservices, it presented some nice dilemmas about what to see during the day.

My favourite talk of the day was Randy Shoup’s Service Architectures at Scale, Lessons from Google and eBay.

 

Randy kicked off the session by identifying a common trend in the architecture evolution at some of the biggest internet companies.

image

An ecosystem of microservices also differs from its monolithic counterpart in that it tends to organically form many layers of dependencies rather than falling into strict tiers in a hierarchy.

 

At Google, there has never been a top-down design approach to building systems, but rather an evolutionary process using natural selection – services survive by justifying their existence through usage or they are deprecated. What appears to be a clean layering by design turned out to be an emergent property of this approach.

image

Services are built from the bottom up, but you can still end up with a clean, clear separation of concerns.

 

At Google, there are no “architect” roles, nor is there a central approval process for technology decisions. Most technology decisions are made within the team, so they’re empowered to make the decisions that are best for them and their service.

This is in direct contrast to how eBay operated early on, where there was an architecture review board which acted as a central approval body.

 

Even without the presence of a centralized control body, Google proved that it's still possible to achieve standardization across the organization.

Within Google, communication methods (e.g. network protocol, data format, structured way of expressing interfaces, etc.) as well as common infrastructure (source control, monitoring, alerting, etc.) are standardized by encouragement rather than enforcement.

image

By the sound of it, best practices and standardization are achieved through a consensus-based approach within teams, and then spread throughout the organization through:

  • encapsulation in shared/reusable libraries;
  • support for these standards in underlying services;
  • code reviews (word of mouth);
  • and most importantly the ability to search all of Google’s code to find existing examples

One drawback with following existing examples is the possibility of random anchoring – someone at one point made a decision to do things one way and then that becomes the anchor for everyone else who finds that example thereafter.

image

image

Whilst the surface areas of services are standardized, the internals of the services are not, leaving developers to choose:

  • programming language (C++, Go, Python or Java)
  • frameworks
  • persistence mechanisms

image

 

Rather than deciding on the split of microservices up front, capabilities tend to be implemented in existing services first to solve specific problems.

If a capability proves to be successful, it's extracted out and generalized as a service of its own, with a new team formed around it. Many popular services started life this way – Gmail, App Engine and BigTable, to name a few.

 

On the other hand, a failed service (e.g. Google Wave) is deprecated, but reusable technology is repurposed and the people on the team are redeployed to other teams.

 

This is a fairly self-explanatory slide and an apt description of what a microservice should look like.

image

 

As the owner of a service, your primary focus should be the needs of your clients, and meeting those needs at minimum cost and effort. This includes leveraging common tools, infrastructure and existing services, as well as automating as much as possible.

The service owner should have end-to-end ownership, and the mantra should be “You build it, you run it”.

The teams should have autonomy to choose the right technology and be held responsible for the results of those choices.

 

Your service should have a bounded context; its primary focus should be on its clients and the services that depend on it.

You should not have to worry about the complete ecosystem or the underlying infrastructure, and this reduced cognitive load also means the teams can be extremely small (usually 3-5 people) and nimble. Having a small team also bounds the amount of complexity that can be created (i.e. use Conway’s law to your advantage).

 

Treat each service-to-service relationship as a vendor-client relationship, with clear ownership and division of responsibility.

To give people the right incentives, you should charge for usage of the service; this aligns the economic incentives for both sides to optimize for efficiency.

With a vendor-client relationship (with SLAs and all) you're incentivized to reduce the risk that comes with making changes, hence pushing you towards making small incremental changes and employing solid development practices (code reviews, automated tests, etc.).

 

You should never break your clients’ code, hence it’s important to keep backward/forward compatibility of interfaces.

You should provide an explicit deprecation policy and give your clients strong incentives to move off old versions.

 

Services at scale are highly exposed to performance variability.

image

Tail latencies (e.g. the 95th and 99th percentiles) are much more important than average latencies. It's easier for your clients to program against consistent performance.

 

Services at scale are also highly exposed to failures.

(disruptions are 10x more likely from human errors than software/hardware failures)

You should have resilience in depth, with redundancy for hardware failures, and the capability for incremental deployments:

  • Canary releases
  • Staged rollouts
  • Rapid rollbacks

eBay also use ‘feature flags’ to decouple code deployment from feature deployment.

And of course, monitoring…

image

 

Finally, here are some anti-patterns to look out for:

Mega-Service – a service that does too much, à la mini-monolith

Shared persistence – breaks encapsulation, encourages 'backdoor' violations, and can lead to hidden coupling between services (think integration via databases…)

 

Gamesys Social

As I sat through Randy's session, I was surprised and proud to find that we have employed many similar practices in my team (the backend team at Gamesys Social) – a seal of approval, if you like:

  • not having architect roles, and instead using a consensus-based approach to make technology decisions
  • standardization via encouragement
  • allowing you to experiment with approaches/tech, and not penalizing you when things don't pan out (the learning is also a valuable output of the experiment)
  • organic growth of microservices (proving them in existing services first before splitting them out and generalizing)
  • placing a high value on automation
  • autonomy for the team, and the DevOps philosophy of "you build it, you run it"
  • deployment practices – canary releases, staged rollouts, use of feature flags and our twist on blue-green deployment

 

I’m currently looking for some functional programmers to join the team, so if this sounds like the sort of environment you would like to work in, then have a look at our job spec and apply!


 

Links

Slides for the talk

We’re hiring a Functional Programmer!