CraftConf 15–Takeaways from “Jepsen IV: Hope Springs Eternal”

This talk by Kyle Kingsbury (aka @aphyr on twitter) was my favourite at CraftConf, and gave us an update on the state of consistency with MongoDB, Elasticsearch and Aerospike.

 

Kyle opened by talking about how we so often build applications on top of databases, queues, streams, etc., and how these systems we depend on are really quite flammable (hence the tyre analogy).

image

…as anybody who’s ever used any database knows, everything is on fire all the time! But our goal is to pretend, and ensure that everything still works… we need to isolate the system from failures.

 

– Kyle Kingsbury

image

 

This led nicely into the types of failures that the rest of the talk would focus on – split brain, broken foreign keys, etc. The purpose of his Jepsen project is to analyse how a system behaves in the face of these failures.

 

A system has boundaries, and these boundaries should be protected by a set of invariants – e.g. if you put something into a queue then you should be able to read it out afterwards.

The rest of the talk splits into two halves.

The 1st half built up a model for talking about consistency:

image

and the 2nd half looked at a number of specific databases – Elasticsearch, MongoDB and Aerospike – to see how they stacked up against the consistency guarantees they claim to offer.

 

Rather than trying to explain them here and doing a bad job of it, I suggest you read Kyle’s post on the different consistency models from his diagram.

It’s a 15–20 minute read, after which you might also be interested in giving these two posts a read.

 

Instead I’ll just list a few key points I noted during the session:

  • CAP theorem tells us that a linearizable system cannot be totally available
  • for the consistency models in red, you can’t have total availability (the A in CAP) during a partition
  • for total availability, look to the area
  • weaker consistency models are more available in case of failure
  • weaker consistency models are also less intuitive
  • weaker consistency models are faster because they require less coordination
  • weak is not the same as unsafe – safety depends on what you’re trying to do, e.g. eventual consistency is ok for counters, but for claiming unique usernames you need linearizability

Kyle’s Jepsen client uses a black-box testing approach to test database systems (i.e. it only looks at results from a client’s perspective) whilst inducing network partitions to see how the database behaves during a partition.

The clients generate random operations and apply them to the system. Since the clients run on the same JVM, you can use linearizable data structures to record a history of the results they receive, and then use that history to detect consistency violations.

This is similar to the generative testing approach used by QuickCheck. Scott Wlaschin has two excellent posts to help you get started with FsCheck, an F# port of QuickCheck.
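
To make the idea a bit more concrete, here’s a minimal sketch (in Scala, with made-up names, and an in-memory queue standing in for the real system) of the generative approach – generate random operations, record the history of results as seen by the client, then check an invariant against that history:

```scala
import scala.collection.mutable
import scala.util.Random

object QueueInvariantCheck {
  sealed trait Op
  case class Enqueue(value: Int) extends Op
  case object Dequeue extends Op

  // Stand-in for the real queue/database client under test.
  val systemUnderTest = mutable.Queue[Int]()

  // History of operations and their results, as observed by the client.
  case class Observation(op: Op, result: Option[Int])

  def applyOp(op: Op): Observation = op match {
    case Enqueue(v) => systemUnderTest.enqueue(v); Observation(op, None)
    case Dequeue    => Observation(op, if (systemUnderTest.isEmpty) None else Some(systemUnderTest.dequeue()))
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    val ops = List.fill(1000)(if (rng.nextBoolean()) Enqueue(rng.nextInt(100)) else Dequeue)
    val history = ops.map(applyOp)

    // Invariant: everything we dequeued must have been enqueued at some point.
    val enqueued = history.collect { case Observation(Enqueue(v), _) => v }
    val dequeued = history.collect { case Observation(Dequeue, Some(v)) => v }
    val violations = dequeued.diff(enqueued)
    println(if (violations.isEmpty) "invariant holds" else s"violations: $violations")
  }
}
```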

 

“MongoDB is not a bug, it’s a database”

– Kyle Kingsbury

and thus began a very entertaining second half of the talk as Kyle shared results from his tests against MongoDB, Elasticsearch and Aerospike.

 

TL;DR

None of the databases were able to meet the consistency level they claim to offer, but at least Elasticsearch is honest about it and doesn’t promise you the moon.

 

Again, seeing as Kyle has recently written about these results in detail, I won’t repeat them here. The talk doesn’t go into quite as much depth, so if you have time I recommend reading his posts:

 

It was fun watching Kyle shoot holes through these database vendors’ consistency claims, and some of the fun-poking was really quite funny (and well deserved on the vendors’ part).

But if there’s one thing you should take away from Kyle’s talk, and his work with Jepsen in general, it’s this: don’t drink the kool-aid.

Database vendors have a history of over-selling and at times outright false marketing. As developers, we have the means to verify their claims, so the next time you hear a claim that’s too good to be true, verify it – don’t drink the kool-aid.

 

Links

CraftConf 15–Takeaways from “Architecture Without an End State”

In his talk, Michael Nygard observed that in enterprises we tend to sell architectural projects by showing the current messy state and pitching an idealized end state. But the problem is that the idealized end state exists only as an idea – it hasn’t gone through all the compromises and organic growth that made the current state messy.

Once we have gone through the process of implementing and growing the system over the course of its lifetime, how do we know we won’t end up in the same messy situation again?

bsg-happened-before

The 3-year Plans

Worse still, these pitches are always followed by a 3 year plan where you see no return on your investment for 2 whole years!

3-year-plan

And since 3 years is a long time, many things can happen along the way – mergers, acquisitions, new technologies, a new CTO/CIO, etc. – and often we never realize the original 3-year plan before these forces of change take us in a new direction.

And the vicious cycle continues….
3-year-plans

 

Within a large organization there are always multiple forces of change happening at the same time, and your perspective changes depending on your position within the organization.

In a complex sociotechnical system like this, there’s no privileged vantage point. One person cannot understand all ramifications of their decisions, not even the CEO of a company.

The best you can do is take locally contextual actions.

 

8 Rules for Survival

1. Embrace plurality

Avoid the tendency to reach for the “one”, e.g.

  • single source of truth;
  • single system of record;
  • one system to rule them all;
  • etc.

The problem with having a single system of record (SSOR) is that it’s terribly fragile. Many real-world events (e.g. M&A, divestitures, entering a new market) can easily undermine your pursuit of the “one” system.

In epistemology (i.e. the study of “how we know what we know”), there’s the idea that we can only know what we are able to record in a system. If your SSOR is badly shaped for the information that exists, you might be blind to the rest of the information you’re looking for.

 

When modelling a domain, sometimes it’s even tough to find “one” definition for something that everyone can agree on.

For example, a “customer” can mean different things to different departments:

  • for customer service, a customer is someone who is paying you money
  • for sales, someone is a customer as soon as you start talking to them

 

Finally, there’s the philosophical idea that the model you use to view the world shapes the thoughts you are able to think.

Your paradigm defines the questions you can ask.

ludwig-wittgenstein

I think this also applies to programming languages and paradigms, with the same detrimental effects in limiting our potential as creative problem solvers. And it’s a good reason why you should learn at least a few programming paradigms in order to expand your mind.


 

When dealing with multiple systems of records, we change our emphasis and instead focus on:

  • who is the authority for what span?
  • defining a system to find the authority for a span given some unique identifier, e.g. URIs, URNs (AWS’s ARN system springs to mind)
  • agreeing on a representation so we can interchange information across different systems
  • enabling copies and having a mechanism for synchronization

Something interesting happens when you do these – your information consumers no longer make assumptions about where the information comes from.

If you provide full URLs with your product details, then the consumers can go to any of the URLs for your product information. They must become more flexible and assume there is an open world of information out there.

Along with the information you get, you also need a mechanism for determining what actions you can perform with them. These actions can be encoded as URLs and may point to different services.


As I listened, two words sprang to mind:

microservices – breaking up a big, complex system into smaller parts each with authority over its own span; information is accessed through a system of identifiers and an agreed-upon format for interchange (rather than shared database, etc.)

HATEOAS – or REST APIs in the way Roy Fielding originally described them; the actions permitted on a piece of information are passed back alongside the information itself, encoded as URLs
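
To illustrate (this is just a sketch with made-up resource names and URLs, not anything Michael showed), a representation that carries its permitted actions as links might look like this:

```scala
// A HATEOAS-style representation: the resource carries links for the actions
// the consumer is currently allowed to take, so downstream consumers don't
// have to encode those business rules themselves.
case class Link(rel: String, href: String)
case class OrderRepresentation(id: String, status: String, links: List[Link])

object Hateoas {
  def represent(id: String, status: String): OrderRepresentation = {
    val base = s"https://orders.example.com/orders/$id"   // made-up URL scheme
    val actions = status match {
      case "pending" => List(Link("cancel", s"$base/cancel"), Link("pay", s"$base/payment"))
      case "paid"    => List(Link("refund", s"$base/refund"))
      case _         => Nil
    }
    OrderRepresentation(id, status, Link("self", base) :: actions)
  }

  def main(args: Array[String]): Unit =
    println(represent("42", "pending"))
}
```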

 

2. Contextualize Downstream

We have a tendency to put too much in the data we send downstream, which can often end up being a very large set.

But our business rules are usually contextual:

  • not all information about an entity is required in every case
  • operations are not available across entire classes of entities

Again, HATEOAS strikes me as a particularly good fit here for contextualizing the data we send to downstream systems.


We build systems as a network of data-flows, but we also need to look at the wider ecosystem – the system of systems – and be aware of the downstream impact of our changes.

Something like the following is obviously not desirable!

downstream-impact

So the rule is:

  • Augment the data upstream, add fields but not relationships if you can avoid it.
  • Contextualize downstream, apply your business rules as close to the user as possible. Because as it turns out, the rules that are closest to the users change the most often.
  • Minimize the entities that all systems need to know about.

 

3. Beware Grandiosity

Fight the temptation to build the model to end all models (again, embrace plurality), a.k.a. the Enterprise Modelling Dictionary.

The problem with an Enterprise Modelling project is that, it requires:

  • a global perspective, which we already said is not available in a complex system;
  • agreement across all business units, which is extraordinarily hard to achieve;
  • talent for abstraction; but also
  • concrete experience in all contexts;
  • a small enough team to make decisions.

Such teams only exist in Hollywood movies..

TheATeam-77904-4

Instead, look to how standard committees work:

  • they compromise a lot;
  • they don’t seek agreement across the board, and instead agree on small subsets and leave room for vendor extensions;
  • they take a long time to converge;
  • they deliberately limit their scope

“All models are wrong, but some are useful.”

– George Box

So the rule is:

  • don’t try to model the whole world;
  • find compromises and subsets that everyone can agree on;
  • start small and incrementally expand;
  • assume you don’t know everything and leave room for extension;
  • allow a very lengthy comment period (with plenty of email reminders, since everyone leaves it till the last minute…)

 

4. Decentralize

Decentralization allows more exploration of market space because it allows multiple groups to work on different things at once.

You can also explore the solution space better because different groups can try different solutions for similar problems.

However, you do need some prerequisites for this to work successfully:

  • Transparency – methods, work and results must be visible
  • Isolation – one group’s failure cannot cause widespread damage
  • Economics – create a framework for everyone to make decisions that contribute towards the larger goal

The Boeing 777 project is an excellent example of distributed economic decision-making.

The no. 1 thing that determines a plane’s lifetime cost is the weight of the aircraft. So the weight of the aircraft is in direct trade-off with cost (e.g. lighter materials = more expensive manufacturing cost).

They worked out what the trade-off was and set out rules so that an engineer can buy a pound of weight for $300 – i.e. as an engineer, you can make a decision that’ll shave a pound of weight off the aircraft at the expense of $300 of manufacturing cost without asking anybody.

Engineering and programme managers and above have a bigger manufacturing-cost budget per pound, and anything above these predefined limits requires you to go before the governance board and justify your decisions.

What they have done is set the trade-off policy centrally: by defining the balancing forces globally, they have created a framework that allows decisions to be made locally without requiring reviews.

In IT, Michael has observed that centralization tends to lead to budget fights between departments, whereas decentralization allows for slivers of budget working towards a common goal.

 

5. Isolate Failure Domains

If you reverse the direction data flows through your systems, you can find the direction of the dependencies between them.

dir-dataflow-and-dependencies

And the thing about dependencies is that sometimes you can’t depend on them. So you definitely want to isolate yourself from your upstream dependencies. In his book, Release It!, Michael outlined a number of useful patterns to help you do just that, e.g. circuit breaker, bulkheads, timeouts, fail fast. Many of these patterns have been incorporated into libraries such as Netflix’s Hystrix.

 

Another kind of failure is when you have missed a solution space entirely. This is where modularity can give you value, in that it gives you options that you might exercise in the future.

As with options in the financial sense (a long call in this case), there’s a negative payoff at the start because you need to do some work to create the option. But if you choose to exercise the option in the future, there’s a potentially large positive payoff.

value-of-option

Whilst Michael didn’t spell it out, I think the value of modularity in this case lies in the ability to recover from having missed a solution space the first time around. If the modularity is there, then at least you have the option to adopt a different solution at a later point (i.e. exercising the option).


Michael then gave an example of a trading company that has built its system as many small applications integrated via messaging. The applications are small enough and fine-grained enough that it’s cheaper to rewrite them than to maintain them, so they treat these applications as disposable components.


I’d like to give a shout out to Greg Young’s recent talk, The Art of Destroying Software, where he talked at length about building systems whilst optimizing for deletability. That is, you should build systems so that they’re never bigger than what you can reasonably rewrite in a week. And that is as good a guideline on “how big should my microservice be?” as I have heard so far.

 

6 & 7. Data Outlives Applications; Applications Outlive Integrations

As things change, a typical layered architecture breaks when there is pressure from either side.

layered-architecture-breaks

A Hexagonal (or Ports and Adapters) architecture, on the other hand, puts your domain in the centre rather than at the bottom of the stack. The domain is then connected to the outside world via adapters that bridge between the concepts of each adapter and the concepts of the domain.

hexagonal-architecture

Hexagonal architectures make technology transitions more of a boundary-layer issue than a domain-layer issue, and allow you to explicitly represent multiple adapters from the same domain for different purposes.

 

8. Increase Discoverability

In large organizations you often find people working on projects with significant amounts of overlap due to lack of discoverability about what other people are working on.

Just as one would search Github or Google to see if a similar project exists already before embarking on his/her own open source alternative, we should be doing more to enable discoverability within companies.

Doing so allows us to move faster by building on top of others’ work, but in order to do this we need to make our work visible internally:

  • internal blogs
  • open, easily accessible code repositories
  • modern search engine

Internal teams should work like open source projects whilst retaining a product mentality.

One thing to watch out for is a budget culture where:

  • contributions to other people’s projects are not welcomed due to budget fights;
  • inquiries about each other’s work are alarming and considered a precursor to an attempt to suck your budget into theirs

Instead you should embrace an engineering culture, which tends to be much more open. You could start to build up an engineering culture by:

  • having team blogs
  • allowing users to report bugs
  • making CI servers public

 

 

Finally, stop chasing the idealized end state, and embrace the wave of change.

 

I was at QCon London earlier this year, and in Kevlin Henney’s Small is Beautiful talk he delivered a similar message: that we should think of software as products, not projects. If software were a project it would have a well-defined end state, but most software has no well-defined end state – it evolves continuously for as long as it remains desirable and purposeful.

 

 

Links

 

 

CraftConf 15–Takeaways from “Microservice AntiPatterns”

This is another good talk on micro-services at CraftConf, where Tammer Saleh talks about common antipatterns with micro-services and some of the ways you can avoid them.

Personally I think it’s great that both Tammer and Adrian spent a lot of time talking about the challenges with micro-services at CraftConf. There has been so much hype around micro-services that many people forget that, like most things in technology, micro-services are an architectural style that comes with its own set of trade-offs – they’re not a free lunch.

 

Why Micro-Services?

Monolithic applications are hard to scale, and it’s impossible to scale the engineering team around them, because everybody has to work on the same codebase and the whole application is deployed and updated as one big unit.

Then came SOA, which had the right intentions but was too focused on horizontal middleware instead of functional verticals. It was also beset by greedy consultants and middleware vendors.

And finally we have arrived at the age of micro-services, which embraces openness, modularity and maintainability. And most importantly, it’s about the teams and gives you a way to scale your engineering organization.

Within a micro-service architecture you can have many small, focused teams (e.g. Amazon’s 2-pizza rule) that have autonomy over technologies, languages and deployment processes, and are able to practice devops and own their services in production.

 

Anti-Patterns

But, there are plenty of ways for things to go wrong. And here are some common mistakes.

 

Overzealous Services

The most common mistake is to go straight into micro-services (following the hype perhaps?) rather than starting with the simplest thing possible.

Remember, boring is good.

Micro-services are complex and add a constant tax to your development teams. They allow you to scale your development organization very large, but they slow each team down and increase communication overhead.

Instead, you should start by doing the simplest thing possible, identify the hotspots as they become apparent and then extract them into micro-services (this is also how Google does it).

 

Schemas Everywhere

Another common mistake is to have a shared database between different services. This creates tight coupling between the services and means you can’t deploy the micro-services independently if it breaks the shared schema. Any schema updates will force you to update other micro-services too.

Instead, every service should have its own database, and when you want data from another service you go through its API rather than reaching into its database and help yourself.

But then, services still have to talk to one another and there’s still a schema in play here at the communication layer. If you introduce a new version of your API with breaking changes then you’ll break all the services that depend on you.

A solution here would be to use semantic versioning and deploy breaking changes side by side. You can then give your consumers a deprecation period, and get rid of the old version when no one is using it.

BUT, as anyone who’s ever had to deprecate anything will tell you, no matter how much notice you give someone, they’ll always wait till the last possible moment.

Facebook has (or had, I’m not sure if this is still the case) an ingenious solution to this problem. As the deadline for deprecation nears, they’ll occasionally route some percentage of your calls to the new API. This raises a sufficient number of errors in your application to get your attention without severely compromising your business. It could start off with 10% of calls in a 5-minute window once every other day, then once a day, then a 10-minute window, and so on.

The idea is simple: give you the push you need, and a glimpse of what would happen if you miss the deadline!
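
As a rough sketch of the idea (this is my illustration of the concept, not Facebook’s actual mechanism, and all the numbers are made up):

```scala
import scala.util.Random

// Deprecation nudging: as the deadline approaches, divert a growing fraction
// of calls to the new API so consumers notice early that the old one is going away.
object DeprecationNudge {
  private val rng = new Random()

  // Fraction of calls to divert, growing as the deadline approaches.
  def divertFraction(daysUntilDeadline: Int): Double =
    if (daysUntilDeadline > 30) 0.0
    else if (daysUntilDeadline > 14) 0.05
    else if (daysUntilDeadline > 7) 0.10
    else 0.25

  def handle(daysUntilDeadline: Int, callOldApi: () => String, callNewApi: () => String): String =
    if (rng.nextDouble() < divertFraction(daysUntilDeadline)) callNewApi() else callOldApi()

  def main(args: Array[String]): Unit = {
    val responses = (1 to 20).map { _ =>
      handle(5, () => "200 OK (v1)", () => "410 Gone (v1 is deprecated, use v2)")
    }
    responses.foreach(println)
  }
}
```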

 

Spiky Load between Services

You often get spiky traffic between services, and a common solution is to amortise the load by using queues between the services.
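
Here’s a minimal sketch of the idea – the producer can burst while the consumer drains at its own steady pace. In practice the queue would be a proper broker (SQS, RabbitMQ, Kafka, etc.) rather than an in-memory structure:

```scala
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

// Amortising spiky load with a queue: a burst of requests is absorbed by the
// queue and processed by the consumer at a steady rate.
object SpikyLoadDemo {
  val queue = new LinkedBlockingQueue[String](10000)

  def main(args: Array[String]): Unit = {
    // Producer: a burst of 200 requests arrives at once.
    val producer = new Thread(() => (1 to 200).foreach(i => queue.put(s"request-$i")))

    // Consumer: processes at a steady pace, regardless of the burst.
    val consumer = new Thread(() => {
      var msg = queue.poll(2, TimeUnit.SECONDS)
      while (msg != null) {
        Thread.sleep(5) // simulate steady processing
        msg = queue.poll(2, TimeUnit.SECONDS)
      }
    })

    producer.start(); consumer.start()
    producer.join(); consumer.join()
    println("burst absorbed and drained")
  }
}
```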

 

Hardcoded IPs and Ports

A common sin that most of us have committed. It gets us going quickly but then makes our lives hell when we need to configure/manage multiple environments and deployments.

One solution would be to use a discovery service such as Consul or etcd; there’s also Netflix’s Eureka.

In this solution, each service would first contact the discovery service to find out where other services are:

consul-demo

Service A will use that address until Service B can no longer be reached at it, or some timeout has occurred, in which case it’ll ask the discovery service for Service B’s location again.
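
A sketch of what that client-side logic might look like (the resolve function is a stand-in for whatever your discovery service exposes – this isn’t Consul’s or etcd’s actual API):

```scala
// Resolve a service's address through the discovery service, cache it, and only
// go back to the discovery service when a call fails or the cached entry expires.
class DiscoveringClient(serviceName: String,
                        resolve: String => String,          // ask the discovery service
                        call: (String, String) => String,   // (address, request) => response
                        ttlMillis: Long = 30000) {
  private var cached: Option[(String, Long)] = None

  private def address(): String = {
    val now = System.currentTimeMillis()
    cached match {
      case Some((addr, expires)) if now < expires => addr
      case _ =>
        val addr = resolve(serviceName)
        cached = Some((addr, now + ttlMillis))
        addr
    }
  }

  def request(payload: String): String =
    try call(address(), payload)
    catch {
      case _: Exception =>
        cached = None              // drop the stale address
        call(address(), payload)   // re-resolve and retry once
    }
}
```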

On a side note, the Red-White Push mechanism I worked on a while back works along similar lines. It solves a different problem (routes client apps to corresponding service cluster based on version) but the idea behind it is the same.

red-white-push

 

Another solution is to use a centralised router.

centralized_router

In this solution, services access each other via known URLs through the router, and you can easily change the routes programmatically.

 

Both solutions require registration and deregistration, and both require high availability and scalability.

A discovery service is simpler to build and scale, since it doesn’t need to route all traffic and so doesn’t need to scale as much. But an application needs to know about the discovery service in order to talk to it, so it requires more work to integrate with.

A router, on the other hand, needs to handle all traffic in your system and incurs an additional round-trip for every call. But it works transparently and is therefore easier to integrate and work with, and it can also be exposed externally.

 

Dogpiles

If one of your services is under load or malfunctioning, and all your other services keep retrying their failed calls, then the problem would be compounded and magnified by the additional load from these retries.

Eventually, the faltering service will suffer a total outage, and potentially set off a cascading failure throughout your system.

The solution here is to have exponential backoff and implement the circuit breaker pattern.

Both patterns come high up the list of good patterns in Michael Nygard’s Release It!, and have been adopted by many libraries such as Netflix’s Hystrix and Polly for .NET.

image

You can even use the service discovery layer to help propagate the message to all your services that a circuit has been tripped (i.e. a service is struggling).

This introduces an element of eventual consistency (the local view of the service’s health vs what the discovery service is saying). Personally, I find it sufficient to let each service manage its own circuit breaker state, so long as you have sensible settings for the following (see the sketch after this list):

  • timeout
  • no. of timed out requests required to trip the circuit
  • amount of time the circuit is broken for before allowing one attempt through
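
As a minimal sketch of what such a per-service circuit breaker might look like with those three settings (illustrative only – Hystrix and Polly are far more complete, with thread isolation, metrics, half-open sampling, and so on):

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit}

// A minimal circuit breaker driven by the three settings listed above.
class CircuitBreaker(timeoutMillis: Long = 2000,     // per-call timeout
                     failureThreshold: Int = 5,      // failed/timed-out calls before tripping
                     openForMillis: Long = 30000) {  // how long the circuit stays open
  private var consecutiveFailures = 0
  private var openedAt = 0L

  private def open: Boolean =
    openedAt > 0 && System.currentTimeMillis() - openedAt < openForMillis

  def call[A](operation: () => A): A = {
    // While open, fail fast; once openForMillis has elapsed, one attempt is let through.
    if (open) throw new RuntimeException("circuit open - failing fast")
    val executor = Executors.newSingleThreadExecutor()
    try {
      val result = executor.submit(new Callable[A] { def call(): A = operation() })
                           .get(timeoutMillis, TimeUnit.MILLISECONDS)
      consecutiveFailures = 0
      openedAt = 0L
      result
    } catch {
      case e: Exception =>
        consecutiveFailures += 1
        if (consecutiveFailures >= failureThreshold) openedAt = System.currentTimeMillis()
        throw e
    } finally executor.shutdownNow()
  }
}
```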

 

Using a discovery service does have the advantage of allowing you to manually trip the circuit, e.g. in anticipation of a release where you need some downtime.

But hopefully you won’t need that, because you have already invested in automating your deployment and can deploy new versions of your service without downtime.

 

Debugging Hell

Debugging is always a huge issue in micro-service architectures. E.g., a nested service call fails and you can’t correlate it back to a request coming from the user.

One solution is to use correlation IDs.

When a request comes in, you assign it a correlation ID and pass it on to other services in the HTTP request headers. Every service does the same, passing the correlation ID it receives in the incoming HTTP headers along in any outgoing requests. Whenever you log a message, be sure to include this correlation ID in the log line.

It’s a simple pattern, but it’s one that you’ll have to apply in every one of your services, and it’s difficult to implement transparently across the entire dev team.

I don’t know how it’d look, but I imagine you might be able to automate this using an AOP framework like PostSharp. Alternatively, if you’re using some form of scaffolding when you start work on a new service, maybe you can include a mechanism to capture and forward correlation IDs.
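
In the meantime, the pattern itself is simple enough to sketch (the header name and helper functions here are made up for illustration):

```scala
import java.util.UUID

// Correlation ID propagation: reuse the ID from the incoming request if present,
// otherwise mint one; include it in every log line and in the headers of every
// outgoing call to other services.
object CorrelationId {
  val Header = "X-Correlation-Id"

  def handleRequest(incomingHeaders: Map[String, String]): Unit = {
    val correlationId = incomingHeaders.getOrElse(Header, UUID.randomUUID().toString)

    log(correlationId, "received request")

    // Propagate the same ID on any outgoing calls to other services.
    val outgoingHeaders = Map(Header -> correlationId, "Accept" -> "application/json")
    callDownstream(outgoingHeaders)

    log(correlationId, "request completed")
  }

  private def log(correlationId: String, message: String): Unit =
    println(s"[correlationId=$correlationId] $message")

  private def callDownstream(headers: Map[String, String]): Unit =
    () // placeholder for an HTTP call made with `headers`
}
```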

 

Missing Mock Servers

When you have a service that other teams depend on, each of these teams would have to mock and stub your service in order to test their own services.

A good step in the right direction is for you to own a mock service and provide it to consumers. To make integration easier, you can also provide the client to your service, and support using the mock service via configuration. This way, you make it easy for other teams to consume your service, and to test their own services they just need to configure the client to hit the mock service instead.

You can take it a step further by building the mock service into the client, so you don’t even have to maintain a mock service anymore. And since it’s all done transparently in the client, your consumers would be none the wiser.
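
A sketch of what building the mock into the client might look like (all the names are made up) – consumers code against the trait, and a configuration flag decides whether calls hit the real service or canned responses:

```scala
// The client ships with both a real and a mock implementation.
trait ProductServiceClient {
  def getProduct(id: String): Option[String]
}

class HttpProductServiceClient(baseUrl: String) extends ProductServiceClient {
  def getProduct(id: String): Option[String] = {
    // the real implementation would make an HTTP call to s"$baseUrl/products/$id"
    ???
  }
}

class MockProductServiceClient extends ProductServiceClient {
  private val canned = Map("42" -> """{"id":"42","name":"test product"}""")
  def getProduct(id: String): Option[String] = canned.get(id)
}

object ProductServiceClient {
  // Consumers flip a config setting to test against the mock; there's no
  // separate mock server to deploy or maintain.
  def apply(useMock: Boolean, baseUrl: String = "https://products.example.com"): ProductServiceClient =
    if (useMock) new MockProductServiceClient else new HttpProductServiceClient(baseUrl)
}
```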

This is a good pattern to use if there is a small number of languages you need to support within your organization.

 

To help you get started quickly, I recommend looking at Mulesoft’s Anypoint Platform. It comes with an API designer which allows you to design your API using RAML and there’s a built-in mock service support that updates automatically as you save your RAML file.

image

It’s also a good way to document your API. There’s a live API portal which is:

  • interactive – you can try out the endpoints and make requests against the mock service
  • live – it’s automatically updated as you update your API in RAML
  • good for documenting your API and sharing it with other teams

gladius_portal

We have been using it for some time and it has proven to be a useful tool that allows us to collaborate and iteratively design our APIs before we put our heads down and code. We have been able to identify unmet information needs and impedance mismatches early, and reduce the amount of back-and-forth required.

 

Flying Blind

With a micro-service architecture the need for ongoing operational metrics is greater than ever before; without them you are just flying blind.

graphs_graphs_and_more_graphs

There are more than a handful of tools available in this space. There are commercial tools (some with free tiers) such as New Relic and StackDriver (now integrated into Google App Engine), and AWS also offers CloudWatch as part of its ecosystem. In the open source space, Netflix has been leading the way with Hystrix and something even more exciting.

The granularity of the metrics is also important. When your metrics are measured in 1-minute intervals (e.g. CloudWatch), your time to discovery of problems could be in the 10–15 minute range, which pushes the time to recovery even further out. This is why Netflix built its own monitoring system, called Atlas.

 

Also, you get this explosion of technologies from languages, runtimes and databases, which can be very painful from an operations point of view.

When you introduce containers, this problem is compounded even further because containers are so cheap and each VM can run hundreds or even thousands of containers, each potentially running a different technology stack…

The organizational solution here would be to form a team to build tools that enable developers to manage the system in an entirely automated way. This is exactly what Netflix has done and a lot of amazing tools have come out of there and later open sourced.

 

Summary

  • Start boring and extract to services
  • Understand the hidden schemas
  • Amortize traffic with queues
  • Decouple through discovery tools
  • Contain failures with circuit breakers
  • Use mockable clients
  • Build in operational metrics from the beginning

 

 

Links

CraftConf 15–Takeaways from “The Hidden Dimension of Refactoring”

One of the great things about CraftConf is the plethora of big name tech speakers that the organizers have managed to attract.

Michael Feathers is definitely one of the names that made me excited to go to CraftConf.

 

Know your commits

We have a bias towards the current state of our code, but we have a lot more information than that. Along the history of its commits a piece of code can vary greatly in size and complexity.

To help find the places where methods jump the shark, Michael created a tool called Method Shark, which you can get from GitHub. What he found is that it’s cultural: some places just don’t refactor much, whilst others wait until it hits some threshold.

 

We’re at an interesting stage in our industry where we’re doing lots of Big Data analysis on customer behaviour and usage trends, but we’re not doing anywhere near enough for ourselves by analysing our commit history.

Those of you who read this blog regularly would know that I’m a big fan of Adam Tornhill’s work in this area. It was great to see Michael give a shout out to Adam’s book at the end of the talk, which is available on Amazon:

image

 

Richard Gabriel noted that biologists have a different notion of modularity to us:

“Biologists think of a module as representing a common origin more so than what we think of as modularity, which they call spatial compartmentalization”

– Richard Gabriel (Design Beyond Human Abilities)

So when biologists talk about modularity, they’re concerned about evolution and ancestry and not just the current state of things.

By recording and analysing interesting events about the changes in our commits – event type (add/update), method body, name, etc. – we too are able to gain deeper insight into the evolution of our codebase.

 

Commit history analysis

Michael demonstrated a number of analyses which you might take inspiration from and apply to your own codebase.

 

Method Lifeline

This maps the size of a particular method in lines over its lifetime.

image

From this picture you can see that this particular method has been growing steadily in size so it’s a good candidate for refactoring.

Method Ascending

You can find methods that have increased in size over consecutive commits – these too are good candidates for refactoring.

It might be that the way to refactor the method is not obvious, so the decision to refactor is delayed and that indecision compounds. The result is that the method keeps growing in size over time.

You can also apply method ascending to a whole codebase. For instance, for the Rails project as of 2014 you might find:

[1827, 765, 88, 13, 2, 1]

i.e. 765 methods have been growing in size over the last 2 consecutive commits, 88 methods have been growing in size over the last 3 consecutive commits, and so on.
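
If you have each method’s size per commit, the metric itself is easy to compute. Here’s a small sketch (my illustration, not Michael’s actual tool):

```scala
// "Method ascending": given each method's size (in lines) per commit, count
// how many methods have grown over their last k consecutive commits.
object MethodAscending {
  // How many consecutive size increases lead up to the latest commit.
  def ascendingStreak(sizes: List[Int]): Int =
    sizes.zip(sizes.drop(1)).reverse.takeWhile { case (a, b) => b > a }.length

  // For k = 1..maxStreak, how many methods have been growing for at least k commits.
  def report(history: Map[String, List[Int]], maxStreak: Int): List[Int] =
    (1 to maxStreak).toList.map(k => history.values.count(s => ascendingStreak(s) >= k))

  def main(args: Array[String]): Unit = {
    val history = Map(
      "Order.total"   -> List(10, 12, 15, 21),  // growing for 3 consecutive commits
      "User.name"     -> List(3, 3, 3, 3),      // stable
      "Cart.checkout" -> List(20, 25, 24, 30)   // grew in the last commit only
    )
    println(report(history, maxStreak = 3))     // prints List(2, 1, 1)
  }
}
```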

Trending Methods

You can use this analysis to find the most changed methods in say, the last month.

Activity List

Things that change a lot are probably worth investigating too. You can identify bouts of activity on a method over a given period. For example, a method that has existed for 18 months might have been changed during bouts of activity spread over a period of 9 months.

This helps to identify methods that you keep going back to, which is a sign that they may be violations to single responsibility and/or open-closed principles.

Single Responsibility Principle

If you find methods that are changed together many times, then they might represent responsibilities within the class that we haven’t identified, and therefore are good candidates to be extracted into a separate abstraction.

Open/Closed Principle

Even when you’re working with a large codebase, chances are lots of areas of code don’t change that much.

You can find a “closure date” for each class – the last time the class was modified. Don’t concentrate your refactoring efforts on areas of code that have been closed for a long time.

Temporal Correlation of Class Changes

Find classes that are changed together to find areas in the system that have temporal coupling through coincidental changes happening across different areas.

These might be results of poor abstractions, where responsibilities haven’t been well encapsulated into one class. They can be indicators of areas that might need to be refactored or re-abstracted.

Active Set of Classes

If you plot the no. of classes at a point in time (blue) vs the no. of classes that haven’t changed since (green):

image

the distance between the two lines is the number of things you’ll be touching since that point in time.

In the example above, you can see a large number of classes was introduced in the middle of the graph,  and gradually these classes became closed over time.

Looking at another example:

image

we can see that a lot of classes were introduced in two bursts of activities and it took a long time for them to become closed.

Turbulence

This is a measure of the no. of commits on a file vs its complexity.

image

From this graph, we can see that most of the files in this codebase have relatively few commits and are of low complexity.

Beware of things in the top-left and top-right quadrants of the graph; they are things that either:

  • gained a lot of complexity without a lot of commits (did one of your hotshots decide to go to work on that file?), or;
  • are complex and change a lot (is this code so complex that nobody wants to do anything about it, so the level of complexity just grows with each commit?)

Proactive refactoring can be a way to stop the complexity of these classes from getting out of control.

Frequency of Inter-Commit Intervals

You can also discover the working style of your developers by looking at the frequency of inter-commit intervals.

image

image

Churn

Churn is the number of times things change (a quick sketch of computing it from a git repository follows below). If you look at the repository for ruby-gems, where:

  • every point on the x-axis is a file;
  • and the y-axis is the number of times it has changed in the history of the codebase

image
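
Churn is cheap to compute yourself – here’s a small sketch that counts how many commits have touched each file in a git repository (run it from inside a working copy):

```scala
import scala.sys.process._

// Count how many commits have touched each file, then list the highest-churn files.
object Churn {
  def main(args: Array[String]): Unit = {
    // One file name per line, once per commit that touched it.
    val output = Seq("git", "log", "--name-only", "--pretty=format:").!!

    val churn = output.linesIterator
      .map(_.trim)
      .filter(_.nonEmpty)
      .toList
      .groupBy(identity)
      .map { case (file, occurrences) => file -> occurrences.size }

    // Highest-churn files first - candidates for a closer look.
    churn.toList.sortBy { case (_, count) => -count }.take(20).foreach {
      case (file, count) => println(f"$count%5d  $file")
    }
  }
}
```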

When we think about the single responsibility principle and the open-closed principle, we think about classes that are well abstracted and don’t have to change frequently.

But every codebase Michael has looked at has a similar graph to the above, where there are areas of code that change very frequently and areas that hardly change at all.

Michael then referenced Susanne Siverland’s work on “Where do you save most money on refactor?”. I couldn’t find the paper afterwards, but Michael summarised her finding as – “in general, churn is a great indicator, when you compare it with other things such as complexity, in terms of the payoff you get from refactoring, and in terms of reducing defects.”

In hindsight this is pretty obvious – bugs are there because we put them there, and if we keep going back to the same areas of code then it probably means they’re non-optimal and each time we change something there’s a chance of us introducing a new bug.

If you look at Adam Tornhill’s Code as Crime Scene talk, this is also the basis for his work on geographical profiling of code.

image

 

Finally, keep in mind that these analyses are intended to give you indicators of where you should be doing targeted refactoring, but at the end of the day they’re still just indicators.

 

Links

CraftConf 15–Takeaways from “Scaling micro-services at Gilt”

There were a couple of micro-services related talks at this year’s edition of CraftConf. The first was by Adrian Trenaman of Gilt, who talked about their journey from a monolithic architecture to micro-services, and from self-managed datacentres to the cloud.

 

From Monolith to Micro-Services

They started off with a Ruby on Rails monolithic architecture in 2007, but quickly grew to a $400M business within 4 years.

image

With that success came the scalability challenge, one that they couldn’t meet with their existing architecture.

It was at this point that they started to split up their monolith.

2011

A number of smaller, but still somewhat monolithic services were created – product service, user service, etc. But most of the services were still sharing the same Postgres database.

The cart service was the only one that had its own dedicated datastore, and this proved to be a valuable lesson to them: in order to evolve services independently, you need to sever the hidden coupling that occurs when services share the same database.

image

Adrian also pointed out that, whilst some of the backend services had been split out, the UI applications (JSP pages) were still monoliths. Lots of business logic, such as promos and offers, was hardcoded and hard/slow to change.

Another pervasive issue was that the services were all loosely typed – everything was a map – which introduced plenty of confusion around what’s actually in the map. This was because lots of their developers still hadn’t made the mental transition to working in a statically typed language.

2015

Fast forward to 2015, and they’re now creating micro-services at a much faster rate, using primarily Scala and the Play framework.

image

They have left the original services alone as they’re well understood and haven’t had to change frequently.

The front end application has been broken up into a set of granular, page-specific applications.

There has also been significant cultural changes:

  • emergent architecture design rather than top-down
  • technology decisions are driven by KPIs where appropriate, or by simple goals where there are no measurable KPIs
  • fail fast, and be open about it so that lessons can be shared amongst the organization; they even hold regular meetings to allow people to openly discuss their failures

To the Cloud

Alongside the transition to micro-services, Gilt also migrated to AWS using a hybrid approach via Amazon VPC.

Every team has its own AWS account as well as a budget. However, the organization of teams can change from time to time, and with this setup it’s difficult to move services between AWS accounts.

Incremental Approach

One thing to note about Gilt’s migration to micro-services is that it took a long time and they have taken an incremental approach.

Adrian explained this as being down to them prioritizing getting to market and extracting real business value over technical evolution.

Why Micro-Services

  • it lessens inter-dependencies between teams
  • faster code-to-production
  • allows lots of initiatives in parallel
  • allows different language/framework to be used by each team
  • allows graceful degradation of services
  • allows code to be easily disposable – easy to innovate, fail and move on; Greg Young also touched on this in his recent talk on optimizing for deletability

Challenges with Micro-Services

Adrian listed 7 challenges his team came across in their journey.

Staging

They found it hard to maintain staging environments across multiple teams and services. Instead, they have come to believe that testing in production is the best way to go, and it’s therefore necessary to invest in automation tools to aid with doing canary releases.

I once heard Ben Christensen of Netflix talk about the same thing, that Netflix too has come to realize that the only way to test a complex micro-services architecture is to test it in production.

That said, I’m sure both Netflix and Gilt still have basic tests to catch the obvious bugs before they release anything into the wild. But these tests would not sufficiently test the interaction between the services (something Dan North and Jessica Kerr covered in their opening keynote Complexity is Outside the Code).

To reduce the risk involved with testing in production, you should at least have:

  • a canary release mechanism to limit the impact of a bad release;
  • a way to minimize the time to roll back;
  • granular metrics to minimize the time to discovery of problems (see ex-Netflix architect Adrian Cockcroft’s talk at Monitorama 14 on this)

Ownership

Who owns the service, and what happens if the person who created the service moves onto something else?

Gilt’s decision is to have teams and departments own the service, rather than individual staff.

Deployment

Gilt is building tools on top of Docker to give them elastic and fast provisioning. This is kinda interesting, because there are already a number of such tools/services available, and it’s not clear what is missing from them that is driving Gilt to build their own tools.

Lightweight API

They have settled on a REST-style API, although I don’t get the sense that Adrian was talking about REST in the sense that Roy Fielding described – i.e. driven by hypermedia.

They also make sure that the clients are totally decoupled from the service, and are dumb and have zero dependencies.

Adrian also gave a shout out to apidoc. Personally, we have been using Mulesoft’s Anypoint Platform and it has been a very useful tool for us to document our APIs.

Audit & Alerting

In order to give your engineers full autonomy in production and still stay compliant, you need good auditing and alerting capabilities.

Gilt built a smart alerting tool called CAVE, which alerts you when system invariants – e.g. number of orders shipped to the US in 5 min interval should be greater than 0 – have been violated.

Monitoring of micro-services is an interesting topic because, once again, the traditional approach to monitoring – alert me when CPU/network usage or latencies go above a threshold – is no longer sufficient, because of the complex effects of causality that run through your inter-dependent services.

Instead, as Richard Rodger pointed out in his Measuring Micro-Services talk at Codemotion Rome this year, it’s far more effective to identify and monitor emergent properties.

If you live in Portland then it might be worth going to Monitorama in June, it’s a conference that focuses on monitoring. Last year’s conference had some pretty awesome talks.

IO Explosions

I didn’t get a clear sense of how it works, but Adrian mentioned that Gilt has looked to lambda architecture for their critical path code, and are doing pre-computation and real-time push updates to reduce the number of IO calls that occur between services.

Reporting

Since they have many databases, it means their data analysts would need access to all the databases in order to run their reports. To simplify things, they put their analytics data into a queue which is then written to S3.

Whilst Adrian didn’t divulge what technology Gilt is using for analytics, there are quite a number of tools/services available on the market.

 

Based on our experience at Gamesys Social where we have been using Google BigQuery for a number of years, I strongly recommend that you take it into consideration. It’s a managed service that allows you to run SQL-like, ad-hoc queries on Exabyte-size datasets and get results back in seconds.

At this point we have around 100TB of analytics data at Gamesys Social and our data analysts are able to run their queries everyday and analyse 10s of TBs of data without any help from the developers.

BigQuery is just one of many interesting technologies that we use at Gamesys Social, so if you’re interested in working with us, we’re hiring for a functional programmer right now.

 

 

Links