CraftConf 15–Takeaways from “Microservice AntiPatterns”

This is anoth­er good talk on micro-ser­vices at Craft­Conf, where Tam­mer Saleh talks about com­mon antipat­terns with micro-ser­vices and some of the ways you can avoid them.

Per­son­al­ly I think it’s great that both Tam­mer and Adri­an have spent a lot of time talk­ing about chal­lenges with micro-ser­vices at Craft­Conf. There has been so much hype around micro-ser­vices that many peo­ple for­get that like most things in tech­nol­o­gy, micro-ser­vice is an archi­tec­tur­al style that comes with its own set of trade-offs and not a free lunch.


Why Micro-Services?

Monolithic applications are hard to scale, and it’s impossible to scale the engineering team around them, because everybody has to work on the same codebase and the whole application is deployed and updated as one big unit.

Then came SOA, which had the right inten­tions but was too focused on hor­i­zon­tal mid­dle­ware instead of func­tion­al ver­ti­cals. It was also beset by greedy con­sul­tants and mid­dle­ware ven­dors.

And final­ly we have arrived at the age of micro-ser­vices, which embraces open­ness, mod­u­lar­i­ty and main­tain­abil­i­ty. And most impor­tant­ly, it’s about the teams and gives you a way to scale your engi­neer­ing orga­ni­za­tion.

Within a micro-service architecture you can have many small, focused teams (e.g. Amazon’s 2 pizza rule) that have autonomy over technologies, languages and deployment processes, and are able to practice devops and own their services in production.



But, there are plen­ty of ways for things to go wrong. And here are some com­mon mis­takes.


Overzealous Services

The most com­mon mis­take is to go straight into micro-ser­vices (fol­low­ing the hype per­haps?) rather than start­ing with the sim­plest thing pos­si­ble.

Remem­ber, bor­ing is good.

Microservices are complex and they add a constant tax to your development teams. They allow you to scale your development organization very large, but they slow each of the teams down and increase communication overhead.

Instead, you should start by doing the sim­plest thing pos­si­ble, iden­ti­fy the hotspots as they become appar­ent and then extract them into micro-ser­vices (this is also how Google does it).


Schemas Everywhere

Another common mistake is to have a shared database between different services. This creates tight coupling between the services and means you can’t deploy a micro-service independently if its changes break the shared schema. Any schema update will force you to update other micro-services too.

Instead, every service should have its own database, and when you want data from another service you go through its API rather than reaching into its database and helping yourself.

But then, ser­vices still have to talk to one anoth­er and there’s still a schema in play here at the com­mu­ni­ca­tion lay­er. If you intro­duce a new ver­sion of your API with break­ing changes then you’ll break all the ser­vices that depend on you.

A solu­tion here would be to use seman­tic ver­sion­ing and deploy break­ing changes side by side. You can then give your con­sumers a dep­re­ca­tion peri­od, and get rid of the old ver­sion when no one is using it.

BUT, as anyone who’s ever had to deprecate anything would tell you, it doesn’t matter how much notice you give someone, they’ll always wait till the last possible moment.

Facebook has (or had, not sure if this is still the case) an ingenious solution to this problem. As the deadline for deprecation nears they’ll occasionally route some percentage of your calls to the new API. This raises a sufficient number of errors in your application to get your attention without severely compromising your business. It could start off with 10% of calls in a 5-minute window once every other day, then once a day, then a 10-minute window, and so on.

The idea is sim­ple, to give you the push you need and a glimpse of what would hap­pen if you miss the dead­line!
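To make the idea concrete, here is a minimal sketch of such a "brownout" schedule. It is not Facebook's implementation; the dates, window lengths, fractions and the names `BROWNOUTS`, `reroute_fraction` and `pick_api_version` are all made up for illustration.

```python
import random
from datetime import datetime, timedelta

# Hypothetical schedule: (window start, window length, fraction of calls rerouted).
BROWNOUTS = [
    (datetime(2015, 6, 1, 10, 0), timedelta(minutes=5), 0.10),
    (datetime(2015, 6, 3, 10, 0), timedelta(minutes=5), 0.10),
    (datetime(2015, 6, 4, 10, 0), timedelta(minutes=10), 0.10),
]

def reroute_fraction(now, schedule=BROWNOUTS):
    """Return the fraction of old-API calls to send to the new API at `now`."""
    for start, length, fraction in schedule:
        if start <= now < start + length:
            return fraction
    return 0.0

def pick_api_version(now, rng=random.random, schedule=BROWNOUTS):
    """Serve 'v2' to a random sample of calls during a brownout window, else 'v1'."""
    return "v2" if rng() < reroute_fraction(now, schedule) else "v1"
```

Outside the scheduled windows every call still goes to the old version, so consumers only see errors in short, bounded bursts.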


Spiky Load between Services

You often get spiky traf­fic between ser­vices, and a com­mon solu­tion is to amor­tise the load by using queues between the ser­vices.
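As a rough in-process illustration of the idea, the sketch below uses Python's standard `queue` module as a stand-in for a real message broker between two services. The function name `run_with_queue` and the `max_pending` limit are assumptions made for this example.

```python
import queue
import threading

def run_with_queue(burst, handle, max_pending=1000):
    """Accept a burst of requests at once, but process them one at a time."""
    # Bounded queue: producers block when it is full rather than exhausting memory.
    pending = queue.Queue(maxsize=max_pending)
    results = []

    def worker():
        while True:
            item = pending.get()
            if item is None:      # sentinel: no more work
                break
            results.append(handle(item))

    consumer = threading.Thread(target=worker)
    consumer.start()
    for request in burst:         # the upstream spike: everything arrives together
        pending.put(request)
    pending.put(None)
    consumer.join()
    return results
```

The downstream handler only ever sees one request at a time, however quickly the upstream service produces them.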


Hardcoded IPs and Ports

A com­mon sin that most of us have com­mit­ted. It gets us going quick­ly but then makes our lives hell when we need to configure/manage mul­ti­ple envi­ron­ments and deploy­ments.

One solution would be to use a discovery service such as Consul or etcd, and there’s also Netflix’s Eureka.

In this solu­tion, each ser­vice would first con­tact the dis­cov­ery ser­vice to find out where oth­er ser­vices are:


Service A will use that address until Service B can no longer be reached there or some timeout has expired, in which case it’ll ask the discovery service for Service B’s location again.
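A minimal sketch of that lookup-then-cache behaviour might look like the following. `DiscoveryClient`, the `lookup` callable and the `send` callable are hypothetical stand-ins rather than the API of Consul, etcd or Eureka, and the timeout-based refresh is left out for brevity.

```python
class DiscoveryClient:
    """Caches a service address and re-resolves it only after a failed call."""

    def __init__(self, lookup):
        self._lookup = lookup      # hypothetical: callable(service_name) -> address
        self._cache = {}

    def call(self, service, send):
        """`send(address)` performs the request and raises ConnectionError on failure."""
        address = self._cache.get(service) or self._lookup(service)
        self._cache[service] = address
        try:
            return send(address)
        except ConnectionError:
            # Cached address is stale: ask the discovery service again and retry once.
            address = self._lookup(service)
            self._cache[service] = address
            return send(address)
```

The discovery service is only consulted on the first call and after a failure, which is why it sees far less traffic than the services themselves.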

On a side note, the Red-White Push mech­a­nism I worked on a while back works along sim­i­lar lines. It solves a dif­fer­ent prob­lem (routes client apps to cor­re­spond­ing ser­vice clus­ter based on ver­sion) but the idea behind it is the same.



Anoth­er solu­tion is to use a cen­tralised router.


In this solution, services access each other through the router at a known URL, and you can easily change the routes programmatically.
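The routing table at the heart of such a router can be sketched in a few lines. The `Router` class and its method names are assumptions made for illustration; a real router would also proxy the traffic, which is omitted here.

```python
class Router:
    """Maps stable, well-known paths to whichever backend currently serves them."""

    def __init__(self):
        self._routes = {}

    def register(self, prefix, backend):
        self._routes[prefix] = backend

    def deregister(self, prefix):
        self._routes.pop(prefix, None)

    def resolve(self, path):
        # Longest matching prefix wins, so /orders/v2 can override /orders.
        matches = [p for p in self._routes if path.startswith(p)]
        if not matches:
            raise LookupError("no route for " + path)
        return self._routes[max(matches, key=len)]
```

Callers keep using the same path while `register` repoints it at a different backend.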


Both solu­tions require reg­is­tra­tion and dereg­is­tra­tion, and both require high avail­abil­i­ty and scal­a­bil­i­ty.

A dis­cov­ery ser­vice is sim­pler to build and scale, since it doesn’t need to route all traf­fic and so doesn’t need to scale as much. But an appli­ca­tion needs to know about the dis­cov­ery ser­vice in order to talk to it, so it requires more work to inte­grate with.

A router, on the other hand, needs to handle all the traffic in your system, and it incurs an additional round-trip for every call. But it works transparently and is therefore easier to integrate and work with, and it can also be exposed externally.



If one of your ser­vices is under load or mal­func­tion­ing, and all your oth­er ser­vices keep retry­ing their failed calls, then the prob­lem would be com­pound­ed and mag­ni­fied by the addi­tion­al load from these retries.

Eventually, the faltering service will suffer a total outage, and potentially set off a cascading failure throughout your system.

The solu­tion here is to have expo­nen­tial back­off and imple­ment the cir­cuit break­er pat­tern.

Both patterns come high up the list of good patterns in Michael Nygard’s Release It! and have been adopted by many libraries, such as Netflix’s Hystrix and Polly for .NET.
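For illustration, exponential backoff can be sketched as below. This is not how Hystrix or Polly implement it; the function names, the default values and the choice of `ConnectionError` as the retryable failure are assumptions for this example.

```python
import random
import time

def backoff_delays(base=0.1, factor=2.0, cap=10.0, attempts=5):
    """Yield the maximum wait before each retry: base, base*factor, ... capped at `cap`."""
    for attempt in range(attempts):
        yield min(cap, base * factor ** attempt)

def call_with_backoff(fn, sleep=time.sleep, jitter=random.random, **kwargs):
    """Call `fn`, retrying on ConnectionError with exponentially growing, jittered waits."""
    for delay in backoff_delays(**kwargs):
        try:
            return fn()
        except ConnectionError:
            # Random jitter spreads retries out so callers don't all retry in lockstep.
            sleep(delay * jitter())
    return fn()  # final attempt; let any error propagate
```

Each failed attempt waits up to twice as long as the previous one, so a struggling service gets progressively more room to recover.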


You can even use the ser­vice dis­cov­ery lay­er to help prop­a­gate the mes­sage to all your ser­vices that a cir­cuit has been tripped (i.e. a ser­vice is strug­gling).

This introduces an element of eventual consistency (local view of the service’s health vs what the discovery service is saying). Personally, I find it sufficient to let each service manage its own circuit breaker states so long as you have sensible settings for:

  • time­out
  • no. of timed out requests required to trip the cir­cuit
  • amount of time the cir­cuit is bro­ken for before allow­ing one attempt through
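A locally managed circuit breaker built around those three settings could be sketched like this. The class and parameter names are assumptions, the per-request timeout is left to the wrapped call, and the sketch is single-threaded (it makes no attempt to limit concurrent trial calls).

```python
import time

class CircuitOpenError(Exception):
    """Raised instead of calling a service whose circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # timed-out calls needed to trip
        self.reset_after = reset_after              # seconds open before one trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("failing fast")
            # Reset period elapsed: fall through and let this one attempt through.
        try:
            result = fn()   # `fn` is expected to enforce its own request timeout
        except TimeoutError:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()   # trip, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail immediately instead of piling more load onto the struggling service.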


Using a discovery service does have the advantage of allowing you to manually trip the circuit, e.g. in anticipation of a release where you need some downtime.

But, hopefully you won’t need that because you have already invested in automating your deployment and can deploy new versions of your service without downtime ;-)


Debugging Hell

Debug­ging is always a huge issue in micro-ser­vice archi­tec­tures. E.g., a nest­ed ser­vice call fails and you can’t cor­re­late it back to a request com­ing from the user.

One solu­tion is to use cor­re­la­tion IDs.

When a request comes in, you assign the request a correlation ID and pass it on to other services in the HTTP request header. Every service would do the same and pass the correlation ID it receives in the incoming HTTP header in any outgoing requests. Whenever you log a message, be sure to include this correlation ID in the log line.
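The three steps (assign, forward, log) can be sketched with plain dictionaries standing in for HTTP headers. The header name `X-Correlation-ID` and the helper names are assumptions, not a standard, and the lookup here is case-sensitive, unlike real HTTP headers.

```python
import uuid

HEADER = "X-Correlation-ID"  # assumed header name; use whatever your services agree on

def correlation_id(incoming_headers):
    """Reuse the caller's correlation ID, or mint one at the edge of the system."""
    return incoming_headers.get(HEADER) or str(uuid.uuid4())

def outgoing_headers(cid, extra=None):
    """Headers for a downstream call, always carrying the correlation ID."""
    headers = dict(extra or {})
    headers[HEADER] = cid
    return headers

def log_line(cid, message):
    """Prefix every log message with the correlation ID so requests can be traced."""
    return "[%s] %s" % (cid, message)
```

Searching the aggregated logs for one ID then returns every line written on behalf of that user request, across all services.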

It’s a simple pattern, but it’s one that you’ll have to apply in every one of your services, and it’s difficult to roll out transparently across the entire dev team.

I don’t know how it’d look but I imagine you might be able to automate this using an AOP framework like PostSharp. Alternatively, if you’re using some form of scaffolding when you start work on a new service, maybe you can include some mechanism to capture and forward correlation IDs.


Missing Mock Servers

When you have a ser­vice that oth­er teams depend on, each of these teams would have to mock and stub your ser­vice in order to test their own ser­vices.

A good step in the right direc­tion is for you to own a mock ser­vice and pro­vide it to con­sumers. To make inte­gra­tion eas­i­er, you can also pro­vide the client to your ser­vice, and sup­port using the mock ser­vice via con­fig­u­ra­tion. This way, you make it easy for oth­er teams to con­sume your ser­vice, and to test their own ser­vices they just need to con­fig­ure the client to hit the mock ser­vice instead.

You can take it a step fur­ther, by build­ing the mock ser­vice into the client so you don’t even have to main­tain a mock ser­vice any­more. And since it’s all trans­par­ent­ly done in the client, your con­sumers would be none the wis­er.
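As a sketch of what a client with a built-in mock might look like, consider the following. The service, its URL, the `/users/<id>` endpoint and the canned data are all hypothetical.

```python
import json
import urllib.request

class UserServiceClient:
    """Hypothetical client for a 'user service'; `mock=True` never touches the network."""

    CANNED_USERS = {"42": {"id": "42", "name": "Test User"}}

    def __init__(self, base_url="http://users.internal", mock=False):
        self.base_url = base_url
        self.mock = mock

    def get_user(self, user_id):
        if self.mock:
            # Built-in mock: consumers get realistic responses without a running service.
            return self.CANNED_USERS[user_id]
        with urllib.request.urlopen("%s/users/%s" % (self.base_url, user_id)) as resp:
            return json.load(resp)
```

A consuming team would construct the client with `mock=True` in its test configuration and leave the rest of its code unchanged.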

This is a good pat­tern to use if there is a small num­ber of lan­guages you need to sup­port with­in your orga­ni­za­tion.


To help you get start­ed quick­ly, I rec­om­mend look­ing at Mulesoft’s Any­point Plat­form. It comes with an API design­er which allows you to design your API using RAML and there’s a built-in mock ser­vice sup­port that updates auto­mat­i­cal­ly as you save your RAML file.


It’s also a good way to doc­u­ment your API. There’s also a live API por­tal which is:

  • inter­ac­tive — you can try out the end­points and make requests against the mock ser­vice
  • live — it’s auto­mat­i­cal­ly updat­ed as you update your API in RAML
  • good for doc­u­ment­ing your API and shar­ing it with oth­er teams


We have been using it for some time and it has proven to be a use­ful tool to allow us to col­lab­o­rate and iter­a­tive­ly design our APIs before we put our heads down and code. We have been able to iden­ti­fy infor­ma­tion needs not being met and imped­ance mis­match­es ear­ly and reduced the amount of back-and-forth required.


Flying Blind

With a micro-service architecture the need for ongoing operational metrics is greater than ever before; without them you are just flying blind.


There are more than a handful of tools available in this space. There are commercial tools (some with free tiers) such as NewRelic and StackDriver (now integrated into Google App Engine), and AWS also offers CloudWatch as part of its ecosystem. In the open source space, Netflix has been leading the way with Hystrix and something even more exciting.

The gran­u­lar­i­ty of the met­rics is also impor­tant. When your met­rics are mea­sured in 1 minute inter­vals (e.g. Cloud­Watch) then your time to dis­cov­ery of prob­lems could be in the 10–15 min­utes range, which push­es time to recov­er even fur­ther out. This is why Net­flix built its own mon­i­tor­ing sys­tem called Atlas.


Also, you get this explo­sion of tech­nolo­gies from lan­guages, run­times and data­bas­es, which can be very painful from an oper­a­tions point of view.

When you intro­duce con­tain­ers, this prob­lem is com­pound­ed even fur­ther because con­tain­ers are so cheap and each VM can run 100s even 1000s of con­tain­ers, each poten­tial­ly run­ning a dif­fer­ent tech­nol­o­gy stack…

The organizational solution here would be to form a team to build tools that enable developers to manage the system in an entirely automated way. This is exactly what Netflix has done, and a lot of amazing tools have come out of there and were later open sourced.



  • Start bor­ing and extract to ser­vices
  • Under­stand the hid­den schemas
  • Amor­tize traf­fic with queues
  • Decou­ple through dis­cov­ery tools
  • Con­tain fail­ures with cir­cuit break­ers
  • Use mockable clients
  • Build in oper­a­tional met­rics from the begin­ning