There were a couple of micro-services related talks at this year’s edition of CraftConf. The first was by Adrian Trenaman of Gilt, who talked about their journey from a monolithic architecture to micro-services, and from self-managed datacentres to the cloud.
From Monolith to Micro-Services
They started off with a Ruby on Rails monolithic architecture in 2007, and grew to a $400M business within 4 years.
With that success came the scalability challenge, one that they couldn’t meet with their existing architecture.
It was at this point that they started to split up their monolith.
A number of smaller, but still somewhat monolithic services were created – product service, user service, etc. But most of the services were still sharing the same Postgres database.
The cart service was the only one that had its own dedicated datastore, and this proved to be a valuable lesson to them. In order to evolve services independently you need to sever the hidden coupling that occurs when services share the same database.
Adrian also pointed out that, whilst some of the backend services had been split out, the UI applications (JSP pages) were still monoliths. Lots of business logic, such as promos and offers, was hardcoded and hard/slow to change.
Another pervasive issue was that services were all loosely typed – everything was a map. This introduced plenty of confusion around what was in the map, because lots of their developers still hadn’t made the mental transition to working in a statically typed language.
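To illustrate the "everything's a map" problem, here is a minimal sketch of validating a loosely typed payload once, at the service boundary, and passing a typed value around afterwards. The `CartItem` type and its field names are hypothetical and not taken from Gilt's services.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CartItem:
    """Hypothetical typed payload; the field names are illustrative only."""
    sku: str
    quantity: int
    unit_price_cents: int

    def __post_init__(self) -> None:
        # Reject bad data at the boundary instead of deep inside business logic.
        if self.quantity <= 0:
            raise ValueError("quantity must be positive")

    @classmethod
    def from_map(cls, raw: dict) -> "CartItem":
        # A missing key raises KeyError here, in one well-known place.
        return cls(
            sku=str(raw["sku"]),
            quantity=int(raw["quantity"]),
            unit_price_cents=int(raw["unit_price_cents"]),
        )

    def total_cents(self) -> int:
        return self.quantity * self.unit_price_cents


item = CartItem.from_map({"sku": "A1", "quantity": 2, "unit_price_cents": 1500})
print(item.total_cents())  # prints 3000
```

With a bare map, a misspelt key or a missing field only surfaces wherever the value is eventually read; with a typed value, the contract is visible in one place.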
Fast forward to 2015, and they’re now creating micro-services at a much faster rate using primarily Scala and the Play framework.
They have left the original services alone as they’re well understood and haven’t had to change frequently.
The front end application has been broken up into a set of granular, page-specific applications.
There have also been significant cultural changes:
- emergent architecture design rather than top-down
- technology decisions are driven by KPIs where appropriate, or by simple goals where there are no measurable KPIs
- fail fast, and be open about it so that lessons can be shared amongst the organization; they even hold regular meetings to allow people to openly discuss their failures
To the Cloud
Alongside the transition to micro-services, Gilt also migrated to AWS using a hybrid approach via Amazon VPC.
Every team has its own AWS account as well as a budget. However, the organization of teams can change from time to time, and with this setup it’s difficult to move services between AWS accounts.
One thing to note about Gilt’s migration to micro-services is that it took a long time and they have taken an incremental approach.
Adrian explained that this was down to them prioritizing getting to market and extracting real business value over technical evolution.
Micro-services bring a number of benefits:
- it lessens inter-dependencies between teams
- faster code-to-production
- allows lots of initiatives in parallel
- allows different language/framework to be used by each team
- allows graceful degradation of services
- allows code to be easily disposable – easy to innovate, fail and move on; Greg Young also touched on this in his recent talk on optimizing for deletability
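As an illustration of the graceful degradation point in the list above, here is a minimal sketch of falling back to a default when a dependency fails. The `with_fallback` helper and the recommendations service it stands in for are hypothetical, not something described in the talk.

```python
from typing import Callable, List


def with_fallback(primary: Callable[[], List[str]], fallback: List[str]) -> List[str]:
    """Return the primary result, or the fallback if the dependency fails."""
    try:
        return primary()
    except Exception:
        # Degrade instead of failing the whole page; a real system would also log this.
        return fallback


def broken_recommendations() -> List[str]:
    # Stand-in for a call to a (hypothetical) recommendations service that is down.
    raise ConnectionError("recommendation service unavailable")


print(with_fallback(broken_recommendations, fallback=[]))  # prints []
```

The page still renders, just without recommendations, which is only possible because the feature lives behind its own service boundary.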
Challenges with Micro-Services
Adrian listed 7 challenges his team came across in their journey.
They found it hard to maintain staging environments across multiple teams and services. Instead, they have come to believe that testing in production is the best way to go, and it’s therefore necessary to invest in automation tools to aid with doing canary releases.
I once heard Ben Christensen of Netflix talk about the same thing, that Netflix too has come to realize that the only way to test a complex micro-services architecture is to test it in production.
That said, I’m sure both Netflix and Gilt still have basic tests to catch the obvious bugs before they release anything into the wild. But these tests would not sufficiently test the interaction between the services (something Dan North and Jessica Kerr covered in their opening keynote Complexity is Outside the Code).
To reduce the risk involved with testing in production, you should at least have:
- a canary release mechanism to limit the impact of a bad release;
- a way to minimize the time to roll back; and
- a way to minimize the time to discovery of problems by having granular metrics (see ex-Netflix architect Adrian Cockcroft’s talk at Monitorama 14 on this)
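The first two points in the list above can be sketched roughly as follows. This is not Gilt's or Netflix's tooling; the function names, the bucketing scheme and the `tolerance` threshold are all assumptions made for illustration.

```python
import hashlib


def in_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically place a user into one of 100 buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    # The same user always lands in the same bucket, so their experience is stable.
    return int(digest, 16) % 100 < canary_percent


def should_roll_back(canary_errors: int, canary_requests: int,
                     baseline_error_rate: float, tolerance: float = 0.01) -> bool:
    """Roll back when the canary's error rate exceeds baseline plus tolerance."""
    if canary_requests == 0:
        return False  # no evidence yet
    return canary_errors / canary_requests > baseline_error_rate + tolerance


# Route roughly 5% of users to the new release and watch its error rate.
print(in_canary("user-42", canary_percent=5))
print(should_roll_back(canary_errors=5, canary_requests=100, baseline_error_rate=0.01))
```

Hashing the user id keeps the canary population fixed for the duration of the release, which makes the canary's metrics easier to compare against the baseline.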
Who owns the service, and what happens if the person who created the service moves onto something else?
Gilt’s decision is to have teams and departments own the service, rather than individual staff.
Gilt is building tools over Docker to give them elastic and fast provisioning. This is kinda interesting, because there are already a number of such tools/services available.
It’s not clear what is missing from these that is driving Gilt to build their own tools.
They also make sure that the clients are totally decoupled from the service, and are dumb and have zero dependencies.
Audit & Alerting
In order to give your engineers full autonomy in production and still stay compliant, you need good auditing and alerting capabilities.
Gilt built a smart alerting tool called CAVE, which alerts you when system invariants – e.g. number of orders shipped to the US in 5 min interval should be greater than 0 – have been violated.
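To make the idea of alerting on a system invariant concrete, here is a minimal sketch of the "orders shipped in the last 5 minutes should be greater than 0" check. It is not CAVE's API; the function names and the representation of events as plain timestamps are assumptions.

```python
from typing import Iterable


def count_in_window(timestamps: Iterable[float], now: float, window_seconds: float) -> int:
    """Count events whose timestamp falls in (now - window_seconds, now]."""
    return sum(1 for t in timestamps if now - window_seconds < t <= now)


def invariant_violated(timestamps: Iterable[float], now: float,
                       window_seconds: float = 300.0, minimum: int = 1) -> bool:
    """True when fewer than `minimum` events occurred in the window."""
    return count_in_window(timestamps, now, window_seconds) < minimum


# Hypothetical shipment timestamps (seconds); the last one is 350s before `now`.
shipped_to_us = [100.0, 650.0]
print(invariant_violated(shipped_to_us, now=1000.0))  # prints True
```

The alert fires on a business-level expectation rather than on CPU or latency, so it catches failures that leave every individual machine looking healthy.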
Monitoring of micro-services is an interesting topic because once again, the traditional approach to monitoring – alert me when CPU/network or latencies go above threshold – is no longer sufficient because of the complex effects of causality that runs through your inter-dependent services.
As Richard Rodger pointed out in his Measuring Micro-Services talk at Codemotion Rome this year, it’s far more effective to identify and monitor emergent properties instead.
I didn’t get a clear sense of how it works, but Adrian mentioned that Gilt has looked to lambda architecture for their critical path code, and are doing pre-computation and real-time push updates to reduce the number of IO calls that occur between services.
Since they have many databases, it means their data analysts would need access to all the databases in order to run their reports. To simplify things, they put their analytics data into a queue which is then written to S3.
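The queue-then-S3 pipeline might look something like the sketch below. Adrian did not describe the implementation, so everything here is an assumption: the in-process `queue.Queue` stands in for whatever queue Gilt uses, and `write_batch` is a hypothetical callback where an S3 upload would go.

```python
import json
import queue
from typing import Callable, List


def drain_to_batches(events: "queue.Queue[dict]", batch_size: int,
                     write_batch: Callable[[str], None]) -> int:
    """Drain the queue, writing newline-delimited JSON batches; return the batch count."""
    batches = 0
    buffer: List[str] = []
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            break
        buffer.append(json.dumps(event, sort_keys=True))
        if len(buffer) == batch_size:
            write_batch("\n".join(buffer))
            buffer.clear()
            batches += 1
    if buffer:  # flush the final partial batch
        write_batch("\n".join(buffer))
        batches += 1
    return batches


# Services publish events; a single writer turns them into batch files.
pending: "queue.Queue[dict]" = queue.Queue()
for n in range(5):
    pending.put({"event": "page_view", "n": n})

written: List[str] = []  # stand-in for objects uploaded to S3
print(drain_to_batches(pending, batch_size=2, write_batch=written.append))  # prints 3
```

The analysts then only need read access to one place, rather than to every service's database.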
Whilst Adrian didn’t divulge what technology Gilt is using for analytics, there are quite a number of tools/services available on the market.
Based on our experience at Gamesys Social, where we have been using Google BigQuery for a number of years, I strongly recommend that you take it into consideration. It’s a managed service that allows you to run SQL-like, ad-hoc queries on petabyte-scale datasets and get results back in seconds.
At this point we have around 100TB of analytics data at Gamesys Social, and our data analysts are able to run their queries every day and analyse tens of TBs of data without any help from the developers.
BigQuery is just one of many interesting technologies that we use at Gamesys Social, so if you’re interested in working with us, we’re hiring for a functional programmer right now.
- CraftConf 15 takeaways
- Greg Young – the art of destroying software
- Takeaways from Adam Tornhill – Code as Crime Scene at QCON London
- Canary Release
- Adrian Cockcroft @ Monitorama PDX 2014
- A false choice : Microservices or Monoliths
- Microservices – not a free lunch!
- Richard Rodger – Measuring micro-services