QCon London 2015–Takeaways from “Scaling Uber’s realtime market platform”

On day three of QCon London, we were treated to some really insightful stories from the likes of Google, Atlas and Spotify. And for the first time in a while, Uber talked publicly about what they've been up to.

 

The challenge for Uber’s platform is that both supply (Uber drivers) and demand (riders) are dynamic and matching them efficiently in real-time is not easy.

Uber's services are written in a mixture of Node.js, Python, Java and Go, whilst a whole mix of databases is used – PostgreSQL, Redis, MySQL and Riak.

 

From a high level, they have a number of backend components:

image

They recently rewrote the dispatch system, despite Joel Spolsky advising against complete rewrites. To that, Matt Ranney said the current dispatch system has a number of built-in assumptions that are so deep-rooted that a revolutionary step is more efficient and productive:

  • it assumes one rider per vehicle, which makes vehicle pooling hard to support
  • the idea of moving people is baked into the domain and code, making it hard to move into new markets (Matt didn't elaborate, but I assume transportation of goods might be one such market)
  • it shards by city, which is not a sustainable approach as Uber moves into more and more cities
  • it has multiple points of failure that can bring everything down

The dispatch system was hard to fix incrementally, and since everything runs as a service, it was feasible to replace the existing system outright.

 

The new dispatch system looks like this:

image

where DISCO stands for DISpatCh Optimization service.

 

For geo-indexing, the dispatch service needs to know not only the physical distance between supply and demand, but also the ETA based on historical travel data. The ETA calculation also needs to handle a number of special cases, including airports, where demand needs to be queued (i.e. first come, first served) to provide a fair service to everyone waiting at the airport.

The old system could only track available supply (i.e. cars with no riders), which meant there were missed optimization opportunities such as the following:

image

where the demand (D1) can be met by an in-flight supply (S2) earlier than an available supply (S1).

DISCO is able to consider supply that is currently in-flight, project its route into the future and feed that into the matching process; it also supports vehicle pooling (if both D1 and D2 agree to share a vehicle):

image
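To make that concrete, here's a minimal sketch of the idea (my own illustration, not Uber's code): an in-flight car's pickup ETA is its remaining trip time plus the drive from its drop-off point to the new pickup, and the match simply picks the lowest ETA. The `Car` type and the straight-line `eta` helper are stand-ins – the real system uses historical travel data.

```python
from dataclasses import dataclass

@dataclass
class Car:
    location: tuple = None        # current position (available cars)
    dropoff: tuple = None         # drop-off point (in-flight cars)
    remaining_trip_eta: float = 0.0

def eta(a, b):
    # naive straight-line ETA; a stand-in for Uber's historical-data model
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def best_supply(pickup, available, in_flight):
    candidates = [(eta(car.location, pickup), car) for car in available]
    for car in in_flight:
        # finish the current trip, then drive from the drop-off to the pickup
        candidates.append(
            (car.remaining_trip_eta + eta(car.dropoff, pickup), car))
    return min(candidates, key=lambda c: c[0])[1]

free = [Car(location=(0, 5))]
busy = [Car(dropoff=(0, 1), remaining_trip_eta=2.0)]
print(best_supply((0, 0), free, busy))  # the in-flight car wins: 2+1 < 5
```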

 

Uber breaks up the earth into tiny cells (like in Google Maps) and each is given a unique ID. Using the Google S2 library, you can identify cells that will completely cover a shape you’ve supplied:

image

Uber uses these cell IDs as the sharding key to update supply, and when DISCO needs to match supply to demand, it can use the cells covering the search area to find supply in the matching cells.
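As a rough sketch of what that covering step looks like, here's an example using s2sphere, a Python port of the S2 library (the coordinates and cell level are made up for illustration):

```python
import s2sphere  # pip install s2sphere – a Python port of Google's S2

# a small rectangle around central London (made-up coordinates)
p1 = s2sphere.LatLng.from_degrees(51.50, -0.15)
p2 = s2sphere.LatLng.from_degrees(51.52, -0.11)
region = s2sphere.LatLngRect.from_point_pair(p1, p2)

coverer = s2sphere.RegionCoverer()
coverer.min_level = 12  # fix the cell level, so every cell is the same size
coverer.max_level = 12

# cell IDs that completely cover the region; these double as shard keys
for cell_id in coverer.get_covering(region):
    print(cell_id.id())
```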

A limitation of this approach is that the cells have a fixed size, so one would imagine the update activity is not well spread out through the key space. It's natural for supply and demand to be concentrated around city centres where the nightlife is – central London being a prime example.

Nonetheless, the goal of the routing is to:

  • reduce wait time for riders
  • reduce extra driving for drivers
  • lower overall ETAs

 

In order to scale their services, Uber went with an approach of building stateful services using Node.js. They also introduced ringpop, a library for application-layer sharding that maintains a consistent hash ring and uses a gossip-based membership protocol from the SWIM paper. Ringpop runs over TChannel, Uber's own multiplexing and framing RPC protocol.

The goal of these projects is to provide:

  • performance across different languages
  • high performance request forwarding
  • proper pipelining
  • support for checksums and tracing
  • encapsulation

 

At a high level, any node in a cluster is able to handle any request; if the data is not available on that node, the request is forwarded to the correct node.

image

This essentially removes the need to manage consistent hashing on the client.
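A toy version of that lookup-and-forward logic might look like the sketch below. It only illustrates the consistent hashing idea – the real ringpop also handles membership gossip, failure detection and handoff – and the helper names here are hypothetical:

```python
import hashlib
from bisect import bisect

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent hash ring: each node owns the keys that fall between
    its points on the ring and the previous node's points."""
    def __init__(self, nodes, points_per_node=100):
        self._ring = sorted(
            (_hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(points_per_node))
        self._hashes = [h for h, _ in self._ring]

    def owner(self, key: str) -> str:
        idx = bisect(self._hashes, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

def handle(ring, me, key, request):
    owner = ring.owner(key)
    if owner == me:
        return process_locally(request)   # we own this key's state
    return forward_to(owner, request)     # proxy to the owning node

# stand-ins for real local handling and RPC forwarding
def process_locally(request):
    return f"handled: {request}"

def forward_to(node, request):
    return f"forwarded to {node}: {request}"
```

With this in place a client can send any request to any node, which is exactly what makes client-side consistent hashing unnecessary.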

 

For Uber, availability is of paramount importance, as the cost of switching to a competitor is low. So they decided to:

  • make everything retryable, which means making every operation idempotent (something I suspect can be challenging in practice – see the sketch after these lists)
  • make everything killable (chaos monkey style) – ringpop detects failed nodes and removes them from the cluster
  • crash only, no complicated graceful shutdowns
  • break things up into small pieces

which in turn required some cultural changes:

  • no pairs (I think he was talking about read-replica setups where there's a potentially complicated failover process)
  • kill everything, even databases
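As for making everything retryable, a common way to make an operation idempotent is an idempotency key, so a retried request replays the stored result rather than repeating the side effect. A minimal sketch of the idea (my illustration, not Uber's implementation – the in-memory dict stands in for durable storage):

```python
# Retries are safe because replaying a request with the same idempotency
# key returns the stored result instead of re-running the side effect.
_completed: dict = {}

def handle(idempotency_key, operation, *args):
    if idempotency_key in _completed:      # a retry of a finished request
        return _completed[idempotency_key]
    result = operation(*args)              # the side-effecting work
    _completed[idempotency_key] = result
    return result

# usage: the client picks the key, so a timed-out request can be retried
# without, say, charging the rider twice
handle("charge-ride-42", lambda: "charged")
handle("charge-ride-42", lambda: "charged")  # no-op, returns stored result
```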

 

Since services talk to each other via load balancers, you need to be able to kill load balancers too; instead, the load-balancing logic is put in the service client (similar to Netflix Ribbon, from what I gathered). I didn't buy Matt's rationale here, since it's possible to make load balancers highly available too, but he also mentioned the ability to do smarter routing – for example, choosing the data centre with better latency in a globally deployed infrastructure – which makes more sense.

 

Matt then went on to talk about some of the challenges with large fanout services, and in particular, the challenge of getting predictable latency when a large number of services is involved: if a request fans out to 100 services and each has just a 1% chance of responding slowly, then about 63% of requests (1 − 0.99^100) will hit at least one slow service.

image

He also referenced Google fellow Jeff Dean's presentation Achieving Rapid Response Times in Large Online Services, which is a great read – slides 39-70 describe the approach Uber has adopted.

image

In the example above, the following happened:

  1. service A sends req 1 to service B (1), informing it that the request will also be sent to service B (2)
  2. 5ms later, service A indeed sends the same request to service B (2), where it goes into the backlog; service B (2) also finds out that service B (1) got the same request
  3. meanwhile, service B (1) starts to process the request and sends a signal to service B (2) to cancel req 1 from its backlog
  4. service B (1) completes the request and replies to service A

If service B (1) had been under load and couldn't process the request fast enough, service B (2) would have processed the request and replied to service A – unless, of course, service B (2) was also under load.
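Here's a minimal client-side sketch of the simpler "hedged request" flavour of this idea – send to a backup only after a short delay and cancel the loser. Uber's variant adds the cross-server cancellation described above, which needs server-side support; `primary` and `backup` are assumed async callables wrapping the actual RPCs:

```python
import asyncio

async def hedged_request(primary, backup, request, hedge_after=0.005):
    first = asyncio.create_task(primary(request))
    try:
        # happy path: the primary answers within the hedge delay
        return await asyncio.wait_for(asyncio.shield(first), hedge_after)
    except asyncio.TimeoutError:
        # primary is slow: send the backup request, take the first reply
        second = asyncio.create_task(backup(request))
        done, pending = await asyncio.wait(
            {first, second}, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()              # drop the slower replica's work
        return done.pop().result()
```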

In case you're worried about the extra requests that would need to be processed with this approach, Jeff Dean's slides (above) have the following results to show:

image

A more naive approach would be to always send the request to both service B (1) and service B (2) and just ignore the slower response. Based on a previous talk I watched, this is (or at least was) what Netflix does.

 

Finally, Matt touched on how Uber deals with datacentre outages. Their approach is quite simple and effective:

image

In this example, when the mobile app sends a location update, the service will respond with an encrypted state digest. When datacentre 1 fails:

  1. the app sends its location updates to datacentre 2 instead
  2. since datacentre 2 doesn't have the user's state, it requests the last state digest the app received
  3. the app then sends the encrypted state digest to datacentre 2
  4. datacentre 2 decrypts the digest and initializes the user state
  5. now the app can converse with datacentre 2 normally
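A sketch of what that digest handshake could look like (entirely hypothetical – I'm using Fernet from Python's cryptography package as a stand-in cipher, with the key shared across datacentres):

```python
import json
from cryptography.fernet import Fernet  # pip install cryptography

SHARED_KEY = Fernet.generate_key()  # in practice, shared across datacentres
cipher = Fernet(SHARED_KEY)

def on_location_update(state, update):
    """Apply the update and reply with an encrypted digest of the new state;
    the app keeps only the latest digest it has received."""
    state = {**state, "location": update}
    digest = cipher.encrypt(json.dumps(state).encode())
    return state, digest

def restore_from_digest(digest):
    """A failover datacentre rebuilds the user's state from the digest the
    app replays to it."""
    return json.loads(cipher.decrypt(digest))

state, digest = on_location_update({}, (51.51, -0.13))  # datacentre 1
print(restore_from_digest(digest))                      # datacentre 2
```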

 

Links

Slides for the talk

Ringpop project page

TChannel project page

SWIM: Scalable Weakly-consistent Infection-style process group Membership protocol

Jeff Dean – Achieving rapid response times in large online services

QCon London 2015–Takeaways from “Service Architectures at Scale, Lessons from Google and eBay”

Day three of QCon London was a treat; with full-day tracks on architecture and microservices, deciding what to see during the day was a nice challenge in itself.

My favourite talk of the day was Randy Shoup’s Service Architectures at Scale, Lessons from Google and eBay.

 

Randy kicked off the session by identifying a common trend in the architecture evolution at some of the biggest internet companies.

image

An ecosystem of microservices also differs from its monolithic counterpart in that it tends to organically form many layers of dependencies, rather than falling into strict tiers in a hierarchy.

 

At Google, there has never been a top-down design approach to building systems; rather, it's an evolutionary process using natural selection – services survive by justifying their existence through usage, or they are deprecated. What appears to be clean layering by design turns out to be an emergent property of this approach.

image

Services are built from the bottom up, yet you can still end up with a clean, clear separation of concerns.

 

At Google, there are no “architect” roles, nor is there a central approval process for technology decisions. Most technology decisions are made within the team, so they’re empowered to make the decisions that are best for them and their service.

This is in direct contrast to how eBay operated early on, where there was an architecture review board which acted as a central approval body.

 

Even without the presence of a centralized control body, Google proved that it's still possible to achieve standardization across the organization.

Within Google, communication methods (e.g. network protocols, data formats, structured ways of expressing interfaces, etc.) as well as common infrastructure (source control, monitoring, alerting, etc.) are standardized by encouragement rather than enforcement.

image

By the sound of it, best practices and standardization are achieved through a consensus-based approach within teams and then spread throughout the organization through:

  • encapsulation in shared/reusable libraries;
  • support for these standards in underlying services;
  • code reviews (word of mouth);
  • and most importantly the ability to search all of Google’s code to find existing examples

One drawback with following existing examples is the possibility of random anchoring – someone at one point made a decision to do things one way and then that becomes the anchor for everyone else who finds that example thereafter.

image

image

Whilst the surface areas of services are standardized, the internals of the services are not, leaving developers to choose:

  • programming language (C++, Go, Python or Java)
  • frameworks
  • persistence mechanisms

image

 

Rather than deciding on the split of microservices up front, capabilities tend to be implemented in existing services first to solve specific problems.

If a capability proves successful, it's extracted out and generalized as a service of its own, with a new team formed around it. Many popular services today started life this way – Gmail, App Engine and BigTable to name a few.

 

On the other hand, a failed service (e.g. Google Wave) will be deprecated but reusable technology would be repurposed and the people in the team would be redeployed to other teams.

 

This is a fairly self-explanatory slide and an apt description of what a microservice should look like.

image

 

As the owner of a service, your primary focus should be the needs of your clients, and meeting those needs at minimum cost and effort. This includes leveraging common tools, infrastructure and existing services, as well as automating as much as possible.

The service owner should have end-to-end ownership, and the mantra should be “You build it, you run it”.

The teams should have autonomy to choose the right technology and be held responsible for the results of those choices.

 

Your service should have a bounded context: its primary focus should be on its clients and the services that depend on it.

You should not have to worry about the complete ecosystem or the underlying infrastructure, and this reduced cognitive load also means the teams can be extremely small (usually 3-5 people) and nimble. Having a small team also bounds the amount of complexity that can be created (i.e. use Conway’s law to your advantage).

 

Treat service-to-service relationships as vendor-client relationships, with clear ownership and division of responsibility.

To give people the right incentives, you should charge for usage of the service; this aligns the economic incentives for both sides to optimize for efficiency.

With a vendor-client relationship (SLAs and all) you're incentivized to reduce the risk that comes with making changes, which pushes you towards making small incremental changes and employing solid development practices (code reviews, automated tests, etc.).

 

You should never break your clients’ code, hence it’s important to keep backward/forward compatibility of interfaces.

You should provide an explicit deprecation policy and give your clients strong incentives to move off old versions.

 

Services at scale are highly exposed to performance variability.

image

Tail latencies (e.g. the 95th and 99th percentiles) are much more important than average latencies: it's easier for your clients to program against consistent performance.
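A toy illustration of why: a handful of slow responses barely moves the mean, but they dominate the high percentiles, which is what clients actually experience on every large fanout:

```python
import statistics

latencies = [10] * 990 + [500] * 10      # ms; just 1% of requests are slow
latencies.sort()

mean = statistics.mean(latencies)             # ~14.9 ms – looks healthy
p99 = latencies[int(0.99 * len(latencies))]   # 500 ms – what users feel
print(mean, p99)
```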

 

Services at scale are also highly exposed to failures.

(disruptions are 10x more likely from human errors than software/hardware failures)

You should have resilience in depth with redundancy for hardware failures, and have capability for incremental deployments:

  • Canary releases
  • Staged rollouts
  • Rapid rollbacks

eBay also use ‘feature flags’ to decouple code deployment from feature deployment.
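A feature flag can be as little as a runtime check around the new code path – a minimal sketch (the flag store and function names are hypothetical):

```python
# Flags live in config/runtime storage, so enabling a feature is a config
# change rather than a redeploy, and a bad feature can be switched off fast.
FLAGS = {"new_checkout": False}

def legacy_checkout(cart):
    return f"legacy checkout for {cart}"

def new_checkout(cart):
    return f"new checkout for {cart}"

def checkout(cart):
    if FLAGS["new_checkout"]:
        return new_checkout(cart)
    return legacy_checkout(cart)
```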

And of course, monitoring…

image

 

Finally, here are some anti-patterns to look out for:

Mega-service – a service that does too much, a la mini-monolith

Shared persistence – breaks encapsulation, encourages 'backdoor' violations and can lead to hidden coupling between services (think integration via databases…)

 

Gamesys Social

As I sat through Randy's session, I was surprised and proud to find that we have employed many similar practices in my team (the backend team at Gamesys Social) – a seal of approval, if you like:

  • not having architect roles, instead using a consensus-based approach to make technology decisions
  • standardization via encouragement
  • allowing you to experiment with approaches/tech and not penalizing you when things don't pan out (the learning is also a valuable output of the experiment)
  • organic growth of microservices (proving them in existing services first before splitting them out and generalizing)
  • place high value on automation
  • autonomy to the team, and DevOps philosophy of “you build it, you run it”
  • deployment practices – canary release, staged rollouts, use of feature flags and our twist on the blue-green deployment

 

I’m currently looking for some functional programmers to join the team, so if this sounds like the sort of environment you would like to work in, then have a look at our job spec and apply!


 

Links

Slides for the talk

We’re hiring a Functional Programmer!