QCon London 2015 – Takeaways from “Scaling Uber’s realtime market platform”

On day three of QCon London, we were treated to some really insightful stories from the likes of Google, Atlas and Spotify. And for the first time in a while, Uber talked publicly about what they’ve been up to.

 

The challenge for Uber’s platform is that both supply (Uber drivers) and demand (riders) are dynamic, and matching them efficiently in real time is not easy.

Uber’s services are written in a mixture of Node.js, Python, Java and Go, whilst a whole mix of databases are used – PostgreSQL, Redis, MySQL and Riak.

 

From a high level, they have a number of backend components:

[image: Uber’s high-level backend components]

They recently rewrote the Dispatch system despite Joel Spolsky advising against complete rewrites. To that, Matt Ranney said there are a number of built-in assumptions in the existing dispatch system that are so deep-rooted that a revolutionary step is more efficient and productive:

  • it assumes one rider per vehicle, which makes it hard to support vehicle pooling
  • the idea of moving people is baked into the domain and code, making it hard to move into new markets (Matt didn’t elaborate, but I assume transportation of goods might be one such market)
  • sharding by city, which is not a sustainable approach as Uber moves into more and more cities
  • multiple points of failure that can bring everything down

The dispatch system was hard to fix incrementally, and since everything runs as a service, it was feasible to replace the existing system outright.

 

The new dispatch system looks like this:

[image: the new dispatch system]

where DISCO stands for DISpatCh Optimization service.

 

For geo-indexing, the dispatch service needs to know not only the physical distance between supply and demand, but also the ETA based on historical travel data. The ETA calculation also needs to handle a number of special cases, including airports, where demand needs to be queued (i.e. first come, first served) to provide a fair service to everyone waiting there.

The old system could only track available supply (i.e. cars with no riders), which meant missed optimization opportunities such as the following:

[image: a missed optimization opportunity]

where the demand (D1) can be met by an in-flight supply (S2) earlier than by an available supply (S1).

DISCO is able to consider supply that is currently in-flight, project its route into the future and take that into account in the matching process, and it supports vehicle pooling (if both D1 and D2 agree to share a vehicle):

[image: DISCO matching with in-flight supply and vehicle pooling]
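To make that concrete, here’s a minimal Go sketch of matching a demand to the supply with the lowest projected ETA, whether or not that supply currently has a rider on board. The types and numbers are hypothetical, not Uber’s data model:

```go
package main

import "fmt"

// Supply is a hypothetical vehicle; InFlight means it currently has a rider.
type Supply struct {
	ID             string
	InFlight       bool
	MinutesToFree  float64 // 0 if already available, otherwise time until drop-off
	MinutesToRider float64 // drive time from the (projected) drop-off point to the rider
}

// projectedETA is how long until this supply could pick the rider up: for an
// in-flight supply that is the remaining trip plus the onward drive.
func projectedETA(s Supply) float64 {
	return s.MinutesToFree + s.MinutesToRider
}

// bestMatch picks the supply with the lowest projected ETA, in-flight or not.
func bestMatch(supplies []Supply) (Supply, bool) {
	var best Supply
	found := false
	for _, s := range supplies {
		if !found || projectedETA(s) < projectedETA(best) {
			best, found = s, true
		}
	}
	return best, found
}

func main() {
	supplies := []Supply{
		{ID: "S1", InFlight: false, MinutesToFree: 0, MinutesToRider: 8},
		{ID: "S2", InFlight: true, MinutesToFree: 2, MinutesToRider: 3},
	}
	if s, ok := bestMatch(supplies); ok {
		fmt.Println("match:", s.ID) // S2: a nearby drop-off beats the distant free car
	}
}
```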

 

Uber breaks up the earth into tiny cells (like in Google Maps) and each is given a unique ID. Using the Google S2 library, you can identify cells that will completely cover a shape you’ve supplied:

[image: S2 cells covering a shape]

Uber uses these cell IDs as the sharding key for supply updates, and when DISCO needs to match supply to demand, it can use that information to find the supply in the matching cells.
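Here’s a small Go example of the idea using the open-source golang/geo/s2 library. The cell level, search radius and coordinates are illustrative guesses on my part, not Uber’s settings:

```go
package main

import (
	"fmt"

	"github.com/golang/geo/s1"
	"github.com/golang/geo/s2"
)

func main() {
	// Rider's location (central London, picked arbitrarily).
	rider := s2.LatLngFromDegrees(51.5074, -0.1278)

	// A spherical cap of roughly 1 km around the rider; S2 angles are in
	// radians, so divide by the Earth's radius (~6371 km).
	region := s2.CapFromCenterAngle(s2.PointFromLatLng(rider), s1.Angle(1.0/6371.0))

	// Cover the cap with fixed-level cells; level 12 cells are a few km² each.
	coverer := &s2.RegionCoverer{MinLevel: 12, MaxLevel: 12, MaxCells: 8}
	for _, cellID := range coverer.Covering(region) {
		// Each covering cell ID is a shard key under which nearby supply
		// location updates would have been stored.
		fmt.Println("look up supply in cell:", cellID.ToToken())
	}

	// Conversely, a single supply location update maps to exactly one cell ID.
	supplyCell := s2.CellIDFromLatLng(s2.LatLngFromDegrees(51.51, -0.12)).Parent(12)
	fmt.Println("supply update shards to cell:", supplyCell.ToToken())
}
```

The nice property is that both sides share the same key space: a supply update is written under a single cell ID, while a demand query just reads the handful of cells covering its search radius.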

A limitation with this approach is that the cells have a fixed size, so one would imagine the update activity is not well spread out through the key space. It’s natural for supply and demand to be concentrated around city centres where the nightlife is – central London being a prime example.


Nonetheless, the goal of the routing is to:

  • reduce wait time for riders
  • reduce extra driving for drivers
  • lower overall ETAs

 

In order to scale their services, Uber went with an approach of building stateful services using Node.js. In addition, they introduced ringpop, a library based on the SWIM paper that gives these services cluster membership and consistent hashing. Ringpop runs on Uber’s own TChannel multiplexing and framing RPC protocol.

The goal of these projects is to provide:

  • performance across different languages
  • high performance request forwarding
  • proper pipelining
  • support for checksums and tracing
  • encapsulation

 

At a high level, any node in the cluster is able to handle any request, and if the data is not available on that node then the request is forwarded to the correct node.

[image: a node forwarding a request to the node that owns the data]

This essentially removes the need to manage consistent hashing on the client.
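The pattern can be sketched in Go. This is a toy illustration of handle-or-forward over a consistent hash ring, not ringpop’s actual API:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a toy consistent-hash ring with one position per node. Real
// implementations such as ringpop use many virtual nodes and SWIM-based
// membership rather than a static node list.
type Ring struct {
	nodes []string
}

func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// Owner returns the node responsible for a key: the first node whose hash is
// greater than or equal to the key's hash, wrapping around the ring.
func (r *Ring) Owner(key string) string {
	sorted := append([]string(nil), r.nodes...)
	sort.Slice(sorted, func(i, j int) bool { return hashOf(sorted[i]) < hashOf(sorted[j]) })
	kh := hashOf(key)
	for _, n := range sorted {
		if hashOf(n) >= kh {
			return n
		}
	}
	return sorted[0] // wrapped around the ring
}

// HandleOrForward processes the request locally if this node owns the key,
// otherwise it reports which node the request should be forwarded to.
func HandleOrForward(r *Ring, self, key string) string {
	if owner := r.Owner(key); owner != self {
		return "forwarded to " + owner
	}
	return "handled locally by " + self
}

func main() {
	ring := &Ring{nodes: []string{"node-a", "node-b", "node-c"}}
	fmt.Println(HandleOrForward(ring, "node-a", "rider-42"))
}
```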

 

For Uber, availability is of paramount importance, as the cost of switching to a competitor is low. So they decided to:

  • make everything retryable, which means making every operation idempotent (something which I suspect can be challenging in practice; see the sketch after these lists)
  • make everything killable (chaos monkey style); ringpop detects failed nodes and removes them from the cluster
  • crash only, no complicated graceful shutdowns
  • break things up into small pieces

which in turn required some cultural changes:

  • no pairs (I think he was talking about read-replica setups where there’s a potentially complicated failover process)
  • kill everything, even databases
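On the idempotency point, a common way to make a non-idempotent operation safe to retry is to deduplicate on a caller-supplied request ID. Here’s a minimal Go sketch of that idea (not Uber’s implementation):

```go
package main

import (
	"fmt"
	"sync"
)

// Deduper remembers the result of each request ID so a retried request
// returns the original result instead of being applied twice.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]string
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]string)}
}

// Do runs op at most once per request ID; retries get the cached result.
func (d *Deduper) Do(requestID string, op func() string) string {
	d.mu.Lock()
	defer d.mu.Unlock()
	if result, ok := d.seen[requestID]; ok {
		return result // a retry: nothing gets applied twice
	}
	result := op()
	d.seen[requestID] = result
	return result
}

func main() {
	d := NewDeduper()
	charge := func() string { return "charged rider once" }
	fmt.Println(d.Do("req-123", charge))
	fmt.Println(d.Do("req-123", charge)) // safe retry: same result, no double charge
}
```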

 

Since services talk to each other via load balancers, you would need to be able to kill load balancers too, so instead the load balancing logic is put in the service client (similar to Netflix Ribbon, from what I gathered). I didn’t buy Matt’s rationale here, since it’s possible to make load balancers highly available too, but then he also mentioned the ability to do smarter routing – choosing the datacentre with better latency in a globally deployed infrastructure, for example – which makes more sense.
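As a toy illustration of that kind of smarter routing, a client-side balancer might simply route each request to whichever datacentre has the lowest recently observed latency (the names and numbers here are made up):

```go
package main

import "fmt"

// pickDatacentre routes to whichever datacentre has the lowest recently
// observed latency; a client-side balancer can make this decision per request.
func pickDatacentre(latenciesMs map[string]float64) string {
	best, bestLatency := "", 0.0
	for dc, l := range latenciesMs {
		if best == "" || l < bestLatency {
			best, bestLatency = dc, l
		}
	}
	return best
}

func main() {
	observed := map[string]float64{"dc-eu-west": 12.5, "dc-us-east": 84.0}
	fmt.Println("route to:", pickDatacentre(observed)) // dc-eu-west
}
```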

 

Matt then went on to talk about some of the challenges with large fanout services, and in particular the challenge of getting predictable latency when a large number of services are involved.

[image: latency challenges with large fanout services]

He also referenced Google Fellow Jeff Dean’s presentation Achieving Rapid Response Times in Large Online Services, which is a great read; slides 39–70 describe the approach Uber has adopted.

[image: backup request with cross-server cancellation]

In the example above, the following happened:

  1. service A sends req 1 to service B (1), informing it that the request will also be sent to service B (2)
  2. 5ms later, service A indeed sends the same request to service B (2), where it goes into its backlog; service B (2) also finds out that service B (1) got the same request
  3. meanwhile, service B (1) starts to process the request and sends a signal to service B (2) to cancel req 1 from its backlog
  4. service B (1) completes the request and replies to service A

If service B (1) was under load and couldn’t process the request fast enough, then service B (2) would have processed the request and replied to service A instead, unless of course service B (2) is also under load.
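Here’s a simplified Go sketch of the idea. It only shows the hedged-request part (send to the primary and, if no reply arrives within the hedge delay, also send to the backup and take whichever reply comes first) and leaves out the cross-server cancellation described above; the 5ms delay and the latencies are made up:

```go
package main

import (
	"fmt"
	"time"
)

// call simulates a request to one replica of service B.
func call(replica string, latency time.Duration, out chan<- string) {
	time.Sleep(latency)
	out <- "reply from " + replica
}

// hedged sends the request to the primary replica and, if no reply arrives
// within hedgeDelay, also sends it to the backup; the first reply wins.
func hedged(hedgeDelay time.Duration) string {
	replies := make(chan string, 2)               // buffered so the losing goroutine can't block
	go call("B(1)", 20*time.Millisecond, replies) // primary happens to be slow
	select {
	case r := <-replies:
		return r // the primary replied before the hedge fired
	case <-time.After(hedgeDelay):
		go call("B(2)", 3*time.Millisecond, replies) // backup request
		return <-replies
	}
}

func main() {
	fmt.Println(hedged(5 * time.Millisecond)) // with these latencies, B(2) wins
}
```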

In case you’re worried about the extra requests that would need to be processed with this approach, Jeff Dean’s presentation (above) has the following results to show:

[image: measured impact of backup requests from Jeff Dean’s presentation]

A more naive approach would be to always send the request to both service B (1) and service B (2) and just ignore the slower response. Based on a previous talk I watched, this is (or at least was) what Netflix does.

 

Finally, Matt touched on how Uber deals with datacentre outages. Their approach is quite simple and effective:

[image: datacentre failover flow]

In this example, when the mobile app sends a location update, the service will respond with an encrypted state digest. When datacentre 1 fails:

  1. the app will send its location updates to datacentre 2 instead
  2. since datacentre 2 doesn’t have the user’s state, it requests the last state digest the app has received
  3. the app then sends the encrypted state digest to datacentre 2
  4. datacentre 2 decrypts the digest and initializes the user state
  5. now the app can converse with datacentre 2 normally
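Here’s a rough Go sketch of that handshake from the app’s point of view; the types, the shape of the digest and the “encryption” are just stand-ins for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// Datacentre is a hypothetical stand-in for one of Uber's datacentres.
type Datacentre struct {
	name   string
	up     bool
	states map[string]string // userID -> trip state, normally built from updates
}

// UpdateLocation applies a location update and returns the latest encrypted
// state digest, which the app keeps but cannot read.
func (dc *Datacentre) UpdateLocation(userID, location string) (string, error) {
	if !dc.up {
		return "", errors.New(dc.name + " is down")
	}
	dc.states[userID] = location
	return "encrypted(" + location + ")", nil
}

// Restore initialises the user's state from a digest that was issued by
// another datacentre (the real system would decrypt and unpack it here).
func (dc *Datacentre) Restore(userID, digest string) {
	dc.states[userID] = digest
}

func main() {
	dc1 := &Datacentre{name: "dc1", up: true, states: map[string]string{}}
	dc2 := &Datacentre{name: "dc2", up: true, states: map[string]string{}}

	// Normal operation: the app talks to dc1 and keeps the returned digest.
	lastDigest, _ := dc1.UpdateLocation("rider-1", "51.5074,-0.1278")

	// dc1 fails; the next update is refused, so the app fails over to dc2,
	// replays the saved digest, and carries on as normal.
	dc1.up = false
	if _, err := dc1.UpdateLocation("rider-1", "51.51,-0.12"); err != nil {
		dc2.Restore("rider-1", lastDigest)
		newDigest, _ := dc2.UpdateLocation("rider-1", "51.51,-0.12")
		fmt.Println("failed over to dc2, new digest:", newDigest)
	}
}
```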

 

Links

Slides for the talk

Ringpop project page

TChannel project page

SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol

Jeff Dean – Achieving rapid response times in large online services