Yubl’s road to Serverless architecture — overview

part 1: overview <- you’re here

part 2: testing and CI/CD

part 3: ops

part 4: building a scalable notification system

part 5: building a better recommendation system


Since Yubl’s closure, quite a few people have asked about the serverless architecture we ended up with and some of the things we learnt along the way.

As such, this is the first of a series of posts in which I’ll share some of the lessons we learnt. However, bear in mind the pace of change in this space: some of the challenges/problems we encountered might have been solved by the time you read this.

ps. many aspects of this series are already covered in a talk I gave on AWS Lambda at Leetspeak this year; you can find the slides and recording of the talk here.

From A Monolithic Beginning

Back when I joined Yubl in April I inherited a monolithic Node.js backend running on EC2 instances, with MongoLab (hosted MongoDB) and CloudAMQP (hosted RabbitMQ) thrown into the mix.

[diagram: yubl-monolith, the legacy monolithic architecture]

There were numerous problems with the legacy system; some could be rectified with incremental changes (e.g. blue-green deployment), but others required a rethink at an architectural level. Although things looked really simple on paper (at the architecture diagram level), all the complexities were hidden inside each of these 3 services, and boy, were there complexities!

My first tasks were to work with the ops team to improve the existing deployment pipeline and to draw up a list of characteristics we’d want from our architecture:

  • able to do small, incremental deployments
  • deployments should be fast and require no downtime
  • no lock-step deployments
  • features can be deployed independently
  • features are loosely coupled through messages
  • minimise cost for unused resources
  • minimise ops effort

From here we decided on a service-oriented architecture, and AWS Lambda seemed the perfect tool for the job given the workloads we had:

  • lots of APIs, all HTTPS, no ultra-low latency requirement (see the sketch after this list)
  • lots of background tasks, many of which have soft real-time requirements (e.g. distributing a post to followers’ timelines)
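
To give a concrete sense of the first kind of workload, here’s a minimal sketch of a Node.js Lambda handler sitting behind API Gateway (using the Lambda proxy integration). The route, field names and response shape below are purely illustrative, not Yubl’s actual API:

```js
// minimal sketch of an API-style Lambda function behind API Gateway
// (Lambda proxy integration); route and response shape are illustrative
exports.handler = async (event) => {
  // with the proxy integration, API Gateway passes path/query/body in the event
  const userId = event.pathParameters && event.pathParameters.userId;

  if (!userId) {
    return {
      statusCode: 400,
      body: JSON.stringify({ error: 'userId is required' })
    };
  }

  // ...look the user up in whatever data store backs this service...
  return {
    statusCode: 200,
    body: JSON.stringify({ userId, message: `hello, ${userId}` })
  };
};
```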

To a Serverless End

Suffice to say, we knew the migration was going to be a long road with many challenges along the way, and we wanted to do it incrementally and gradually increase the speed of delivery as we went.

“The lead time to someone saying thank you is the only reputation metric that matters”

- Dan North

The first step of the migration was to make the legacy systems publish state changes in the system (e.g. user joined, user A followed user B, etc.) so that we could start building new features on top of them.

To do this, we updated the legacy systems to publish events to Kinesis streams.
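
As a rough sketch (not our actual code), publishing one of these events from a Node.js service might look something like the following; the stream name, partition key and event shape are assumptions for illustration:

```js
// rough sketch: a legacy Node.js service publishing a state-change event
// to a Kinesis stream (stream name and event shape are assumptions)
const AWS = require('aws-sdk');

const kinesis = new AWS.Kinesis();

async function publishEvent(eventType, payload) {
  const event = {
    eventType,                            // e.g. 'user-joined', 'user-followed'
    timestamp: new Date().toISOString(),
    ...payload
  };

  await kinesis.putRecord({
    StreamName: 'yubl-events',            // hypothetical stream name
    PartitionKey: payload.userId,         // keeps one user's events in order
    Data: JSON.stringify(event)
  }).promise();
}

// e.g. when user A follows user B in the legacy code path:
// await publishEvent('user-followed', { userId: 'user-a', targetUserId: 'user-b' });
```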


Our general strategy is:

  • build new features on top of these events, which usually have their own data stores (e.g. DynamoDB, CloudSearch, S3, BigQuery, etc.) together with background processing pipelines and APIs (a simple sketch follows after this list)
  • extract existing features/concepts from the legacy system into services that will run side-by-side
    • these new services will initially be backed by the same shared MongoLab database
    • other services (including the legacy ones) are updated to use hand-crafted API clients to access the encapsulated resources via the new APIs, rather than hitting the shared MongoLab database directly
    • once all access to these resources is done via the new APIs, data migration (usually to DynamoDB tables) will commence behind the scenes
  • wherever possible, requests to existing API endpoints are forwarded to the new APIs, so that we don’t have to wait for the iOS and Android apps to be updated (which can take weeks) and can start reaping the benefits earlier
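
To illustrate the first point above, here’s a simplified sketch of a new feature built on top of the event stream: a Lambda function subscribed to the Kinesis stream that projects events into its own DynamoDB table. The table name and event fields are assumptions rather than our actual schema:

```js
// simplified sketch: a Lambda function subscribed to the Kinesis stream,
// projecting events into the feature's own DynamoDB table (names are assumptions)
const AWS = require('aws-sdk');

const docClient = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event) => {
  for (const record of event.Records) {
    // Kinesis record payloads arrive base64-encoded
    const json = Buffer.from(record.kinesis.data, 'base64').toString('utf8');
    const evt = JSON.parse(json);

    if (evt.eventType === 'user-followed') {
      // maintain this feature's own view of the data, decoupled from the monolith
      await docClient.put({
        TableName: 'followers',             // hypothetical table
        Item: {
          userId: evt.targetUserId,
          followerId: evt.userId,
          followedAt: evt.timestamp
        }
      }).promise();
    }
  }
};
```

Because each feature owns its data store this way, it can be deployed, scaled and eventually retired independently of the legacy monolith.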

After 6 months of hard work, my team of 6 backend engineers (including myself) had drastically transformed our backend infrastructure. Amazon was very impressed by the work we were doing with Lambda and was in the process of writing up a case study of our work when Yubl was shut down at the whim of our major shareholder.

Here’s an almost complete picture of the architecture we ended up with (some details are omitted for brevity and clarity).

[diagram: overall serverless architecture]

Some interesting stats:

  • 170 Lambda functions running in production
  • roughly 1GB of total deployment package size (after the Janitor Lambda cleans up unreferenced versions)
  • Lambda cost was around 5% of what we paid for EC2 for a comparable amount of compute
  • the number of production deployments increased from 9 per month in April to 155 in September

For the rest of the series I’ll drill down into specific features, how we utilised various AWS services, and how we tackled the challenges of:

  • centralised logging
  • centralised configuration management
  • distributed tracing with correlation IDs for Lambda functions
  • keeping Lambda functions warm to avoid cold start penalties
  • auto-scaling AWS resources that do not scale dynamically
  • automatically cleaning up old Lambda function versions
  • securing sensitive data (e.g. MongoDB connection strings, service credentials, etc.)
I can also explain our strategy for testing, for running/debugging functions locally, and so on. If there’s anything you’d like me to cover in particular, please leave a comment and let me know.
