A year in the cloud with AWS

Time really does fly when you’re having fun! In the blink of an eye it’s been a whole year since I left Credit Suisse to start a career making social games with IwI. All in all, it’s been a year filled with lots of hard work, lots of learning and honestly, a hell of a lot of fun! The great thing about doing something completely different is that you get to learn a new stack of technologies to tackle an altogether different set of challenges, and in the end I believe it has made me a better developer.

One of those new technologies which I’ve had to get used to is Amazon Web Services (AWS), Amazon’s cloud computing solution. In this post I’ll give an account of my personal experience of working closely with it during the past year.

By now you’ve probably heard plenty about cloud computing already (chances are one of the providers has tried to sell their service to you!), so I won’t bore you with vague marketing taglines (well, I’ve got nothing to sell here anyway :-P) or wonderful stories of how company XYZ made the move to the cloud and is now able to serve millions more customers whilst paying peanuts to run their servers.

Well.. truth is, things aren’t always rosy and there have been several high profile disasters which resulted in companies losing credibility or, in extreme cases, their entire business (that’s what losing your entire database cluster does to you when you haven’t got adequate backups…). So there are new risks involved and new challenges which need to be tackled.

AWS vs other cloud solutions

All my experience with cloud computing has been with AWS, so I won’t comment on the pros and cons of other cloud solutions, but from my understanding of Microsoft’s Azure and Google’s AppEngine services they have a very different model to that of AWS.

Azure and AppEngine’s model is best described as platform-as-a-service, where you develop your code against an SDK supplied by Microsoft or Google which allows you to make use of their respective SQL/NoSQL solutions, etc. Deployments are usually easy and the service will manage the number of instances (servers) needed to run your application to meet the current demand, whilst letting you set a max/min cap on the number of instances so that your cost doesn’t spiral out of control.

On the other hand, Amazon’s model is infrastructure-as-a-service, which is basically the same as older virtual machine technology but with lots of additional bells and whistles such as automated provisioning, auto scaling, automated billing, etc. You can create new machine images from existing instances and use them to spawn new instances, and you can use the instances for pretty much anything, e.g. web servers, MySQL servers, cache clusters, etc. After all, they’re just virtual machines.

In con­trast to Azure and AppEngine’s mod­el, AWS’s mod­el gives you more flex­i­bil­i­ty and con­trol but at the same time requires you to do a lit­tle bit more work to get going and take on the role of an IT pro as well as a devel­op­er.

Ven­dor Lock-in

Possibly the biggest complaint and worry people have about migrating to the cloud is that you are locked in with a particular vendor. Whilst that’s true, given the current state of affairs and the way things are seemingly going, I’d say fears of the likes of Microsoft, Google or Amazon going bust and their respective cloud services going up in smoke are.. well.. a little extreme.

It’s harder to move away from Azure and AppEngine given that everything you’ve coded is against a particular SDK, but with AWS’s infrastructure-as-a-service model it’s possible to migrate out without too much hassle. In fact, Zynga uses AWS as an incubation chamber for their new games, which allows them to monitor and learn about the hosting requirements for a new game in its turbulent early days (they literally go from hundreds to millions of users in a matter of days, so you can imagine…) before moving the game into their own data centres once the traffic stabilises.

Pric­ing

In terms of pricing, there’s little that separates the three providers I’ve mentioned here. In fact, the last time I looked, the equivalent instances in Azure cost the same as their AWS counterparts, so clearly the market research divisions have been working hard to know exactly what their competitors have been up to :-P

Matu­ri­ty

Having been released in 2006, AWS is one of the oldest cloud services out there, if not the oldest, and even then it’s still barely four years old and still a little rough around the edges (they don’t call it cutting edge for no reason I suppose). It has steadily improved in terms of both features and tooling, and there is an active community out there building supplementary tools/frameworks for it in different languages. I also find that the documentation Amazon provides is in general up-to-date and useful.


Pop­u­lar­i­ty

Much to my surprise, I found out at a recent cloud computing conference that none of Microsoft, Google or Amazon made the list of top 3 cloud computing service providers. Instead, SalesForce.com made number one and I can’t remember who the other two providers were… sorry..

Services

AWS offers a whole ecosystem of different services which should cover most aspects of a given architecture, from service hosting to data storage and messaging. They even recently announced a new DNS service called Route 53!

Here’s a quick rundown of the three most important services which AWS offers:

Elastic Compute Cloud (EC2)

EC2 forms the backbone of Amazon’s cloud solution. Its key characteristics include:

  • you pay for what you use at a per instance per hour rate
  • you pay for the amount of data trans­fers in and out of Ama­zon EC2 (data trans­fer between Ama­zon EC2 and oth­er Ama­zon Web Ser­vices in the same region is free)
  • there are a number of different OS’s to choose from including Linux and Windows; Linux instances are cheaper to run as they don’t carry the license cost associated with Windows
  • there is a good range of instance types to choose from, from the most basic (single CPU, 1.7 GB RAM) to high performance instances (22 GB RAM, 2 x Intel Xeon X5570, quad-core Nehalem, 2 x NVIDIA Tesla Fermi M2050 GPUs)
  • a basic instance running Windows will cost you roughly $3 a day to run non-stop
  • default machine images are reg­u­lar­ly patched by Ama­zon
  • once you’ve set up your serv­er to be the way you want, includ­ing any serv­er updates/patches, you can cre­ate your own AMI (Ama­zon Machine Image) which you can then use to bring up oth­er iden­ti­cal instances
  • there’s an Elastic Load Balancing service which provides load balancing capabilities at additional cost (though most of the time you’ll only need one)
  • there’s a CloudWatch service which you can enable on a per instance basis to help you monitor the CPU, network in/out, etc. of your instances; this service also has its own cost
  • you can use the Auto Scaling service to automatically bring up or terminate instances based on some metric, e.g. terminate 1 instance at a time if average CPU utilization across all instances stays below 50% for 5 minutes continuously, but bring up 1 new instance at a time if average CPU stays above 70% for 5 minutes
  • you can use the instances as web servers, DB servers, a Memcached cluster, etc., the choice is yours
  • round trips within Amazon are very fast, but trips out of Amazon are significantly slower, therefore the usual approach is to use Amazon SimpleDB (see below) or Amazon RDS as the database (should you need one, that is)
  • the Amazon SDK is pretty solid and contains enough classes to help you write a custom monitoring/scaling service if you ever need to, but the AWS Management Console (see further down) lets you do most basic operations anyway
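
To make the Auto Scaling rule from the list above a bit more concrete, here is a minimal Python sketch of the decision logic only. It does not call the Auto Scaling service or the Amazon SDK; the function name, the one-sample-per-minute assumption and the min/max bounds are all made up for illustration.

```python
from typing import Sequence


def desired_instance_count(
    current: int,
    cpu_samples: Sequence[float],
    window: int = 5,
    low: float = 50.0,
    high: float = 70.0,
    min_instances: int = 1,
    max_instances: int = 10,
) -> int:
    """Apply the scaling rule once and return the new instance count.

    cpu_samples holds the average CPU (%) across all instances, one sample
    per minute, oldest first (an assumption of this sketch).
    """
    recent = list(cpu_samples)[-window:]
    if len(recent) < window:
        return current  # not enough data yet, leave things alone
    if all(sample > high for sample in recent):
        # CPU stayed above the high threshold for the whole window
        return min(current + 1, max_instances)
    if all(sample < low for sample in recent):
        # CPU stayed below the low threshold for the whole window
        return max(current - 1, min_instances)
    return current
```

Changing the count by one instance at a time and requiring the whole window to agree keeps the fleet from flapping when the load hovers around a threshold.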

Sim­pleDB

Amazon’s NoSQL database is a non-relational, distributed, key-value data store. Its key characteristics include:

  • compared to traditional relational databases it has lower per-request performance; operations that would typically take 15–20ms tend to take anything between 75 and 100ms to complete
  • in return, you get high scal­a­bil­i­ty with­out hav­ing to do any work
  • you pay for usage – how much work it takes to exe­cute your query
  • you start off with a single instance hosting your data and instances are auto scaled up and down depending on traffic; there’s no way to change the number of database instances you start off with
  • sup­ports a SQL-like query­ing syn­tax, though still fair­ly lim­it­ed
  • for .Net, MindScape’s Sim­pleDB Man­age­ment Tools is the best man­age­ment tool we’ve used, it inte­grates direct­ly into Visu­al Stu­dio and at $29 a head it’s not expen­sive either
  • it performs best when traffic increases/decreases steadily; there’s a noticeable slump in response times when there’s a sudden surge of traffic, as new instances take around 10–15 mins to be ready to service requests
  • data are partitioned into ‘domains’, which are equivalent to tables in a relational database
  • data are non-relational; if you need a relational model then use Amazon RDS (I don’t have any experience with it, so I’m not the best person to comment on it)
  • be aware of ‘eventual consistency’: data are duplicated on multiple instances after Amazon scales up your database to meet the current traffic, and synchronization is not guaranteed when you do an update, so it’s possible (though highly unlikely) to update some data, read it back straight away and get the old data back
  • there are ‘consistent read’ and ‘conditional update’ mechanisms available to guard against the eventual consistency problem; if you’re developing in .Net, I suggest using the SimpleSavant client to talk to SimpleDB, it’s a fairly feature-complete ORM for SimpleDB which already supports both consistent reads and conditional updates
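
For anyone who hasn’t come across conditional updates before, the idea can be sketched in a few lines of Python. Everything here is hypothetical: FakeDomain is an in-memory stand-in rather than a SimpleDB client, and the ‘version’ attribute is just one common way of using a conditional update to avoid overwriting someone else’s change (real SimpleDB attributes are also strings, which this sketch ignores).

```python
class ConditionFailed(Exception):
    """Raised when the expected value does not match what is stored."""


class FakeDomain:
    """In-memory stand-in for a SimpleDB domain (illustration only)."""

    def __init__(self):
        self._items = {}

    def get(self, item_name):
        # Plays the part of a consistent read: always sees the latest write.
        return dict(self._items.get(item_name, {}))

    def conditional_put(self, item_name, attributes, expected_name, expected_value):
        # Only write if the named attribute still has the value we expect.
        current = self._items.get(item_name, {})
        if current.get(expected_name) != expected_value:
            raise ConditionFailed(item_name)
        self._items[item_name] = {**current, **attributes}


def add_coins(domain, player, amount, max_attempts=3):
    """Read-modify-write guarded by a version attribute."""
    for _ in range(max_attempts):
        item = domain.get(player)
        version = item.get("version")  # None for a brand new item
        new_item = {
            "coins": item.get("coins", 0) + amount,
            "version": (version or 0) + 1,
        }
        try:
            domain.conditional_put(player, new_item, "version", version)
            return new_item["coins"]
        except ConditionFailed:
            continue  # somebody else wrote first, so re-read and try again
    raise RuntimeError("gave up after repeated conflicts")
```

If two writers read version 1 at the same time, only the first conditional put succeeds; the second fails the check, re-reads and applies its change on top of the newer data.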

Sim­ple Stor­age Ser­vice (S3)

Amazon’s storage service is, again, extremely scalable, and safe too: when you save a file to S3 it’s replicated across multiple nodes so you get some DR ability straight away. Many popular services such as Dropbox use it behind the scenes and it’s also the storage of choice for many image hosting services. Key characteristics include:

  • you pay for data trans­fer in and out (data trans­fer between EC2 and S3 in the same region is free)
  • files are stored against a key
  • you create ‘buckets’ to hold your files, and each bucket has a unique URL (unique across all of Amazon, not just within your own S3 account)
  • there’s a CloudFront service for content delivery; data are cached on the first request, which speeds up subsequent requests from the same region
  • Cloud­Ber­ry S3 Explor­er is the best UI client we’ve used in Win­dows
  • you can use the Amazon SDK to write your own repository layer which utilizes S3
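
As a rough idea of what such a repository layer might look like, here is a Python sketch. None of the names come from the Amazon SDK: BlobStore is a hypothetical interface for ‘store bytes against a key in a bucket’, InMemoryBlobStore is a fake used so the example runs without an AWS account, and a real implementation would wrap the SDK’s put/get object calls instead.

```python
import json
from typing import Dict, Optional, Protocol, Tuple


class BlobStore(Protocol):
    """Hypothetical interface: files stored against a key inside a bucket."""

    def put(self, bucket: str, key: str, data: bytes) -> None: ...

    def get(self, bucket: str, key: str) -> Optional[bytes]: ...


class InMemoryBlobStore:
    """Fake store for local testing; an S3-backed class would replace it."""

    def __init__(self) -> None:
        self._blobs: Dict[Tuple[str, str], bytes] = {}

    def put(self, bucket: str, key: str, data: bytes) -> None:
        self._blobs[(bucket, key)] = data

    def get(self, bucket: str, key: str) -> Optional[bytes]:
        return self._blobs.get((bucket, key))


class PlayerProfileRepository:
    """Saves and loads player profiles as JSON documents."""

    def __init__(self, store: BlobStore, bucket: str) -> None:
        self._store = store
        self._bucket = bucket

    def _key(self, player_id: str) -> str:
        # The key layout is an arbitrary choice made for this sketch.
        return f"profiles/{player_id}.json"

    def save(self, player_id: str, profile: dict) -> None:
        data = json.dumps(profile).encode("utf-8")
        self._store.put(self._bucket, self._key(player_id), data)

    def load(self, player_id: str) -> Optional[dict]:
        data = self._store.get(self._bucket, self._key(player_id))
        return None if data is None else json.loads(data.decode("utf-8"))
```

Keeping the storage behind a small interface like this also means the rest of the application doesn’t need to know whether the files live in S3 or somewhere else.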

These are the three core services which most people use AWS for. There are other useful services which I haven’t mentioned yet, such as the Simple Queue Service (SQS) and Elastic MapReduce, but those are more for edge cases.

Cost

Low­er cost of entry

The great thing about the pay-as-you-go pricing model for cloud computing solutions in general is that as a small start-up, or even an individual, you can have the capability to serve millions of customers right from the word go without having to invest heavily up front in infrastructure and hardware. This serves to lower the cost of entry and therefore the risk involved, which consequently encourages innovation.

Dimin­ish­ing val­ue of rent­ing

However, to draw an analogy with car rentals: if you only need a car occasionally it makes much more economic sense to simply rent whenever you need one, but as your needs increase there comes a point when it becomes cheaper to own a car outright. The same can be said of the cost of running all your services out of AWS, especially for the high performance instances required to run a database, for example. See the screenshots below for some examples of the available instance types and the corresponding cost of renting by the hour:

image

image

image

Reserv­ing instances

In addi­tion to the stan­dard ‘pay for what you use’ mod­el, Ama­zon also offers dis­counts when you reserve an instance for 1 or 3 year terms for a one-time fee, after which the instance is reserved for you.

image

So one way to cut your costs is to reserve the min­i­mum num­ber of instances you will need to run con­stant­ly based on the min­i­mum usage of your ser­vice and sup­ple­ment them with spot instances you request dynam­i­cal­ly to cope with spikes in traf­fic.
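
To get a feel for when reserving pays off, you can work out the break-even point with a few lines of Python. This is only a sketch: the function names are mine, and the prices in the usage note below are made-up round numbers rather than Amazon’s actual rates, so plug in the figures from the current pricing page.

```python
HOURS_PER_YEAR = 24 * 365


def on_demand_cost(hours: float, hourly_rate: float) -> float:
    """Total cost of paying by the hour with no commitment."""
    return hours * hourly_rate


def reserved_cost(hours: float, upfront_fee: float, hourly_rate: float) -> float:
    """Total cost of a reserved instance: one-time fee plus discounted hours."""
    return upfront_fee + hours * hourly_rate


def break_even_hours(upfront_fee: float, reserved_rate: float, on_demand_rate: float) -> float:
    """Hours of usage after which reserving becomes the cheaper option."""
    saving_per_hour = on_demand_rate - reserved_rate
    if saving_per_hour <= 0:
        raise ValueError("reserved rate must be lower than the on-demand rate")
    return upfront_fee / saving_per_hour
```

With a hypothetical $200 one-time fee, $0.05 per hour reserved and $0.10 per hour on-demand, break_even_hours(200, 0.05, 0.10) comes out at around 4,000 hours, which is less than half of the 8,760 hours in a year, so an instance that runs non-stop would be cheaper reserved, whereas one that only runs during traffic spikes would not.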

Using EC2 as a sup­ple­ment

Some companies such as Zynga use a similar approach, where they run the majority of their services out of their own data centres but use EC2 instances to supplement that and help them cope with surges in traffic.

This approach, however, doesn’t apply to vendors which use a platform-as-a-service model (e.g. Azure and AppEngine) because you are more tightly locked in with the specific vendor and can’t simply run part of your service outside of their platform. For example, if you’ve developed your application to use Azure, you won’t be able to run your application out of your own servers because Azure as a platform is only provided by Microsoft.

Beware of the additional costs

When it comes to estimating your cost, it’s easy to forget about all the other small charges you incur, such as the cost of data transfers, which are fairly cheap but, depending on your usage, can easily build up and eclipse the cost of running the virtual servers. Take a flash game for example: the game and relevant assets etc. can easily amount to a few megabytes. This on its own is nothing, but multiply that by 500k, 1 million, 2 million, 10 million users, and then multiply by the number of updates/patches which require the users to download the package again, and soon you just might be looking at a rather sizable bill for your data transfers..
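
The multiplication above is easy to play with in Python. Treat this as a back-of-the-envelope sketch: the 5 MB package size, the 4 downloads per user and the $0.15 per GB price are all numbers I’ve made up for the example, not Amazon’s rates.

```python
MB_PER_GB = 1024


def transfer_cost(package_mb: float, users: int, downloads_per_user: int, price_per_gb: float) -> float:
    """Estimated bill for serving the game package to every user."""
    total_gb = package_mb * users * downloads_per_user / MB_PER_GB
    return total_gb * price_per_gb


if __name__ == "__main__":
    # Hypothetical inputs: 5 MB package, initial download plus 3 patches.
    for users in (500_000, 1_000_000, 2_000_000, 10_000_000):
        cost = transfer_cost(5, users, 4, 0.15)
        print(f"{users:>10,} users: ${cost:,.2f}")
```

Under these made-up numbers the bill grows linearly with both the user count and the number of patches, which is why a few megabytes per download stops being ‘nothing’ fairly quickly.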

Add to that the cost of other peripheral services such as CloudWatch or Load Balancing, etc., which are all perfectly reasonable on their own, but they all add up in the end.

It’s possible to mitigate some of these additional costs. For example, you could make use of the CloudFront service to reduce the amount of data transfers from S3 (data is cached after the first request in a given region), or better still you can architect your application so that it only loads resources at the time when they’re required, but obviously this adds to the complexity of your application and can also complicate the deployment process.

Web Console

The standard AWS Management Console (see image below) is good if unspectacular, and doesn’t cover the full range of the services Amazon provides. There are also some important features missing. For instance, in order to talk to the Auto Scaling service to adjust the scaling parameters (e.g. change the max number of instances from 10 to 5), you have to either download and use a command line tool or use the Amazon SDK, or build some UI around it to make life easier for yourself as we did.

image

There are other vendors such as RightScale which provide you with better tooling to help you manage/automate a lot of the work you’d otherwise have to do yourself, but they usually have pretty steep pricing and don’t represent great value for money for a small startup looking to get up and running cheaply. You know, the sort of folks that are attracted by cloud computing’s low cost entry point.. wait a minute…

Reliability

The well goes deep…

I read an interesting story not long ago about the Distributed DOS attack on Amazon which the Anonymous collective of online protesters tried to pull off (in protest of Amazon cutting WikiLeaks loose) after successful attempts on several other high profile sites. The attack ended in failure as the organizers admitted that Amazon was too hard a nut to crack, so clearly the well is deep enough for all of us, and then some!

Out­ages and Per­for­mance Drops

Over the last year there have been several outages, all of which were resolved fairly quickly, but there have also been several occasions where the performance (in terms of response time and/or number of timed-out requests) noticeably dropped for both SimpleDB and S3.

When you request a new EC2 instance, you have to spec­i­fy which avail­abil­i­ty zone the instance should be spawned in, but for ser­vices such as Sim­pleDB you don’t have this con­trol and new instances are always spawned in the default avail­abil­i­ty zone for your cur­rent region.

For instance, if your current region is US East (N. Virginia) the default availability zone is us-east-1a, and everyone who requests EC2 instances without changing the default availability zone will be using the same zone and is therefore likely to cause a lot of congestion in that zone and affect other services such as SimpleDB. There have even been several times when we simply weren’t able to scale up our application because the us-east-1a availability zone had no spare capacity!

It’s very important to build lots of fault tolerance into your application when you’re using AWS. You should also avoid (where possible) using the default availability zones for your EC2 instances as they tend to be the most likely to get congested.
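
One simple piece of fault tolerance is to retry calls that time out, waiting a little longer between each attempt. The Python sketch below shows the general shape; the helper name, the default delays and the choice of TimeoutError as the retryable error are my own assumptions, and a real client would retry on whatever exceptions its SDK raises for timeouts and throttling.

```python
import random
import time


def call_with_retries(
    operation,
    max_attempts=4,
    base_delay=0.1,
    max_delay=2.0,
    retry_on=(TimeoutError,),
    sleep=time.sleep,
):
    """Run operation(), retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts, let the caller deal with it
            # Double the ceiling each time, capped at max_delay, and sleep
            # for a random fraction of it so clients don't retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

You would wrap each SimpleDB or S3 call in something like call_with_retries(lambda: load_profile(player_id)), where load_profile is whatever function makes the request in your own code.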

NOTE: as I mentioned before about data transfer costs, data transfer between EC2 and other Amazon Web Services in the same region is free, so there’s no need to worry about incurring extra costs by having your EC2 instances running in a different availability zone to that of SimpleDB/S3.

Bug Report­ing

Being a rapidly and constantly evolving platform, it’s no surprise that the odd bug has been introduced as the result of an update, e.g. for a little while no one was able to remote desktop to any instance whose public IP started with 50, e.g. 50.0.0.0…

There is an active discussion forum where you can report any bugs you notice about the services, and Amazon employees do monitor these forums and provide helpful feedback. In addition to that, there’s also a service status dashboard you can use to check the current status of their services by date and by region.

Parting thoughts…

So there, a not so quick :-P high level summary of my experience with AWS over the last 12 months. To wrap things up, here are a few blogs you could read regularly to find out what’s going on in the ‘cloud’:

High Scal­a­bil­i­ty

Cloud­Har­mo­ny

Ama­zon Web Ser­vices blog

Well, hope this helps you in some way, belat­ed hap­py 2011!