If you’ve been using the S3 client in the AWS SDK for .NET you might have noticed that there are no methods that let you interact with the folders in a bucket. As it turns out, S3 does not support folders in the conventional sense: everything is still a key-value pair, and tools such as CloudBerry or indeed the Amazon web console simply use the ‘/’ characters in the key to indicate a folder structure.

This might seem odd at first, but when you think about it there is no folder structure on your hard drive either; it’s a logical structure the OS provides to make life easier for us mere mortals.

Back to the topic at hand, what this means is that:

  • if you add an object with key myfolder/ to S3, it’ll be seen as a folder
  • if you add an object with key myfolder/myfile.txt to S3, it’ll be seen as a file myfile.txt inside a myfolder folder; S3 itself does not create a separate myfolder/ object, the tools simply infer the folder from the key
  • when you make a ListObjects call, myfolder/myfile.txt will be included in the result, as will myfolder/ if that object exists
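To make the rules above concrete, here is a minimal sketch of how a tool might read folders out of plain keys. It only works on strings and never talks to S3; the class and method names (S3KeyPaths, IsFolderKey, ImpliedFolders) are made up for this illustration and are not part of the AWS SDK.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the AWS SDK: it only inspects key strings.
public static class S3KeyPaths
{
    // A key is treated as a folder when it ends with '/'.
    public static bool IsFolderKey(string key)
    {
        return key.EndsWith("/", StringComparison.Ordinal);
    }

    // Returns every folder implied by a key, outermost first,
    // e.g. "a/b/c.txt" implies "a/" and "a/b/".
    public static IEnumerable<string> ImpliedFolders(string key)
    {
        int index = key.IndexOf('/');
        while (index >= 0)
        {
            yield return key.Substring(0, index + 1);
            index = key.IndexOf('/', index + 1);
        }
    }
}
```

For the key myfolder/myfile.txt, ImpliedFolders yields just myfolder/, which is the folder a tool would display even if no myfolder/ object was ever uploaded.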

Cre­at­ing folders

To create a folder, you just need to add an empty object whose key ends with ‘/’, like this:

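// requires: using System.IO; using Amazon.S3.Model;
public void CreateFolder(string bucket, string folder)
{
    // a "folder" is just an empty object whose key ends with '/';
    // trim first so a caller passing "myfolder/" doesn't end up with "myfolder//"
    var key = folder.TrimEnd('/') + "/";
    var request = new PutObjectRequest().WithBucketName(bucket).WithKey(key);
    request.InputStream = new MemoryStream(); // zero-byte body
    _client.PutObject(request);
}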

Here is a thread on the Ama­zon forum which cov­ers this technique.

List­ing con­tents of a folder

With the ListObjects method on the S3 client you can specify a prefix, so to get the list of objects in a particular folder simply use the path of the folder (e.g. topfolder/middlefolder/) as the prefix in the request:

var request = new ListObjectsRequest().WithBucketName(bucket).WithPrefix(folder);
var response = _client.ListObjects(request);

If you are only interested in the objects (including folders) at the top level of your bucket then you’d need to do some filtering on the S3 objects returned in the response, something along the lines of the following (for a listing under a prefix, strip the prefix from each key before applying the same checks):

// get the objects at the TOP LEVEL, i.e. not inside any folders
var objects = response.S3Objects.Where(o => !o.Key.Contains(@"/"));

// get the folders at the TOP LEVEL only
var folders = response.S3Objects.Except(objects)
                      .Where(o => o.Key.Last() == '/' &&
                                  o.Key.IndexOf(@"/") == o.Key.LastIndexOf(@"/"));
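The filters above look at whole keys, so they only work for a listing of the bucket root. Here is a rough sketch of the same idea for any folder: strip the prefix first, then look at what is left. It works on plain key strings (e.g. the Key of each returned S3 object), and S3FolderListing with its methods is a hypothetical helper of my own, not something the SDK provides.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the AWS SDK: it filters plain key strings.
public static class S3FolderListing
{
    // Files sitting directly inside `prefix` (pass "" for the bucket root).
    public static IEnumerable<string> DirectFiles(IEnumerable<string> keys, string prefix)
    {
        return Relative(keys, prefix)
            .Where(k => !k.Contains("/"))
            .Select(k => prefix + k);
    }

    // Folders sitting directly inside `prefix`, including folders that are only
    // implied by deeper keys and have no folder object of their own.
    public static IEnumerable<string> DirectFolders(IEnumerable<string> keys, string prefix)
    {
        return Relative(keys, prefix)
            .Where(k => k.Contains("/"))
            .Select(k => prefix + k.Substring(0, k.IndexOf('/') + 1))
            .Distinct();
    }

    // Keys under `prefix` with the prefix removed; the folder object itself is skipped.
    private static IEnumerable<string> Relative(IEnumerable<string> keys, string prefix)
    {
        return keys
            .Where(k => k.StartsWith(prefix, StringComparison.Ordinal) && k.Length > prefix.Length)
            .Select(k => k.Substring(prefix.Length));
    }
}
```

Worth knowing: the S3 list operation also accepts a delimiter parameter that groups keys into common prefixes on the server side, which is likely the better option for big buckets since it avoids pulling every key down just to filter it.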

Time really does fly when you’re having fun! In the blink of an eye it’s been a whole year since I left Credit Suisse to start a career making social games with IwI. All in all, it’s been a year filled with lots of hard work, lots of learning and honestly, a hell of a lot of fun! The great thing about doing something completely different is that you get to learn about a new stack of technologies to tackle an altogether different set of challenges, and in the end I believe it has made me a better developer.

One of those new technologies which I’ve had to get used to is Amazon Web Services (AWS), Amazon’s cloud computing solution, and this post is an account of my personal experience of working closely with it during this past year.

By now you’ve probably heard plenty about cloud computing already (well, chances are one of the providers has tried to sell their service to you already!), so I won’t bore you with vague marketing taglines (well, I’ve got nothing to sell here anyway :-P) or wonderful stories of how company XYZ made a move to the cloud and is now able to serve millions more customers whilst paying peanuts to run their servers.

Well.. truth is, things aren’t always rosy and there have been several high profile disasters which resulted in companies losing credibility or, in extreme cases, their entire business (that’s what losing your entire database cluster does to you when you haven’t got adequate backups…). So there are new risks involved and new challenges which need to be tackled.

AWS vs other cloud solutions

All my experience with cloud computing has been with AWS so I won’t comment on the pros and cons of other cloud solutions, but from my understanding of Microsoft’s Azure and Google’s AppEngine services they have a very different model to that of AWS.

Azure and AppEngine’s model is best described as platform-as-a-service, where you develop your code against an SDK supplied by Microsoft or Google which allows you to make use of their respective SQL/NoSQL solutions, etc. Deployments are usually easy and the service will manage the number of instances (servers) needed to run your application to meet the current demand, whilst letting you set a max/min cap on the number of instances so that your cost doesn’t spiral out of control.

On the other hand, Amazon’s model is infrastructure-as-a-service, which is basically the same as older virtual machine technology but with lots of additional bells and whistles such as automated provisioning, auto scaling, automated billing, etc. You have the ability to create new machine images from existing instances and use them to spawn new instances, and you can use the instances for pretty much any purpose, e.g. web servers, MySQL servers, cache clusters, etc. After all, they’re just virtual machines.

In con­trast to Azure and AppEngine’s model, AWS’s model gives you more flex­i­bil­ity and con­trol but at the same time requires you to do a lit­tle bit more work to get going and take on the role of an IT pro as well as a developer.

Ven­dor Lock-in

Possibly the biggest complaint and worry people have about migrating to the cloud is that you are locked in with a particular vendor. Whilst that’s true, given the current state of affairs and the way things are seemingly going, I’d say fears of the likes of Microsoft, Google or Amazon going bust and their respective cloud services going up in smoke are.. well.. a little extreme.

It’s harder to move away from Azure and AppEngine given that everything you’ve coded is against a particular SDK, but with AWS’s infrastructure-as-a-service model it’s possible to migrate out of it without too much hassle. In fact, Zynga uses AWS as an incubation chamber for their new games, which allows them to monitor and learn about the hosting requirements of a new game in its turbulent early days (well, they literally go from hundreds to millions of users in a matter of days so you can imagine…) before moving the game into their own data centres once the traffic stabilises.

Pric­ing

In terms of pricing, there’s little that separates the three providers I’ve mentioned here. In fact, the last time I looked, equivalent instances in Azure cost the same as their AWS counterparts, so clearly the market research divisions have been working hard to know exactly what their competitors have been up to :-P

Matu­rity

Having been released in 2006, AWS is one of the oldest cloud services out there, if not the oldest, and even then it’s still barely four years old and still a little rough around the edges (they don’t call it cutting edge for no reason I suppose). It has steadily improved both in terms of features and tooling, and there is an active community out there building supplementary tools/frameworks for it in different languages. I also find that the documentation Amazon provides is in general up-to-date and useful.

Pop­u­lar­ity

Much to my sur­prise, I found out at a recent cloud com­put­ing con­fer­ence that none of Microsoft, Google or Ama­zon made the list of top 3 cloud com­put­ing ser­vice providers. Instead, SalesForce.com made num­ber one and I can’t remem­ber who the other two providers are…sorry..

Ser­vices

AWS offers a whole ecosystem of different services which should cover most aspects of a given architecture, from service hosting to data storage and messaging. They even recently announced a new DNS service called Route 53!

Here’s a quick rundown of the three most important services which AWS offers:

Elastic Compute Cloud (EC2)

EC2 forms the back­bone of Amazon’s cloud solu­tion, its key char­ac­ter­is­tics include:

  • you pay for what you use at a per instance per hour rate
  • you pay for the amount of data trans­fers in and out of Ama­zon EC2 (data trans­fer between Ama­zon EC2 and other Ama­zon Web Ser­vices in the same region is free)
  • there are a number of different OSes to choose from including Linux and Windows; Linux instances are cheaper to run as they don’t carry the license cost associated with Windows
  • there is a good range of different instance types to choose from, from the most basic (single CPU, 1.7GB RAM) to high performance instances (22GB RAM, 2 x Intel Xeon X5570, quad-core Nehalem, 2 x NVIDIA Tesla Fermi M2050 GPUs)
  • a basic instance running Windows will cost you roughly $3 a day to run non-stop
  • default machine images are reg­u­larly patched by Ama­zon
  • once you’ve set up your server to be the way you want, includ­ing any server updates/patches, you can cre­ate your own AMI (Ama­zon Machine Image) which you can then use to bring up other iden­ti­cal instances
  • there’s an Elastic Load Balancing service which provides load balancing capabilities at additional cost (though most of the time you’ll only need one load balancer)
  • there’s a CloudWatch service which you can enable on a per instance basis to help you monitor the CPU, network in/out, etc. of your instances; this service also has its own cost
  • you can use the Auto Scaling service to automatically bring up or terminate instances based on some metric, e.g. terminate 1 instance at a time if average CPU utilization across all instances is less than 50% for 5 minutes continuously, but bring up 1 new instance at a time if average CPU goes beyond 70% for 5 minutes
  • you can use the instances as web servers, DB servers, a Memcached cluster, etc. The choice is yours
  • round trips within Amazon are very fast, but trips out of Amazon are significantly slower, therefore the usual approach is to use Amazon SimpleDB (see below) or Amazon RDS as the database (should you need one, that is)
  • the AWS SDK is pretty solid and contains enough classes to help you write a custom monitoring/scaling service if you ever need to, but the AWS Management Console (see lower down) lets you do most basic operations anyway
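If you do end up writing your own scaling service, the example rule from the list above (scale down below 50% CPU for 5 minutes, scale up above 70% for 5 minutes) boils down to something like the sketch below. It is only the decision logic, assuming you already collect one average CPU reading per minute from somewhere; ScalingRule and its members are hypothetical names, and this is not how the Auto Scaling service itself is implemented.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical decision logic only; fetching metrics and starting or
// terminating instances is left to whatever tooling you use.
public static class ScalingRule
{
    public const int WindowMinutes = 5;
    public const double ScaleUpAbove = 70.0;   // percent
    public const double ScaleDownBelow = 50.0; // percent

    // cpuPerMinute: average CPU across all instances, one reading per minute, oldest first.
    // Returns +1 (bring up one instance), -1 (terminate one) or 0 (do nothing).
    public static int Decide(IList<double> cpuPerMinute, int running, int min, int max)
    {
        if (cpuPerMinute.Count < WindowMinutes) return 0; // not enough data yet

        var window = cpuPerMinute.Skip(cpuPerMinute.Count - WindowMinutes).ToList();

        if (window.All(c => c > ScaleUpAbove) && running < max) return 1;
        if (window.All(c => c < ScaleDownBelow) && running > min) return -1;
        return 0;
    }
}
```

Changing one instance at a time and requiring the whole window to agree keeps the rule from flapping when CPU hovers around a threshold, at the cost of reacting slowly to sudden spikes.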

Sim­pleDB

Amazon’s NoSQL data­base, it is a non-relational, dis­trib­uted, key-value data store, its key char­ac­ter­is­tics include:

  • compared to traditional relational databases it has lower per-request performance: operations that typically take 15-20ms tend to take anywhere between 75-100ms to complete
  • in return, you get high scal­a­bil­ity with­out hav­ing to do any work
  • you pay for usage – how much work it takes to exe­cute your query
  • you start off with a single instance hosting your data and instances are auto scaled up and down depending on traffic; there’s no way to change the number of database instances you start off with
  • sup­ports a SQL-like query­ing syn­tax, though still fairly limited
  • for .NET, MindScape’s SimpleDB Management Tools is the best management tool we’ve used, it integrates directly into Visual Studio and at $29 a head it’s not expensive either
  • performs best when traffic increases/decreases steadily; there’s a noticeable slump in response times when there’s a sudden surge of traffic, as new instances take around 10-15 mins to be ready to service requests
  • data is partitioned into ‘domains’, which are equivalent to tables in a relational database
  • data is non-relational; if you need a relational model then use Amazon RDS (I don’t have any experience with it so I’m not the best person to comment on it)
  • be aware of ‘eventual consistency’: data is duplicated on multiple instances after Amazon scales up your database to meet the current traffic, and synchronization is not guaranteed when you do an update, so it’s possible, though highly unlikely, to update some data then read it back straight away and get the old data back
  • there are ‘consistent read’ and ‘conditional update’ mechanisms available to guard against the eventual consistency problem; if you’re developing in .NET, I suggest using the SimpleSavant client to talk to SimpleDB, it’s a fairly feature-complete ORM for SimpleDB which already supports both consistent reads and conditional updates
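To show why a conditional update helps, here is a small sketch of the pattern using a version number: read the item, compute the new value, and write it back only if nobody else has changed it in the meantime. The VersionedStore below is a single-threaded, in-memory stand-in I made up for illustration; it is not the SimpleDB API, and SimpleSavant’s own interface will look different.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for a store that supports conditional writes.
public class VersionedStore
{
    private readonly Dictionary<string, KeyValuePair<int, string>> _items =
        new Dictionary<string, KeyValuePair<int, string>>();

    // Returns (version, value); version 0 with a null value means "no item yet".
    public KeyValuePair<int, string> Read(string key)
    {
        KeyValuePair<int, string> item;
        return _items.TryGetValue(key, out item)
            ? item
            : new KeyValuePair<int, string>(0, null);
    }

    // Writes only when the stored version still equals expectedVersion.
    public bool ConditionalPut(string key, string value, int expectedVersion)
    {
        if (Read(key).Key != expectedVersion) return false; // someone else got there first
        _items[key] = new KeyValuePair<int, string>(expectedVersion + 1, value);
        return true;
    }
}

public static class Counter
{
    // Read-modify-write loop: retry from a fresh read whenever the write is rejected.
    public static int Increment(VersionedStore store, string key)
    {
        while (true)
        {
            var current = store.Read(key);
            int next = (current.Value == null ? 0 : int.Parse(current.Value)) + 1;
            if (store.ConditionalPut(key, next.ToString(), current.Key)) return next;
        }
    }
}
```

A write based on a stale read gets rejected instead of silently overwriting newer data, which is exactly the failure the eventual consistency bullet warns about.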

Sim­ple Stor­age Ser­vice (S3)

Amazon’s storage service is, again, extremely scalable, and safe too: when you save a file to S3 it’s replicated across multiple nodes so you get some DR ability straight away. Many popular services such as Dropbox use it behind the scenes and it’s also the storage of choice for many image hosting services. Key characteristics include:

  • you pay for data trans­fer in and out (data trans­fer between EC2 and S3 in the same region is free)
  • files are stored against a key
  • you create ‘buckets’ to hold your files, and each bucket has a unique URL (unique across all of Amazon, and therefore across all S3 accounts)
  • there’s a CloudFront service for content delivery; data is cached on the first request, which speeds up subsequent requests from the same region
  • CloudBerry S3 Explorer is the best UI client we’ve used in Windows
  • you can use the AWS SDK to write your own repository layer which utilizes S3

These are the three core services which most people use AWS for, but there are other useful services which I haven’t mentioned yet, such as the Simple Queue Service (SQS) and Elastic MapReduce, though those are more for edge cases.

Cost

Lower cost of entry

The great thing about the pay-as-you-go pricing model for cloud computing solutions in general is that as a small start-up, or even an individual, you can have the capability to serve millions of customers right from the word go without having to invest heavily up front in infrastructure and hardware. This serves to lower the cost of entry and therefore the risk involved, which consequently encourages innovation.

Diminishing value of renting

However, to draw an analogy with car rentals: if you only need a car occasionally it makes much more economic sense to simply rent whenever you need one, but as your needs increase there comes a point when it becomes cheaper to actually own a car outright. The same can be said of the cost of running all your services out of AWS, especially for the high performance instances required to run a database for example. See the screenshots below for some examples of the available instance types and the corresponding cost of renting by the hour:

image

image

image

Reserv­ing instances

In addition to the standard ‘pay for what you use’ model, Amazon also offers discounts when you reserve an instance for a 1 or 3 year term: you pay a one-time fee, after which the instance is reserved for you at a lower hourly rate.

image

So one way to cut your costs is to reserve the min­i­mum num­ber of instances you will need to run con­stantly based on the min­i­mum usage of your ser­vice and sup­ple­ment them with spot instances you request dynam­i­cally to cope with spikes in traffic.
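Whether reserving is worth it comes down to a break-even calculation: the one-time fee has to be recovered through the cheaper hourly rate. A minimal sketch of the arithmetic is below; ReservedInstanceMath is a made-up helper, and any figures you feed it should come from Amazon’s current price list rather than from this post.

```csharp
using System;

// Hypothetical helper for comparing on-demand and reserved pricing.
public static class ReservedInstanceMath
{
    public static decimal OnDemandCost(decimal hourlyRate, decimal hours)
    {
        return hourlyRate * hours;
    }

    public static decimal ReservedCost(decimal upfrontFee, decimal hourlyRate, decimal hours)
    {
        return upfrontFee + hourlyRate * hours;
    }

    // Hours of use beyond which the reserved instance becomes the cheaper option.
    public static decimal BreakEvenHours(decimal upfrontFee, decimal reservedHourly, decimal onDemandHourly)
    {
        if (reservedHourly >= onDemandHourly)
            throw new ArgumentException("Reserved rate must be lower than the on-demand rate.");
        return upfrontFee / (onDemandHourly - reservedHourly);
    }
}
```

As an illustration with invented numbers: an on-demand rate of $0.125 an hour (roughly the $3 a day mentioned earlier), a $150 one-time fee and a $0.05 reserved rate give BreakEvenHours(150m, 0.05m, 0.125m) = 2,000 hours, i.e. about 83 days of non-stop running before the reservation pays for itself.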

Using EC2 as a supplement

Some companies such as Zynga use a similar approach, where they run the majority of their services out of their own data centres but use EC2 instances to supplement that and help them cope with surges in traffic.

This approach, however, doesn’t apply to vendors which use a platform-as-a-service model (e.g. Azure and AppEngine) because you are more tightly locked in with the specific vendor and can’t simply run part of your service outside of their platform. For example, if you’ve developed your application to use Azure, you won’t be able to run your application out of your own servers because Azure as a platform is only provided by Microsoft.

Beware of the additional costs

When it comes to estimating your cost, it’s easy to forget about all the other small charges you incur, such as the cost of data transfers, which whilst fairly cheap can, depending on your usage, easily build up and eclipse the cost of running the virtual servers. Take a Flash game for example: the game and relevant assets etc. can easily amount to a few megabytes. This on its own is nothing, but multiply that by 500k, 1 million, 2 million, 10 million users, and then multiply by the number of updates/patches which require the users to download the package again, and soon you just might be looking at a rather sizable bill for your data transfers..
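The multiplication described above is easy to put into a back-of-the-envelope calculator. This is only a sketch: TransferCost is a hypothetical helper, it assumes a single flat price per gigabyte (real pricing is tiered), and the price used in the example is invented.

```csharp
// Hypothetical back-of-the-envelope estimate, assuming one flat price per GB.
public static class TransferCost
{
    public static decimal Estimate(long users, decimal packageMegabytes,
                                   int downloadsPerUser, decimal pricePerGigabyte)
    {
        // total data served, converted from MB to GB
        decimal totalGigabytes = users * packageMegabytes * downloadsPerUser / 1024m;
        return totalGigabytes * pricePerGigabyte;
    }
}
```

For example, 1 million users downloading a 4MB package 5 times each (the initial install plus four patches) at an assumed $0.15 per GB gives Estimate(1000000, 4m, 5, 0.15m) = 2929.6875, so nearly $3,000 for transfers alone.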

Add to that the cost of other peripheral services such as CloudWatch or Load Balancing, etc., which are all perfectly reasonable on their own, but they all add up in the end.

It’s possible to mitigate some of these additional costs. For example, you could make use of the CloudFront service to reduce the amount of data transferred from S3 (data is cached after the first request in a given region), or better still you can architect your application so that it only loads resources at the time when they’re required, but obviously this adds to the complexity of your application and can also complicate the deployment process.

Web Con­sole

The standard AWS Management Console (see image below) is good if unspectacular, and doesn’t cover the full range of the services Amazon provides. There are some important features missing too: for instance, in order to talk to the Auto Scaling service to adjust the scaling parameters (e.g. change the max number of instances from 10 to 5), you have to either download and use a command line tool, use the AWS SDK, or build some UI around it to make life easier for yourselves as we did.

image

There are other vendors such as RightScale which provide you with better tooling to help you manage/automate a lot of the work you otherwise have to do yourself, but they usually have pretty steep pricing and don’t represent great value for money for a small startup looking to get up and running cheaply. You know, the sort of folks that are attracted by cloud computing’s low cost entry point.. wait a minute…

Reli­a­bil­ity

The well goes deep…

I read an interesting story not long ago about the distributed denial-of-service attack on Amazon which the Anonymous collective of online protesters tried to pull off (in protest at Amazon cutting WikiLeaks loose) after successful attempts on several other high profile sites. The attack ended in failure as the organizers admitted that Amazon was too hard a nut to crack, so clearly the well is deep enough for all of us, and then some!

Out­ages and Per­for­mance Drops

Over the last year there have been several outages but they were all resolved fairly quickly. There have also been several occasions where the performance (in terms of response time and/or number of timed-out requests) noticeably dropped for both SimpleDB and S3.

When you request a new EC2 instance, you have to spec­ify which avail­abil­ity zone the instance should be spawned in, but for ser­vices such as Sim­pleDB you don’t have this con­trol and new instances are always spawned in the default avail­abil­ity zone for your cur­rent region.

For instance, if your current region is US East (N. Virginia) the default availability zone is us-east-1a, and everyone who requests EC2 instances without changing the default availability zone will be using the same zone, which is therefore likely to become congested and affect other services such as SimpleDB. There have even been several times when we simply weren’t able to scale up our application because the us-east-1a availability zone had no spare capacity!

It’s very important to build lots of fault tolerance into your application when you’re using AWS. You should also avoid (where possible) using the default availability zones for your EC2 instances as they tend to be the most likely to become congested.
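One cheap piece of fault tolerance is to retry calls that fail for transient reasons (timeouts, throttling) with a growing pause in between. Below is a minimal sketch of retrying with exponential backoff; the Retry class is hypothetical, and for brevity it catches every exception, whereas real code should only retry the errors it knows to be transient.

```csharp
using System;
using System.Threading;

// Hypothetical retry helper: doubles the delay after every failed attempt.
public static class Retry
{
    public static T Execute<T>(Func<T> operation, int maxAttempts, TimeSpan initialDelay)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException("maxAttempts");

        TimeSpan delay = initialDelay;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception)
            {
                // NOTE: catching everything is a simplification for this sketch
                if (attempt >= maxAttempts) throw; // out of attempts: surface the last error
                Thread.Sleep(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
        }
    }
}
```

A call would look something like Retry.Execute(() => _client.ListObjects(request), 3, TimeSpan.FromMilliseconds(200)), which gives up after the third failure and waits 200ms, then 400ms between attempts.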

NOTE: as I mentioned before about data transfer costs, data transfers between EC2 and other AWS services in the same region are free, so there’s no need to worry about incurring extra costs by having your EC2 instances running in a different availability zone to that of SimpleDB/S3.

Bug Report­ing

Being a rapidly and constantly evolving platform, it’s no surprise that the odd bug has been introduced as the result of an update, e.g. for a little while no one was able to remote desktop to any instance whose public IP started with 50, e.g. 50.0.0.0…

There is an active discussion forum where you can report any bugs you notice in the services, and Amazon employees do monitor these forums and provide helpful feedback etc. In addition to that, there’s also a service status dashboard you can use to check the current status of their services by date and by region.

Part­ing thoughts…

So there, a not so quick :-P high level summary of my experience with AWS over the last 12 months. To wrap things up, here are a couple of blogs you could read regularly to find out what’s going on in the ‘cloud’:

High Scal­a­bil­ity

Cloud­Har­mony

Ama­zon Web Ser­vices blog

Well, hope this helps you in some way, belated happy 2011!


If you’re read­ing this post you prob­a­bly already know about Ama­zon Elas­tic Block Store, aka Ama­zon EBS, one of the many ser­vices pro­vided by the Ama­zon Web Ser­vices (AWS) ecosys­tem.

An EBS volume can range from 1GB to 1TB and can be mounted to an Amazon EC2 instance as a device. Each EBS volume can be mounted to only one instance at a time, but multiple volumes can be mounted to the same instance. To mount an EBS volume to an instance you have to first log into the AWS Management Console and follow a few simple steps.

Cre­at­ing a new volume

1. Once inside the AWS Management Console, click the Volumes link under Elastic Block Store in the Navigation panel:

image

2. Click the ‘Cre­ate Vol­ume’ but­ton:

image 

3. In the pop-up form, choose the size of the EBS vol­ume you want to cre­ate and select the avail­abil­ity zone (remem­ber, the EBS vol­ume needs to be in the same avail­abil­ity zone as the instance it needs to be attached to):

image

Click “Cre­ate” to cre­ate a new EBS volume.

Attach­ing an EBS Vol­ume to an Instance

1. Once the EBS vol­ume is cre­ated and its sta­tus changed to ‘avail­able’, select it, and click the “Attach Vol­ume” but­ton:

image

2. In the following pop-up, choose the instance you want to attach the volume to:

image
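The constraints mentioned so far (1GB to 1TB in size, one instance per volume, same availability zone as the instance) can be written down as a simple pre-flight check, which is handy if you ever script these steps instead of clicking through the console. The Volume and Instance classes below are hypothetical stand-ins of my own, not types from the AWS SDK.

```csharp
// Hypothetical stand-ins for illustration; these are not AWS SDK types.
public class Volume
{
    public int SizeGb;
    public string AvailabilityZone;
    public string AttachedInstanceId; // null while the volume is 'available'
}

public class Instance
{
    public string Id;
    public string AvailabilityZone;
}

public static class EbsChecks
{
    // Returns the reason the volume cannot be attached, or null if nothing is wrong.
    public static string WhyCannotAttach(Volume volume, Instance instance)
    {
        if (volume.SizeGb < 1 || volume.SizeGb > 1024)
            return "size must be between 1GB and 1TB";
        if (volume.AttachedInstanceId != null)
            return "volume is already attached to " + volume.AttachedInstanceId;
        if (volume.AvailabilityZone != instance.AvailabilityZone)
            return "volume and instance are in different availability zones";
        return null;
    }
}
```

A 10GB volume in us-east-1b checked against an instance in us-east-1a fails with the availability zone message, which is the mistake that is easiest to make in step 3 of creating the volume.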

Mount the EBS Vol­ume as a drive

1. Now that the EBS vol­ume is attached to the instance, remote con­nect to the instance, go to Start –> Run, and enter “diskmgmt.msc” to start the Disk Man­age­ment tool:

image

2. In the Disk Man­age­ment tool, right click on the new disk, which is “Offline” right now and bring it “Online”:

image

3. The disk then needs to be ini­tial­ized, right click again and select “Ini­tial­ize Disk”:

image

4. Choose whether to ini­tial­ize the disk with MBR or GPT:

image

5. You’re almost there! Right click on the unal­lo­cated space in the ini­tial­ized disk and cre­ate a “New Sim­ple Vol­ume”:

image

and fol­low the steps in the wizard.

6. A 10GB vol­ume will prob­a­bly take 5 ~ 10 mins to for­mat, and once it’s done you’ll be able to see the drive:

image 

image

Part­ing thoughts…

As men­tioned in Membase’s doc­u­men­ta­tion here, when using an Ama­zon EC2 instance as a Mem­base cache server you can use an EBS vol­ume to alle­vi­ate the risk of los­ing your data when an instance fails because the EBS vol­ume can be attached to another instance to effec­tively restore the node.

Another fun fact about EBS and Amazon EC2 is that Windows Server 2008 instances are backed by and run directly from an EBS volume because of the sheer size of the images themselves! This is why for every Windows Server 2008 instance you are running you’ll find a 30GB EBS volume attached to that instance, and every AMI you create will have a matching EBS snapshot. The implication of this is that, if you’re using a Windows instance to run Membase, you won’t need to create additional EBS volumes to store the Membase sqlite files. BUT, there’s a catch: if you terminate the instance manually or through a scaling down event the EBS volume will be deleted too, so be sure to take this into consideration when devising your strategy for deploying and scaling your Membase cache clusters.
