Slides and video from my webinar on localization and design pattern automation

Hello,

The slides and recording of my webinar on Tuesday are now live – thanks to the folks at PostSharp for the quick turnaround!

Upcoming webinar on Localization and Design Pattern Automation

Hello, just a quick note to say that I’m doing a webinar with the PostSharp folks on a technique my team developed whilst working on Here Be Monsters (an MMORPG that had more text than the first three Harry Potter books combined), which allowed us to localise the whole game with a handful of lines of code and an hour’s worth of work.

The webinar will be held at 12:00 EST / 17:00 GMT on Tuesday 15th November, and you can register here.

In the webinar, I’ll cover:

  • common practices of localization
  • challenges and problems with these common approaches
  • how to rethink the localization problem as an automatable implementation pattern
  • pattern automation with PostSharp


Design for Latency Issues

The most common issues I have encountered in production are latency/performance related. They can be symptoms of a whole host of underlying causes, ranging from AWS network issues (which can also manifest as latency/error-rate spikes in any of the AWS services) and overloaded servers, to simple GC pauses.

Latency issues are inevitable – as much as you can improve the performance of your application, things will go wrong, eventually, and often they’re out of your control.

So you must design for them, and degrade the quality of your application gracefully to minimize the impact on your users’ experiences.

As backend developers, one of the traps we often fall into is allowing our dev environments to be too lenient. Servers and databases are never under load during development, so we lure client developers into a false sense of comfort and set them up to fail when the application runs into a slow-responding server in production for the first time.

 

Latency Injection

To program my fellow client developers to always be mindful of latency spikes, we decided to inject random latency delays on every request:

  1. check if we should inject a random delay;
  2. if yes, then work out how much latency to inject and sleep the thread;
  3. and finally invoke the original method to process the request

This is an implementation pattern that can be automated. I wrote a simple PostSharp attribute to do this, whilst piggybacking on existing configuration mechanisms to control its behaviour at runtime.
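
A minimal sketch of such an attribute, using PostSharp’s MethodInterceptionAspect – the 10% injection rate and the delay range below are hard-coded placeholders, whereas the real attribute read them from the configuration mechanism mentioned above:

    using System;
    using System.Threading;
    using PostSharp.Aspects;

    [Serializable]
    public class InjectLatencyAttribute : MethodInterceptionAspect
    {
        // Random is not thread-safe, so give each thread its own instance
        private static readonly ThreadLocal<Random> Rand =
            new ThreadLocal<Random>(() => new Random(Guid.NewGuid().GetHashCode()));

        public override void OnInvoke(MethodInterceptionArgs args)
        {
            // 1. check if we should inject a random delay (10% of requests here)
            if (Rand.Value.NextDouble() < 0.1)
            {
                // 2. work out how much latency to inject and sleep the thread
                var delayMs = Rand.Value.Next(100, 3000);
                Thread.Sleep(delayMs);
            }

            // 3. invoke the original method to process the request
            args.Proceed();
        }
    }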

Then I multicast the attribute to all our service endpoints and my work was done!

We run latency injection in our dev environment; it has helped identify numerous bugs in the client application and has proved to be a worthwhile exercise.

But we didn’t stop there.

 

Error Injection

We throttle user requests to all of our services to stop mischievous players from spamming our servers using proxy tools such as Charles and Fiddler, or handcrafted bots.

But, occasionally, legitimate requests can also be throttled as a result of client bugs, over-zealous retry strategies or incorrectly configured throttling thresholds.

Once again, we decided to make these errors much more visible in the dev environment so that client developers expect them and handle them gracefully.

To do that, we:

  1. set the threshold very low in dev
  2. used a PostSharp attribute to randomly inject throttling errors on operations where it makes sense

The attribute that injects throttling errors is very simple.
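
As a rough sketch, it looks something along these lines – ThrottledException and the 5% injection rate are placeholders for whatever error type and configuration your own services use:

    using System;
    using System.Threading;
    using PostSharp.Aspects;

    [Serializable]
    public class InjectThrottlingErrorAttribute : MethodInterceptionAspect
    {
        // same per-thread Random pattern as the latency injection aspect above
        private static readonly ThreadLocal<Random> Rand =
            new ThreadLocal<Random>(() => new Random(Guid.NewGuid().GetHashCode()));

        public override void OnInvoke(MethodInterceptionArgs args)
        {
            // occasionally fail the request before the original method runs,
            // forcing the client to exercise its error-handling path
            if (Rand.Value.NextDouble() < 0.05)
            {
                throw new ThrottledException("Request throttled (injected in dev)");
            }

            args.Proceed();
        }
    }

    // placeholder exception type – in practice this would be whatever error
    // your throttling mechanism already returns to the client
    public class ThrottledException : Exception
    {
        public ThrottledException(string message) : base(message) { }
    }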

The same approach can be taken to include any service-specific errors that the client should be able to gracefully recover from – session expiration, state out-of-sync, etc.

 

Design for Failure

Simulating latency issues and other errors falls under the practice of Design for Failure, which Simon Wardley identifies as one of the characteristics of a next-generation tech company.


P.S. you should check out Simon’s work on value chain mapping if you haven’t already – it’s inspiring.

 

Chaos Engines

Netflix’s use of Chaos Monkey and Chaos Gorilla is a shining example of Design for Failure at scale.

Chaos Monkey randomly terminates instances to simulate hardware failures and test your system’s ability to withstand such failures.

Chaos Gorilla takes this exercise to the next level and simulates outages of entire Amazon availability zones, testing the system’s ability to automatically re-balance to other availability zones without user-visible impact or manual intervention.

Netflix has taken a lot of inspiration from Release It! by Michael Nygard and Drift into Failure by Sidney Dekker. Both books are awesome and I highly recommend them.


 

Global redundancy, or not

Based on reactions to AWS outages on social media, it’s clear that many of us (ourselves included) do not take full advantage of the cloud for global redundancy.

You might scoff at that, but for many the decision not to have a globally redundant infrastructure is a conscious one, because the cost of such redundancy is not always justifiable.

It’s possible to raise your single point of failure (SPOF) from individual resources/instances, to AZs, to regions, all the way up to entire cloud providers.

But you’re incurring additional costs at each turn:

  • your infrastructure becomes more complex and difficult to reason about;
  • you might need more engineers to manage that complexity;
  • you will need to invest in better tooling for monitoring and automation;
  • you might need more engineers to build those tools;
  • you incur more wastage in CPU/memory/bandwidth/etc. (it is called redundancy for a reason);
  • you have higher network latency for cross-AZ/region communications.

 

Global redundancy at Uber

On the other hand, for many organizations the cost of downtime outweighs the cost of global redundancy.

For instance, for Uber’s customers the cost of switching to a competitor is low, which means availability is of paramount importance for Uber.

Uber devised a rather simple, elegant mechanism for their client applications to fail over seamlessly in the event of a datacentre outage. See this post for more details.

 

Latency Propagation

Finally, as more and more companies adopt a microservices approach, a whole host of challenges will become evident (many of which are discussed in Michael Nygard’s Release It!).

One of these challenges is the propagation of latency through inter-service communications.

If each of your services has a 99th percentile latency of 1s, then only 1% of calls will take longer than 1s when you depend on a single service. But if a request fans out to 100 such services, 63% of calls will take longer than 1s (1 - 0.99^100 ≈ 0.63)!

In this regard, Google fellow Jeff Dean’s paper Achieving Rapid Response Times in Large Online Services presents an elegant solution to this problem.


I haven’t put this into practice myself, but I imagine this can be easily implemented using Rx’s amb operator.
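
For example, a hedged-request helper built on Amb might look like the sketch below. It’s untested on my part; Request, Response and callService are hypothetical stand-ins for your own client types, and it assumes callService returns a cold observable, i.e. the underlying request is only sent when subscribed to:

    using System;
    using System.Reactive.Linq;

    public static class HedgedRequests
    {
        // Request and Response are hypothetical placeholders for your own types
        public static IObservable<Response> Send(
            Request request,
            Func<Request, IObservable<Response>> callService)
        {
            var primary = callService(request);

            // the backup copy of the request is only issued if we are still
            // waiting after 50ms, because its subscription is delayed
            var backup = callService(request)
                .DelaySubscription(TimeSpan.FromMilliseconds(50));

            // Amb propagates whichever sequence reacts first and disposes the
            // other, which cancels the not-yet-subscribed backup if the primary
            // responds quickly
            return primary.Amb(backup);
        }
    }

    public class Request { }
    public class Response { }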

 


Learn to Learn

Being a responsible speaker, I have started preparing my talk – A tour of the language landscape – for NDC Oslo months ahead of time! When I first came up with the idea for this talk, I asked on Twitter if anyone else thought it was a good idea. Phil made a great point about including some information on how I go about learning a new language.

[image: Phil’s reply on Twitter]

I came across this TEDx talk by Josh Kaufman a while back and found it useful in helping me formulate a learning strategy that works for me.

Truth about the 10,000 hours rule

In the talk, Josh debunked the often misquoted 10,000 hours rule. When the study was first published, the finding was that it takes 10,000 hours of deliberate practice to reach the top of an ultra-competitive field. But through a collective game of Chinese whispers, the message was warped into ‘it takes 10,000 hours of deliberate practice to be good at something’.

Instead, Josh found that research suggests 20 hours is all it takes to become reasonably good at something, so long as you make those hours count.

[image: the learning curve]

This is important, because for us busy programmers – who, by the way, have a tendency to work long hours – the time to learn new skills is limited, yet learning them is necessary given how fast our industry moves.

4 steps to learn

Josh proposed these 4 steps to learning anything.

[image: the 4 steps to learn anything]

Deconstruct the skill

Most things we consider as skills are actually bundles of skills. The more we are able to break them up into smaller skills, the better we can decide which of them actually help us achieve what we want out of our learning. We can then prioritise the skills that are most useful to us and improve our ability in the least time possible.

For learning a programming language, you can deconstruct most languages into smaller chunks:

  • variable assignment
  • common data types
  • control flow (if-else, loops, recursion, etc.)
  • working with collection types
  • working with strings
  • error handling
  • concurrency

Most introductory books and tutorials follow this structure already.

Learn enough to self-correct

You should first focus on getting to the point where you can self-correct and self-edit as you learn. For learning a programming language, I interpret this point as:

  • know how to compile and run your code
  • able to put simple programs together, and tweak them to start getting a ‘feel’ for the language

Again, most introductory books and tutorials follow this pattern already and have you build a Hello World example very early on.

Remove practice barriers

Remove distractions – TV, internet, twitter, etc. – so that you can focus on learning. This can be hard when distractions are all around us and so readily available!

I once heard a story about John Carmack that, before a new project, he’d check into a hotel with a bunch of good books and literally cut himself off from the outside world for days, so he could soak up the ideas and inspiration before starting any work on the project.

I’m not saying that you should do the same – obviously different approaches work for different people. Personally, I’m most effective between the hours of 10PM and 2AM because my wife goes to bed early and I’m able to just zone out.

I’m not a heavy Twitter user, or any other social network for that matter, so they’re not a problem for me.

On the other hand, comic-based TV shows are my poison – The Flash, Gotham, Arrow, Agents of SHIELD, etc. To limit the amount of disruption these bring, I binge-watch them in one night so I can have the rest of the nights that week for more constructive uses.

Practice at least 20 hours

Josh raised a good point that, for most things you learn, there is a frustration barrier – the moment when we become consciously incompetent and realise how little we know and how much more we need to learn.

[image: conscious incompetence]

It’s not a great feeling as no one likes to feel stupid, and this is often the point where we lose our momentum and derail our hard-earned progress.

That is why it’s important to pre-commit at least 20 hours of our time, so that if and when we hit this frustration point, we have a good reason to push on – we’ve already budgeted the 20 hours anyway.

Set your goal

Before you start investing a minimum of 20 hours into learning a new language, it helps to decide what you want to get out of the process. Depending on your situation and needs this could be quite different, e.g.

  • are you looking to move to a different language stack and trying to make yourself employable?
  • are you trying to understand the hype around a new language and see what it’s all about?

Personally, most of my learning is aimed at expanding my horizon and allowing me to see beyond the possibilities and options I have at my disposal with the stack that I work with day-to-day.

Other times I might have specific goals of what I want to be able to do in that new language, for instance:

  • I learnt Dart as a replacement for JavaScript for my web development needs
  • I learnt Elm to be better acquainted with functional-reactive programming (FRP) and with the aim of being able to make games using FRP

Prioritise learning a new paradigm

One mistake that I see many people make is to choose to learn a new language over a new paradigm. For example, making the jump from C# to Java is a relatively easy one, but at the end of the day you have learnt a new syntax without necessarily teaching yourself a new way to solve problems.

Learning a new paradigm, on the other hand, fundamentally changes the way you see programming and allows you to see new ways to solve problems. From personal experience, each time I have ventured into a new paradigm – Functional Programming, Aspect-Oriented Programming, Functional Reactive Programming, etc. – it has allowed me to see programming in a new light.

If you’re interested in exploring some less travelled roads, check out these three paradigms recommended by John Croisant.

These two books by Bruce Tate are also a great source for exploratory learning:

[image: Seven Languages in Seven Weeks and Seven More Languages in Seven Weeks]

And finally, I leave you with a great quote from none other than Alan Perlis.

A language that doesn’t affect the way you think about programming is not worth knowing.

– Alan Perlis

Happy learning!


Metricano – simplifying application monitoring

On application monitoring

In the Gamesys social team, our view on application monitoring is that anything that runs in production needs to be monitored extensively, all the time – every service entry point, IO operation and CPU-intensive task. Sure, it comes at the cost of a few CPU cycles, which might mean that you have to run a few more servers at scale, but that’s a small cost to pay compared to:

  • lack of visibility of how your application is performing in production; or
  • inability to spot issues occurring on individual nodes amongst a large number of servers; or
  • longer time to discovery of production issues, which results in
    • longer time to recovery (i.e. longer downtime)
    • loss of customers (immediate result of downtime)
    • loss of customer confidence (longer term impact)

Services such as StackDriver and Amazon CloudWatch also allow you to set up alarms around your metrics so that you can be notified or some automated actions can be triggered in response to changing conditions in production.

In Michael T. Nygard’s Release It!: Design and Deploy Production-Ready Software (a great read by the way), he discusses at length how unfavourable conditions in production can cause cracks to appear in your system, and how, through tight coupling and other anti-patterns, these cracks can accelerate and spread across your entire application and eventually bring it to its knees.

 

In applying extensive monitoring to our services we are able to:

  • see cracks appearing in production early; and
  • collect an extensive amount of data for the post-mortem; and
  • use knowledge gained during post-mortems to identify early warning signs and set up alarms accordingly

 

Introducing Metricano

With our emphasis on monitoring, it should come as no surprise that we have long sought to make it easy for our developers to monitor different aspects of their service.

 

Now, we’ve made it easy for you to do the same with Metricano, an agent-based framework for collecting, aggregating and publishing metrics from your application. At a high level, the MetricsAgent class collects metrics and aggregates them into second-by-second summaries. These summaries are then published to all the publishers you have configured.

 

Recording Metrics

There are a number of options for you to record metrics with MetricsAgent:

Manually

You can call the IncrementCountMetric or RecordTimeMetric methods on an instance of IMetricsAgent (you can use MetricsAgent.Default or create a new instance with MetricsAgent.Create), for example:
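
Something along these lines – note that the exact parameter lists and namespaces below are my assumptions; the Metricano project page has the real signatures:

    using System;
    using System.Diagnostics;
    using System.Threading;
    using Metricano;   // namespace assumed – adjust to wherever MetricsAgent lives

    var agent = MetricsAgent.Default;

    // count how many times something happened
    agent.IncrementCountMetric("MyService.RequestReceived", 1);

    // record how long something took
    var stopwatch = Stopwatch.StartNew();
    Thread.Sleep(100);   // stand-in for the actual work being measured
    stopwatch.Stop();
    agent.RecordTimeMetric("MyService.ProcessRequest", stopwatch.Elapsed);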

 

F# Workflows

From F#, you can also use the built-in timeMetric and countMetric workflows.

 

PostSharp Aspects

Lastly, you can also use the CountExecution and LogExecutionTime attributes from the Metricano.PostSharpAspects nuget package, which can be applied at method, class and even assembly level.

The CountExecution attribute records a count metric with the fully qualified name of the method, whereas the LogExecutionTime attribute records execution times into a time metric with the fully qualified name of the method. When applied at class and assembly level, the attributes are multicast to all encompassed methods – private, public, instance and static. It’s possible to target specific methods by name or visibility, etc.; please refer to PostSharp’s documentation for details.
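
Applying the attributes looks something like the sketch below – the namespace, class and method names are just for illustration, and AttributeTargetTypes is PostSharp’s standard multicast filter; you’d normally pick either the assembly-level multicast or the per-method attributes rather than both:

    using Metricano.PostSharpAspects;

    // assembly-level: multicast both aspects to every method in a namespace
    [assembly: CountExecution(AttributeTargetTypes = "MyApp.Services.*")]
    [assembly: LogExecutionTime(AttributeTargetTypes = "MyApp.Services.*")]

    namespace MyApp.Services
    {
        public class AccountService
        {
            // method-level: measure just this one method
            [CountExecution]
            [LogExecutionTime]
            public void CreateAccount(string userId)
            {
                // ... actual account-creation work ...
            }
        }
    }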

 

Publishing Metrics

All the metrics you record won’t do you much good if they just stay inside the memory of your application server.

To get metrics out of your application server and into a monitoring service or dashboard, you need to tell Metricano to publish metrics with a set of publishers. There is a ready-made publisher for the Amazon CloudWatch service in the Metricano.CloudWatch nuget package.

To add a publisher to the pipeline, use the Publish.With static method, see example here.
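
In code it amounts to a one-liner along these lines (I’ve left out CloudWatchPublisher’s constructor arguments – the linked example has the real ones, and the namespaces are my guesses):

    using Metricano;              // namespaces assumed – adjust as needed
    using Metricano.CloudWatch;

    // register the CloudWatch publisher with the publishing pipeline
    Publish.With(new CloudWatchPublisher(/* see the linked example for the ctor arguments */));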

Since the lowest granularity on Amazon CloudWatch is 1 minute, CloudWatchPublisher aggregates metrics locally and only publishes the aggregates on a per-minute basis, as an optimization to cut down on the number of web requests (which also has a cost impact).

If you want to publish your metric data to another service (StackDriver or New Relic, for instance), you can create your own publisher by implementing the very simple IMetricsPublisher interface. This simple ConsolePublisher, for instance, calculates the 95th percentile execution times and prints them to the console.

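The heart of such a publisher is just a percentile calculation over the execution times recorded in each aggregation window; here is a rough sketch of that part (the wiring into IMetricsPublisher itself is left out – the exact interface definition is best taken from the Metricano source):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class Percentiles
    {
        // nearest-rank percentile: sort the data points and take the value at
        // the ceiling of p * n
        public static double Calculate(IReadOnlyList<double> values, double percentile = 0.95)
        {
            if (values == null || values.Count == 0)
                throw new ArgumentException("no data points", nameof(values));

            var sorted = values.OrderBy(x => x).ToArray();
            var rank = (int)Math.Ceiling(percentile * sorted.Length) - 1;
            return sorted[Math.Max(rank, 0)];
        }
    }

    // usage – executionTimesMs would come from the aggregated metric summaries:
    // var p95 = Percentiles.Calculate(executionTimesMs);
    // Console.WriteLine("MyService.ProcessRequest 95th percentile = " + p95 + "ms");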

In general I find the 95th/97th/99th percentile time metrics much more informative than simple averages, since averages are so easily skewed by even a small number of outlying data points.

 

Summary

I hope you have enjoyed this post and that you’ll find Metricano a useful addition in your application.

I highly recommend reading Release It! – many of the patterns and anti-patterns discussed in the book are becoming more and more relevant in today’s world, where we’re building smaller, more granular microservices. Even the simplest of applications have multiple integration points – social networks, cloud services, etc. – and these are the places where cracks are likely to appear and spread to the rest of your application, unless you have taken measures to guard against such cascading failures.

If you decide to buy the book from Amazon, please use the link I provide below or add the query string parameter tag=theburningmon-21 to the URL, so that I can get a small referral fee and use it towards buying more books and finding more interesting things to write about here :)

 

Links

Metricano project page

Release It!: Design and Deploy Production-Ready Software

Metricano nuget package

Metricano.CloudWatch nuget package

Metricano.PostSharpAspects nuget package

Red-White Push – Continuous Delivery at Gamesys Social