Slides and video from my webinar on localization and design pattern automation

Hello,

The slides and recording of my webinar on Tuesday are now live, thanks to the folks at PostSharp for the quick turnaround!

Upcoming webinar on Localization and Design Pattern Automation

Hello, just a quick note to say that I’m doing a webinar with the PostSharp folks on a technique my team developed whilst working on Here Be Monsters (an MMORPG with more text than the first three Harry Potter books combined), which allowed us to localise the whole game with a handful of lines of code and an hour’s worth of work.

The webinar will be held at 12:00 EST / 17:00 GMT on Tuesday 15th November, and you can register for the webinar here.

In the webinar, I’ll cover:

  • common practices of localization
  • challenges and problems with these common approaches
  • how to rethink the localization problem as an automatable implementation pattern
  • pattern automation with PostSharp


Design for Latency issues

The most common issues I have encountered in production are latency/performance related. They can be symptoms of a whole host of underlying causes, ranging from AWS network issues (which can also manifest themselves as latency/error-rate spikes in any of the AWS services) and overloaded servers, to simple GC pauses.

Latency issues are inevitable – no matter how much you improve the performance of your application, things will eventually go wrong, and often they’re outside your control.

So you must design for them, and degrade the quality of your application gracefully to minimize the impact on your users’ experiences.

As backend developers, one of the fallacies we often fall into is allowing our dev environments to be too lenient. Servers and databases are never under load during development, so we lull client developers into a false sense of security and set them up to fail the first time the application runs into a slow-responding server in production.

 

Latency Injection

To program my fellow client developers to always be mindful of latency spikes, we decided to inject random latency delays on every request:

  1. check if we should inject a random delay;
  2. if yes, then work out how much latency to inject and sleep the thread;
  3. and finally invoke the original method to process the request

This is an implementation pattern that can be automated. I wrote a simple PostSharp attribute to do this, whilst piggybacking existing configuration mechanisms to control its behaviour at runtime.
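Here’s roughly what such an attribute looks like (a minimal sketch: the attribute name, injection probability and delay range are illustrative and hard-coded, whereas the real thing would pick them up from the runtime configuration mentioned above):

using System;
using System.Threading;
using PostSharp.Aspects;
using PostSharp.Serialization;

[PSerializable]
public sealed class InjectLatencyAttribute : MethodInterceptionAspect
{
    // hard-coded for illustration; in practice these come from runtime configuration
    private const double InjectionProbability = 0.1;
    private const int MinDelayMs = 100;
    private const int MaxDelayMs = 3000;

    private static readonly Random Random = new Random();
    private static readonly object Sync = new object();

    public override void OnInvoke(MethodInterceptionArgs args)
    {
        double roll;
        int delayMs;
        lock (Sync) // Random is not thread-safe, so guard the shared instance
        {
            roll = Random.NextDouble();
            delayMs = Random.Next(MinDelayMs, MaxDelayMs);
        }

        // 1. check if we should inject a random delay
        if (roll < InjectionProbability)
        {
            // 2. work out how much latency to inject and sleep the thread
            Thread.Sleep(delayMs);
        }

        // 3. finally, invoke the original method to process the request
        args.Proceed();
    }
}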

Then I multicast the attribute to all our service endpoints and my work was done!
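With PostSharp’s multicasting support, that amounts to a single assembly-level attribute (the target namespace below is hypothetical):

using PostSharp.Extensibility;

// apply latency injection to every public method under our (hypothetical) services namespace
[assembly: InjectLatency(
    AttributeTargetTypes = "MyGame.Services.*",
    AttributeTargetMemberAttributes = MulticastAttributes.Public)]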

We ran latency injection in our dev environment; it helped identify numerous bugs in the client application and proved to be a worthwhile exercise.

But we didn’t stop there.

 

Error Injection

We throttle user requests to all of our services to stop mischievous players from spamming our servers using proxy tools such as Charles and Fiddler, or handcrafted bots.

But, occasionally, legitimate requests can also be throttled as a result of client bugs, an over-zealous retry strategy, or an incorrectly configured throttling threshold.

Once again, we decided to make these errors much more visible in the dev environment so that client developers expect them and handle them gracefully.

To do that, we:

  1. set the threshold very low in dev
  2. used a PostSharp attribute to randomly inject throttling errors on operations where it makes sense

The attribute that injects throttling errors is very simple, and looks something along the lines of:
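Here’s a minimal sketch of the idea. ThrottledException stands in for whatever error type our services actually return when a request is throttled, and the attribute name and injection probability are equally illustrative:

using System;
using PostSharp.Aspects;
using PostSharp.Serialization;

// stand-in for the real throttling error type returned by the service
public class ThrottledException : Exception
{
    public ThrottledException(string message) : base(message) { }
}

[PSerializable]
public sealed class InjectThrottlingErrorAttribute : MethodInterceptionAspect
{
    private const double InjectionProbability = 0.05; // hard-coded for brevity

    private static readonly Random Random = new Random();
    private static readonly object Sync = new object();

    public override void OnInvoke(MethodInterceptionArgs args)
    {
        double roll;
        lock (Sync)
        {
            roll = Random.NextDouble();
        }

        if (roll < InjectionProbability)
        {
            // surface the same kind of error the client would see when throttled for real
            throw new ThrottledException("Request throttled (injected in dev)");
        }

        // otherwise, process the request as normal
        args.Proceed();
    }
}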

The same approach can be taken to include any service-specific errors that the client should be able to recover from gracefully – session expiration, state out-of-sync, etc.

 

Design for Failure

Simulating latency issues and other errors falls under the practice of Design for Failure, which Simon Wardley identifies as one of the characteristics of a next-generation tech company.


P.S. you should check out Simon’s work on value chain mapping if you haven’t already; it’s inspiring.

 

Chaos Engines

Netflix’s use of Chaos Monkey and Chaos Gorilla is a shining example of Design for Failure at scale.

Chaos Monkey randomly terminates instances to simulate hardware failures and test your system’s ability to withstand such failures.

Chaos Gorilla takes this exercise to the next level and simulates outages of entire Amazon availability zones, testing the system’s ability to automatically rebalance to other availability zones without user-visible impact or manual intervention.

Netflix has taken a lot of inspiration from Release It! by Michael Nygard and Drift into Failure by Sidney Dekker. Both books are awesome and I highly recommend them.


 

Global redundancy, or not

Based on reactions to AWS outages on social media, it’s clear that many of us (ourselves included) do not take full advantage of the cloud for global redundancy.

You might scoff at that, but for many the decision not to have a globally redundant infrastructure is a conscious one, because the cost of such redundancy is not always justifiable.

It’s possible to raise your single-point-of-failure (SPOF) from individual resources/instances, to AZs, to regions, all the way to cloud providers.

But you’re incurring additional costs at each turn:

  • your infrastructure becomes more complex and difficult to reason about;
  • you might need more engineers to manage that complexity;
  • you will need to invest in better tooling for monitoring and automation;
  • you might need more engineers to build those tools;
  • you incur more wastage in CPU/memory/bandwidth/etc. (it is called redundancy for a reason);
  • you have higher network latency for cross-AZ/region communications;

 

Global redundancy at Uber

On the other hand, for many organizations the cost of downtime outweighs the cost of global redundancy.

For instance, for Uber’s customers the cost of switching to a competitor is low, which means availability is of paramount importance for Uber.

Uber devised a rather simple, elegant mechanism for their client applications to failover seamlessly in the event of a datacentre outage. See this post for more details.

 

Latency Propagation

Finally, as more and more companies adopt a microservices approach, a whole host of challenges will become evident (many of which are discussed in Michael Nygard’s Release It!).

One of these challenges is the propagation of latency through inter-service communications.

If each of your services has a 99th percentile latency of 1s, then only 1% of calls will take longer than 1s when you depend on just one service. But if you depend on 100 services, then roughly 63% of calls will take more than 1s (the probability that all 100 calls stay under 1s is 0.99^100 ≈ 37%)!

In this regard, Google fellow Jeff Dean’s paper Achieving Rapid Response Times in Large Online Services presents an elegant solution to this problem.


I haven’t put this into practice myself, but I imagine this can be easily implemented using Rx’s amb operator.
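To make the idea concrete, here’s a rough sketch of the backup-request pattern using Rx’s amb operator; the class, method and parameter names are all illustrative, and callPrimary and callBackup would be the same logical query issued against two different replicas:

using System;
using System.Reactive.Linq;
using System.Reactive.Threading.Tasks;
using System.Threading.Tasks;

public static class BackupRequests
{
    // issue the request against the primary replica; if it hasn't responded
    // within 'backupAfter', issue the same request against a backup replica
    // and return whichever response arrives first
    public static Task<T> WithBackup<T>(
        Func<Task<T>> callPrimary,
        Func<Task<T>> callBackup,
        TimeSpan backupAfter)
    {
        var primary = Observable.FromAsync(callPrimary);

        // the backup request is only issued if the timer fires before the primary responds
        var backup = Observable
            .Timer(backupAfter)
            .SelectMany(_ => Observable.FromAsync(callBackup));

        // Amb propagates whichever sequence reacts first and unsubscribes from the other
        return primary.Amb(backup).FirstAsync().ToTask();
    }
}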

 

Links

Announcing libraries for C# and F# to make it easier to integrate with Sentry

Here at the Gamesys social team, we’re rethinking our current approach to logging in general, from both the server’s and the client’s perspective. Having looked at many different alternatives (it was a little hard to imagine just how crowded the log aggregation and visualization space is…), one of the services we have decided to experiment with is Sentry.

Sentry is a fairly simple service with an easy-to-use API, and it’s straightforward to integrate with, especially if you already have a client library (the Sentry docs refer to them as Ravens) for your language of choice. On the .Net side of things, you have a little library called SharpRaven.

As for integration, using a custom log4net appender such as this one is obviously a good way to go, but you still need to implement the try-catch-log pattern everywhere, unless you’re happy for these exceptions to bubble all the way up to the app domain and catch them there. And when I see implementation patterns, I see opportunities to automate them with PostSharp!

C# custom attributes

If you grab the SharpRaven-Contrib package from NuGet, you’ll have access to a pair of custom attributes – RavenLogException and RavenLogExecutionTime – when you open the SharpRaven namespace. For example:
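Applying them to methods looks something like this (the service and method names are made up, and I haven’t shown any threshold configuration for RavenLogExecutionTime, so treat the exact usage as illustrative):

using SharpRaven;

public class LeaderboardService
{
    // any unhandled exception thrown here is captured and logged to Sentry as an error
    [RavenLogException]
    public void SaveScore(string playerId, int score)
    {
        // ...
    }

    // execution time is monitored, and slow executions are logged to Sentry as warnings
    [RavenLogExecutionTime]
    public void RebuildLeaderboard()
    {
        // ...
    }
}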

The attributes do what they say on the tin: RavenLogException captures and logs exception information as errors to Sentry, whilst RavenLogExecutionTime monitors the execution time of your methods and logs as warnings any method executions that take longer than your given threshold.

For F#, however, whilst the attributes still work for methods, chances are you will spend most of your time writing and composing functions instead, and these attributes won’t help you there. So for F# I decided to do something slightly different.

F# workflows

Thankfully, in F#, we have computation expressions* (aka workflows) which already power language features such as async workflows and sequence comprehensions.

Using the workflows defined in the SharpRaven-ContribFs package you can create blocks of code where:

  • any unhandled exceptions are logged as Error in Sentry
  • if the block of code takes longer than the specified threshold to execute, it’ll be logged as a warning in Sentry

and your code remains unchanged; you simply wrap it inside the workflow’s { } block.

Of course, you can also just create wrapper functions to achieve the same results, but I find that using workflows makes for more readable code in this case. Another good alternative is to use a Maybe monad, which I won’t go into in too much detail here, as Scott Wlaschin already has a great explanation of it.

 

As always, the source code for both libraries is available on GitHub, and if you find any issues, feel free to report them via the issues page.

 

* if you’re interested in learning more about computation expressions, I highly recommend Scott Wlaschin’s series on his F# for Fun and Profit blog; it’s by far the most comprehensive and easy-to-understand set of articles I have seen.

 

Links

AOP – A story of how we localized an MMORPG with minimal effort

Here Be Monsters* is a story-driven, episodic MMORPG with over 3,500 items, 1,500 quests, and more text than the first three Harry Potter books combined – so it represented a fairly sizeable challenge when we made the decision to localize the whole game!

 

The Challenge

From a technical point of view, the sheer volume of words is of little consequence, although it is a significant cost concern. It’s the number of places that require localization that represents a maintainability headache.

With a conventional approach, the client application would consume a gettext file containing all the translations, and anywhere it needs to display some text, it would substitute the original text with the localized version.

We found a number of issues with this approach:

  1. a large number of files – Domain Objects/DTOs/game logic/views – need to change during implementation
  2. all future changes need to take localization into account
  3. need to replicate changes across client platforms (Flash, iOS, etc.)
  4. hard to get good test coverage given the scope, especially across all client platforms
  5. easy for regression to creep in during our frequent release cycles
  6. complicates and lengthens regression tests and puts more pressure on already stretched QA resources

 

Sounds like a dark cloud is about to take up permanent residence above all our heads? It felt that way.

 

Our Solution

Instead, we decided to perform localization on the server, as part of the pipeline that validates and publishes the data (quests, achievements, items, etc.) captured in our custom CMS. The publishing process first generates domain objects that are consumable by our game servers, then converts them into DTOs for the clients.

This approach partially addresses points 3 and 4 above, as it centralizes the bulk of the localization work. But it still leaves plenty of unanswered questions, the most important of which was how to implement a solution that is:

  • simple
  • clean – it shouldn’t convolute our code base
  • maintainable – it should be easy to maintain and hard to make mistakes even as we continue to evolve our code base
  • scalable – it should continue to work well as we add more languages and localized DTO types

 

To answer this question, we derived a simple and yet effective solution:

  1. ingest the gettext translation file (the NuGet package SecondLanguage comes in very handy here)
  2. use a PostSharp attribute to intercept string property setters on DTOs and replace the input string with its localized version
  3. repeat for each language to generate a language-specific version of the DTOs

 

For those of you who are not familiar with it, PostSharp is an Aspect-Oriented Programming (AOP) framework for .Net, very similar to AspectJ for Java.

Here is a simplified version of what our Localize attribute looks like:
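(A simplified sketch: GetTranslation here is a stand-in hook for the gettext lookup loaded via SecondLanguage, since the interesting part is the interception of the string property setters.)

using System;
using PostSharp.Aspects;
using PostSharp.Serialization;

[PSerializable]
public sealed class LocalizeAttribute : LocationInterceptionAspect
{
    // stand-in hook into the loaded gettext catalogue; returns the input
    // unchanged when no translation is found
    public static Func<string, string> GetTranslation = s => s;

    public override void OnSetValue(LocationInterceptionArgs args)
    {
        // only rewrite non-empty string values; leave every other property untouched
        var original = args.Value as string;
        if (!string.IsNullOrEmpty(original))
        {
            args.Value = GetTranslation(original);
        }

        // proceed with the original setter, now carrying the localized value
        args.ProceedSetValue();
    }
}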

 

To automatically apply localization to all present and future DTO types (assuming that all the DTO types are defined in one project), simply multicast the attribute and target all types that follow our naming convention:

[assembly: Localize(AttributeTargetTypes = "*DTO")]

and voila, we have localized over 95% of the game with one line of code!

and here’s an example of how an almanac page in the game looks in both English and Brazilian Portuguese:

[Screenshots: the same almanac page in English and in Brazilian Portuguese]

 

I hope you find this little story of how we localized our MMORPG interesting. The moral of the story is really that there is much more to AOP than the same old examples you might have heard so many times before – logging, validation, etc.

With a powerful framework like PostSharp, you are able to do meta-programming on the .Net platform in a structured and disciplined way, and tackle a whole range of problems that would otherwise be difficult to solve. To name a few that pop to mind:

The list goes on, and many of these are available as part of the PostSharp pattern library too, so you even get them out of the box.

 

Links

Design Pattern Automation

PostSharp

 

*you can try the game out on Facebook, HereBeMonstersGame.com or iPad (Monsters HD)