The monster trapping mechanics in Here Be Monsters are fairly straightforward:
- Monsters have a type and a set of stats – Strength, Speed and IQ
- They have a rarity value which determines the likelihood of an encounter
- They have a set of baits they like, which can increase the likelihood of an encounter
- Traps can catch monsters of matching types
- Traps also have a set of stats – Strength, Speed and Tech
- Chance of catching a monster is determined by the trap’s stats vs the monster’s stats
It’s as simple as it sounds. Unless, of course, you’re the game designer responsible for setting the stats for the trap so that:
a. you achieve the intended catch rate % against each of the monsters, and
b. the distribution of the stats ‘makes sense’, i.e. a low-tech box trap should have a higher Strength stat than Tech
The naive approach would be to start with a guesstimate and then use trial and error until you converge upon an answer, or an approximation that is considered good enough. This is laborious and error-prone, and unlikely to yield the optimal result (barring the Herculean effort of a persistent game designer).
To automate this process and aid our game designers, we designed and implemented a simple genetic algorithm in F# that would search and find the optimal solution based on:
- intended % catch rate for each monster
- an error margin
- an initial set of stats that defines the ideal distribution of stats
The game designers can use our custom web tool to run the algorithm, for example:
In simple terms, a genetic algorithm starts with a set of potential solutions and iteratively generates new generations of solutions using a selection and a mutation process such that:
- the selection process chooses which of the solutions survive (survival of the fittest and all) based on a fitness function
- the mutation process generates new solutions from the surviving solutions
The iteration continues until one of the termination conditions has been met, for example when a solution is found or the maximum number of generations has been reached.
In our algorithm, each solution is a set of stats for the trap, and the selection process calculates the catch rate for each of the monsters using the solution, and keeps the solution if it’s better than the solution it’s mutated from.
The mutation process then takes each of the surviving solutions and generates new solutions by tweaking the stats in a number of ways:
- +/- a small amount from each of Strength/Speed/Tech (generates better solutions when we’re close to optimal solutions)
- +/- a large amount from each of Strength/Speed/Tech (generates noticeably different solutions when we’re far from optimal solutions)
So from an initial solution of Strength:100, Speed:100 and Tech:200, you can end up with a number of different solutions for the next generation:
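To make the mutation step concrete, here is a minimal Python sketch (the real implementation is in F# and is not shown in this post). The step sizes, the `TrapStats` type and the choice to let each stat move down, stay put or move up are all assumptions made for illustration.

```python
from itertools import product
from typing import List, NamedTuple


class TrapStats(NamedTuple):
    strength: int
    speed: int
    tech: int


# Hypothetical step sizes: the post only says "small" and "large".
SMALL_STEP = 5
LARGE_STEP = 50


def mutate(parent: TrapStats, step: int) -> List[TrapStats]:
    """Return every variant of `parent` with each stat moved by -step, 0 or +step."""
    children = []
    for deltas in product((-step, 0, step), repeat=3):
        child = TrapStats(*(max(0, stat + d) for stat, d in zip(parent, deltas)))
        if child != parent:
            children.append(child)
    return children


def next_generation(parent: TrapStats) -> List[TrapStats]:
    # Small tweaks refine a near-optimal solution; large tweaks escape a poor one.
    return mutate(parent, SMALL_STEP) + mutate(parent, LARGE_STEP)


initial = TrapStats(strength=100, speed=100, tech=200)
generation = next_generation(initial)
```

With these assumed step sizes, `generation` contains candidates such as `TrapStats(105, 100, 200)` (a small nudge) and `TrapStats(50, 150, 250)` (a large jump).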
This process continues until either:
- the max number of generations has been exhausted, or
- none of the new solutions survive the selection process
The final survivors are then filtered using the error margin specified by the game designer, and sorted by how close they are to the specified target catch rates, e.g.
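Putting selection, mutation, termination and the final filter together, the whole loop might be sketched in Python as follows. This is not the production F# code: the `catch_rate` formula, the step sizes, the population cap and the monster stats are all invented for the example, since the post does not give the real ones.

```python
from itertools import product
from typing import Dict, List, Tuple

# (Strength, Speed, Tech) for a trap; (Strength, Speed, IQ) for a monster.
Stats = Tuple[int, int, int]


def catch_rate(trap: Stats, monster: Stats) -> float:
    """Hypothetical formula: the trap's average share of each paired stat."""
    return sum(t / (t + m) for t, m in zip(trap, monster)) / 3


def error(trap: Stats, monsters: Dict[str, Stats], targets: Dict[str, float]) -> float:
    """Largest gap between the achieved and the intended catch rate."""
    return max(abs(catch_rate(trap, monsters[name]) - target)
               for name, target in targets.items())


def mutate(parent: Stats, steps: Tuple[int, ...] = (2, 25)) -> List[Stats]:
    """Move each stat by -step, 0 or +step, for a small and a large step."""
    return [tuple(max(1, stat + delta) for stat, delta in zip(parent, deltas))
            for step in steps
            for deltas in product((-step, 0, step), repeat=3)
            if any(deltas)]


def search(initial, monsters, targets, margin, max_generations=100, population=20):
    def err(stats):
        return error(stats, monsters, targets)

    survivors, seen = [initial], {initial}
    for _ in range(max_generations):
        # Selection: a child survives only if it beats the parent it came from.
        children = {child
                    for parent in survivors
                    for child in mutate(parent)
                    if child not in seen and err(child) < err(parent)}
        if not children:
            break  # none of the new solutions survived the selection process
        seen |= children
        # Assumed population cap, so the number of candidates stays bounded.
        survivors = sorted(children | set(survivors), key=err)[:population]
    # Keep only survivors within the designer's error margin, closest first.
    return sorted((s for s in survivors if err(s) <= margin), key=err)


monsters = {"gnome": (80, 60, 40), "troll": (200, 50, 20)}  # made-up stats
targets = {"gnome": 0.65, "troll": 0.55}                    # intended catch rates
solutions = search((100, 100, 200), monsters, targets, margin=0.02)
```

The initial stats act as the designer's "ideal distribution": because every candidate descends from them through small or large tweaks, the results tend to stay near that shape rather than wandering to an arbitrary corner of the search space.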
We have also applied the same technique and implemented genetic algorithms to:
- find stats for a monster that will give it the intended catch rate for a number of traps (the inverse of the above)
- find configuration for baits so that we can achieve the desired encounter rate with a monster when using this bait (see below for an example)
So there you go, I hope you enjoyed reading about another place where a bit of F# magic has come in handy.
The code for the genetic algorithms is not very complicated (or very long) but it is very specific to our domain, which is why I haven’t included our actual code in this post. But hopefully I’ve managed to give you at least a flavour of what genetic algorithms are and how you might be able to apply them (with F# if possible!) in your own solution.
Nowadays you see plenty of stories about Continuous Integration, Continuous Delivery and Continuous Deployment on the web, and it’s great to see that the industry is moving in this direction, with more and more focus on automation rather than hiring humans to do a job that machines are so much better at.
But most of these stories are also not very interesting, because they tend to revolve around MVC-based web sites that control both the server and the client (since the client is just the server-generated HTML), so there are really no synchronization or backward compatibility issues between the server and the client. It’s a great place to be to not have those problems, but they are real concerns for us for reasons we’ll go into shortly.
The Netflix Way
For me, there are a number of things that make the Netflix continuous deployment story interesting and worth studying:
- Scale – more than 1000 different client devices and over a quarter of the internet traffic
- Aminator – whilst most of us try to avoid creating new AMIs when we need to deploy new versions of our code, Netflix has decided to go the other way and instead automate away the painful, manual steps involved with creating new AMIs, and in return get better start-up time as their VMs come pre-baked
- Use of Canary Deployment – dipping your toe in the water by routing a small fraction of your traffic to a canary cluster to test it out in the wild (it’s worth mentioning that this facility is also provided out-of-the-box by Google AppEngine)
- Red/Black push – a clever word play (and reference to the Netflix colour I presume?) on the classic blue-green deployment, but also making use of AWS’s auto-scaling service as well as Netflix’s very own Zuul and Asgard services for routing and deployment.
I’ve not heard any updates yet, but I’m very interested to see how the Netflix deployment pipeline has changed over the last 12 months, especially now that Docker has become widely accepted in the DevOps community. I wonder if Docker is a viable alternative to baking AMIs, and whether Aminator could be adapted (and renamed, since it would no longer be baking AMIs) to bake Docker images which can then be fetched and deployed from a private repository.
If you have seen any recent talks/posts that provide more up-to-date information, please feel free to share in the comments.
Need for Backward Compatibility
One interesting omission from all the Netflix articles and talks I have found so far has been how they manage backward compatibility issues between their server and client. One would assume that it must be an issue that comes up regularly whenever you introduce a big new feature or breaking changes to your API and you are not able to do a synchronous, controlled update to all your clients.
To illustrate a simple scenario that we run into regularly, let’s suppose that in a client-server setup:
- we have an iPhone/iPad client for our service which is currently version 1.0
- we want to release a new version 1.1 with brand spanking new features
- version 1.1 requires breaking changes to the service API
In the scenario outlined above, the server changes must be deployed before reviewers from Apple open up the submitted build or else they will find an unusable/unstable application that they’ll no doubt fail and put you back to square one.
Additionally, after the new version has been approved and you have marked it as available in the AppStore, it takes up to a further 4 hours before the change is propagated through the AppStore globally.
This means your new server code has to be backward compatible with the existing (version 1.0) client.
In our case, we currently operate a number of social games on Facebook and mobile (both iOS and Android devices) and each game has a complete and independent ecosystem of backend services that support all its client platforms.
Backward compatibility is an important issue for us because of scenarios such as the one above, which is further complicated by the involvement of other app stores and platforms such as Google Play and Amazon App Store.
We also found through experience that every time we force our players to update the game on their mobile devices, we alienate and anger a fair chunk of our player base, who will leave the game for good and occasionally leave harsh reviews along the way. This is why, even though we have the capability to force players to update, we use it only as a last resort. The implication is that in practice you can have many versions of the client all accessing the same backend service, which has to maintain backward compatibility all the way through.
Deployment at Gamesys Social
Currently, most of our games follow this basic deployment flow:
The steps involved in releasing to production follow the basic principles of Blue-Green Deployment and although it helps eliminate downtime (since we are pushing out changes in the background whilst keeping the service running so there is no visible disruption from the client’s point-of-view) it does nothing to eliminate or reduce the need for maintaining backward compatibility.
Instead, we diligently manage backward compatibility via a combination of careful planning, communication, domain expertise and testing. Whilst it has served us well enough so far it’s hardly fool-proof, not to mention the amount of coordinated effort required and the extra complexity it introduces to our codebase.
Having considered going down the API versioning route and its maintainability implications, we decided to look for a different way, which is how we ended up with a variant of Netflix’s Red-Black deployment approach that we internally refer to as the Red-White Push.
Our Red-White Push approach takes advantage of our existing discovery mechanism whereby the client authenticates itself against a client-specific endpoint along with the client build version.
Based on the client type and version the discovery service routes the client to the corresponding cluster of game servers.
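As a rough illustration of that routing decision, here is a small Python sketch. The routing table, the host names and the `discover` function are hypothetical; the post does not describe how the real discovery service is implemented.

```python
from typing import Dict, Tuple

# Hypothetical routing table: (client type, build version) -> game server cluster.
clusters: Dict[Tuple[str, str], str] = {
    ("ios", "1.0"): "white-cluster.example.com",  # old but compatible server code
    ("ios", "1.1"): "red-cluster.example.com",    # new server code with breaking changes
}


def discover(client_type: str, build_version: str) -> str:
    """Return the cluster that serves this client type and build version."""
    try:
        return clusters[(client_type, build_version)]
    except KeyError:
        raise LookupError(f"no cluster serves {client_type} v{build_version}") from None
```

In this sketch, retiring a client version amounts to removing its entry from the table, after which `discover` refuses that build.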
With this new flow, the earlier example might look something like this instead:
The key differences are:
- instead of deploying over the existing service whilst maintaining backward compatibility, we deploy to a new cluster of nodes which will only be accessed by v1.1 clients, hence no need to support backward compatibility
- existing v1.0 clients will continue to operate and will access the cluster of nodes running old (but compatible) server code
- we scale down the white cluster gradually as players update to the v1.1 client
- once we decide to no longer support v1.0 clients, we can safely terminate the white cluster
Despite what the name suggests, you are not actually limited to only red and white clusters. Furthermore, you can still use the aforementioned Blue-Green Deployment for releases that don’t introduce breaking changes (and therefore don’t require synchronized updates to both client and server).
We’re still a long way from where we want to be and there are still lots of things in our release process that need to be improved and automated, but we have come a long way from even 12 months ago.
As one of my ex-colleagues said:
“Releases are not exciting anymore”
– Will Knox-Walker
and that is the point – making releases non-events through automation.
In our MMORPG title Here Be Monsters, we offer the players a virtual world to explore where they can visit towns and spots; forage fruits and gather insects and flowers; tend to farms and animals in their homesteads; make in-game buddies and help each other out; craft new items using things they find in their travels; catch and cure monsters corrupted by the plague; help out troubled NPCs and aid the Ministry of Monsters in its struggle against the corruption, and much more!
All in all, there are close to a hundred distinct actions that can be performed in the game and more are added as the game expands. At the very centre of everything you do in the game is a quest and achievements system that can tap into all these actions and reward you once you’ve completed a series of requirements.
However, such a system is complicated by the snowball effect that can occur following any number of actions. The following animated GIF paints an accurate picture of a cyclic set of chain reactions that can occur following a simple action:
In this instance,
- catching a Gnome awards EXP, gold and occasionally loot drops, in addition to fulfilling any requirement for catching a gnome;
- getting the item as loot fulfils any requirements for you to acquire that item;
- the EXP and gold awarded to the player can fulfil requirements for acquiring certain amounts of EXP or gold respectively;
- the EXP can allow the player to level up;
- levelling up can then fulfil a requirement for reaching a certain level as well as unlocking new quests that were previously level-locked;
- levelling up can also award you with items and gold and the cycle continues;
- if all the requirements for a quest are fulfilled then the quest is complete;
- completing a quest will in turn yield further rewards of EXP, gold and items and restarts the cycle;
- completing a quest can also unlock follow-up quests as well as fulfilling quest-completion requirements.
The same requirements system is also in place for achievements, which represent longer term goals for players to play for (e.g. catch 500 spirit monsters). The achievement and quest systems are co-dependent and feed into each other; many of the milestone achievements we currently have in the game depend upon quests being completed:
Technically there is a ‘remote’ possibility of deadlocks but right now it exists only as a possibility since new quest/achievement contents are generally played through many many times by many people involved in the content generation process to ensure that they are fun, achievable and that at no point will the players be left in a state of limbo.
This cycle of chain reactions introduces some interesting implementation challenges.
For starters, the different events in the cycle (levelling up, catching a monster, completing a quest, etc.) are handled and triggered from different abstraction layers that are loosely coupled together, e.g.
- Level controller encapsulates all logic related to awarding EXP and levelling up.
- Trapping controller encapsulates all logic related to monster catching.
- Quest controller encapsulates all logic related to quest triggering, progressing and completions.
- Requirement controller encapsulates all logic related to managing the progress of requirements.
- and many more..
Functionally, the controllers form a natural hierarchy whereby higher-order controllers (such as the trapping controller) depend upon lower-order controllers (such as the level controller) because they need to be able to award players with EXP, items and so on. However, in order to facilitate the desired flow, theoretically all controllers will need to be able to listen and react to events triggered by all other controllers.
To make matters worse, there are also non-functional requirements which require the ability to tap into this rich and continuous stream of events, such as:
- Analytics tracking – every action the player takes in the game is recorded along with the context in which they occurred (e.g. caught a gnome with the trap X, acquired item Z, completed quest Q, etc.)
- 3rd party reporting – notify ad partners on key milestones to help them track and monitor the effectiveness of different ad campaigns
For the components that process this stream of events, we also wanted to make sure that our implementation is:
- strongly cohesive – code that deals with a particular feature (quests, analytics tracking, community goals, etc.) is encapsulated within the same module
- loosely coupled – code that deals with different features should not be directly dependent on each other and where possible they should exist completely independently
Since the events are generated and processed within the context of one HTTP request (the initial action from the user), the stream also has a lifetime that is scoped to the HTTP request itself.
And finally, in terms of performance, whilst it’s not a latency critical system (generally a round-trip latency of sub-1s is acceptable) we generally aim for a response time (between request reaching the server and the server sending back a response) of 50ms to ensure a good round-trip latency from the user’s perspective.
In practice though, the last-mile latency (from your ISP to you) has proven to be the most significant factor in determining the round-trip latency.
After considering several approaches:
- Vanilla .Net events
- Reactive Extensions (Rx)
- CEP platforms such as Esper or StreamInsight
we decided to go with a tailor-made solution for the problem at hand.
In this solution we introduced two abstractions:
- Facts – which are special events for the purpose of this particular system; we call them facts in order to distinguish them from the events we already record for analytics purposes. A fact contains information about an action or a state change as well as the context in which it occurred, e.g. a CaughtMonster fact would contain information about the monster, the trap, the bait used, where in the world the action occurred, as well as the rewards the player received.
- Fact Processor – a component which processes a fact.
As a request (e.g. to check our trap to see if we’ve caught a monster) comes in, the designated request handler will first perform all the relevant game logic for that particular request, accumulating facts along the way from the different abstraction layers that have to work together to process this request.
At the end of the core game logic, the accumulated facts are then forwarded to each of the configured fact processors in turn. The fact processors might choose to process or ignore each of the facts.
In choosing to process a fact, a fact processor can cause state changes or other interesting events to occur, which result in follow-up facts being added to the queue.
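The queue-draining flow described above can be sketched in a few lines of Python. This is an illustration only: the fact types, the two processors and their rules (a one-gnome quest, a level every 100 EXP) are made up, and the real system is not written in Python.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class CaughtMonster:
    monster: str


@dataclass(frozen=True)
class QuestCompleted:
    quest: str
    reward_exp: int


@dataclass(frozen=True)
class LevelledUp:
    new_level: int


class QuestProcessor:
    """Completes a hypothetical 'catch N gnomes' quest."""

    def __init__(self, needed: int):
        self.needed, self.caught = needed, 0

    def __call__(self, fact) -> list:
        if isinstance(fact, CaughtMonster) and fact.monster == "gnome":
            self.caught += 1
            if self.caught == self.needed:
                return [QuestCompleted("Gnome Hunter", reward_exp=50)]
        return []  # every other fact is ignored


class LevelProcessor:
    """Awards quest EXP and levels the player up every 100 EXP (made-up rule)."""

    def __init__(self, exp: int, level: int = 1):
        self.exp, self.level = exp, level

    def __call__(self, fact) -> list:
        if isinstance(fact, QuestCompleted):
            self.exp += fact.reward_exp
            if self.exp >= 100:
                self.exp -= 100
                self.level += 1
                return [LevelledUp(self.level)]
        return []


def process_facts(facts: Iterable[object], processors: List[Callable]) -> List[object]:
    """Offer each fact to every processor; follow-up facts join the queue."""
    queue, processed = deque(facts), []
    while queue:
        fact = queue.popleft()
        processed.append(fact)
        for processor in processors:
            queue.extend(processor(fact))
    return processed


history = process_facts([CaughtMonster("gnome")],
                        [QuestProcessor(needed=1), LevelProcessor(exp=60)])
```

Here a single `CaughtMonster` fact completes the quest, and the quest's EXP reward in turn triggers a level-up, so `history` ends up holding three facts even though the request produced only one.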
The system described above has the benefits of being:
- Simple – easy to understand and reason about, easy to modularise, no complex orchestration logic or spaghetti code.
- Flexible – easy to change information captured by facts and processing logic in fact processors
- Extensible – easy to add new facts and/or fact processors into the system
The one big downside is that the system needs many types of facts to work, which could add to your maintenance overhead and requires lots of boilerplate class setup.
To address these potential issues, we turned to F#’s discriminated unions over standard .Net classes for its succinctness. For a small number of facts you can have something as simple as the following:
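For readers who have not met discriminated unions, the idea is a single type with a closed set of cases, each carrying its own data. As a loose Python approximation (not the F# definition from the game, and with invented fields), a handful of facts could look like this:

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class CaughtMonster:
    monster: str
    trap: str
    bait: str


@dataclass(frozen=True)
class GainedExp:
    amount: int


@dataclass(frozen=True)
class LevelUp:
    old_level: int
    new_level: int


# The "union": a fact is exactly one of these cases.
Fact = Union[CaughtMonster, GainedExp, LevelUp]


def describe(fact: Fact) -> str:
    """Handle each case explicitly, much like a pattern match."""
    if isinstance(fact, CaughtMonster):
        return f"caught a {fact.monster} with {fact.trap} using {fact.bait}"
    if isinstance(fact, GainedExp):
        return f"gained {fact.amount} EXP"
    if isinstance(fact, LevelUp):
        return f"levelled up from {fact.old_level} to {fact.new_level}"
    raise TypeError(f"unknown fact: {fact!r}")
```

In F# each case is a single line of the union declaration, which is where the succinctness mentioned above comes from; the Python version needs a class per case.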
However, as we mentioned earlier, there are a lot of different actions that can be performed in Here Be Monsters and therefore many facts will be required to track those actions as well as the state changes that occur during those actions. The simple approach above is not a scalable solution in this case.
Instead, you could use a combination of a marker interface and pattern matching to split the facts into a number of specialized discriminated union types.
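The same shape can be approximated in Python with an empty base class standing in for the marker interface and one intermediate class per group of facts. The group and fact names below are illustrative assumptions, not the game's actual types.

```python
from dataclasses import dataclass


class Fact:
    """Marker base type: no members, it only tags a type as a fact."""


class StateChangeFact(Fact):
    """Groups the facts about changes to the player's state."""


class QuestFact(Fact):
    """Groups the facts about quests."""


@dataclass(frozen=True)
class LevelUp(StateChangeFact):
    new_level: int


@dataclass(frozen=True)
class GainedGold(StateChangeFact):
    amount: int


@dataclass(frozen=True)
class QuestCompleted(QuestFact):
    quest: str


def summarise(fact: Fact) -> str:
    """Match on the group first, then on the specific case within it."""
    if isinstance(fact, StateChangeFact):
        if isinstance(fact, LevelUp):
            return f"state change: reached level {fact.new_level}"
        return f"state change: {type(fact).__name__}"
    if isinstance(fact, QuestFact):
        return f"quest: {type(fact).__name__}"
    return "ignored"
```

A processor that only cares about quests can test for `QuestFact` once and ignore everything else, so adding a new state-change fact does not touch it.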
Update 2014/07/28: thank you to @johnazariah for bringing this up. The reason for choosing a marker interface rather than a hierarchical discriminated union in this case is that it makes interop with C# easier.
In C#, you can create the StateChangeFacts.LevelUp union case using the compiler-generated StateChangeFacts.NewLevelUp static method, but it’s not as readable as the equivalent F# code.
With a hierarchical DU the code will be even less readable, e.g. Fact.NewStateChange(StateChangeFacts.NewLevelUp(…))
To wrap things up, once all the facts are processed and we have dealt with the request in full we need to generate a response back to the client to report all the changes to the player’s state as a result of this request. To simplify the process of tracking these state changes and to keep the codebase maintainable we make use of a Context object for the current request (similar to HttpContext.Current) and make sure that each state change (e.g. EXP, energy, etc.) occurs in only one place in the codebase and that change is tracked at the point where it occurs.
At the end of each request, all the changes that have been collected are then copied from the current Context object onto the response object if it implements the relevant interface – for example, all the quest-related state changes are copied onto a response object if it implements the IHasQuestChanges interface.
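A minimal Python sketch of that copying step follows. `RequestContext`, `HasQuestChanges`, `HasExpChanges` and the response types are hypothetical stand-ins for the Context object and interfaces such as IHasQuestChanges; the fields are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestContext:
    """Collects the state changes made while handling one request."""
    quest_changes: List[str] = field(default_factory=list)
    exp_gained: int = 0

    def add_exp(self, amount: int) -> None:
        # The single place where an EXP change happens and is tracked.
        self.exp_gained += amount


class HasQuestChanges:
    """Stand-in for an interface like IHasQuestChanges."""


class HasExpChanges:
    """Stand-in for a similar interface covering EXP changes."""


@dataclass
class CheckTrapResponse(HasQuestChanges, HasExpChanges):
    caught: bool
    quest_changes: List[str] = field(default_factory=list)
    exp_gained: int = 0


@dataclass
class PingResponse:
    ok: bool = True


def copy_changes(context: RequestContext, response):
    """Copy only the kinds of change the response declares it can carry."""
    if isinstance(response, HasQuestChanges):
        response.quest_changes = list(context.quest_changes)
    if isinstance(response, HasExpChanges):
        response.exp_gained = context.exp_gained
    return response


context = RequestContext()
context.add_exp(25)
context.quest_changes.append("Gnome Hunter: completed")
trap_response = copy_changes(context, CheckTrapResponse(caught=True))
ping_response = copy_changes(context, PingResponse())
```

Because the copy is driven by what the response type declares, a handler never has to remember which changes to report: `trap_response` picks up both kinds of change, while `ping_response` is left untouched.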
In Here Be Monsters, we have a story-driven, episodic MMORPG that has over 3500 items and 1500 quests, and with more text than the first three Harry Potter books combined – so it represented a fairly sizable challenge when we made the decision to localize the whole game!
From a technical point of view, the sheer volume of words is of little consequence, although it is a significant cost concern. It’s the number of places that require localization that represents a maintainability headache.
With a conventional approach, the client application would consume a gettext file containing all the translations, and anywhere it needs to display some text it’ll substitute the original text with the localized text instead.
We found a number of issues with this approach:
- large number of files – Domain Objects/DTOs/game logic/view – need to change during implementation
- all future changes need to take localization into account
- need to replicate changes across client platforms (Flash, iOS, etc.)
- hard to get good test coverage given the scope, especially across all client platforms
- easy for regression to creep in during our frequent release cycles
- complicates and lengthens regression tests and puts more pressure on already stretched QA resources
Sounds like a dark cloud is about to take permanent residence above all our heads? It felt that way.
Instead, we decided to perform localization on the server as part of the pipeline that validates and publishes the data (quests, achievements, items, etc.) captured in our custom CMS. The publishing process first generates domain objects that are consumable by our game servers, then converts them to DTOs for the clients.
This approach partially addresses points 3 and 4 above as it centralizes the bulk of the localization work. But it still leaves plenty of unanswered questions, the most important of which is how to implement a solution that is:
- clean – it shouldn’t convolute our code base
- maintainable – it should be easy to maintain and hard to make mistakes even as we continue to evolve our code base
- scalable – it should continue to work well as we add more languages and localized DTO types
To answer this question, we devised a simple and yet effective solution:
- ingest the gettext translation file (the nuget package SecondLanguage comes in very handy here)
- use a PostSharp attribute to intercept string property setters on DTOs to replace input string with the localized version
- repeat for each language to generate a language specific version of the DTOs
For those of you who are not familiar with it, PostSharp is an Aspect-Oriented Programming (AOP) framework for .Net, very similar to AspectJ for Java.
Here is a simplified version of what our Localize attribute looks like:
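For readers without PostSharp to hand, the underlying idea (intercept every string assignment on a DTO and swap in the translation) can be imitated in plain Python. To be clear, this is an analogy rather than the Localize attribute itself: `localized`, `ItemDTO` and the sample translations are all invented for the example.

```python
from typing import Dict


def localized(dto_type: type, translations: Dict[str, str]) -> type:
    """Return a subclass of `dto_type` that translates strings as they are set."""

    class Localized(dto_type):
        def __setattr__(self, name, value):
            if isinstance(value, str):
                # Fall back to the original text when no translation exists.
                value = translations.get(value, value)
            super().__setattr__(name, value)

    Localized.__name__ = f"Localized{dto_type.__name__}"
    return Localized


class ItemDTO:
    def __init__(self, name: str, description: str, price: int):
        self.name = name
        self.description = description
        self.price = price


# Hypothetical translations, standing in for an ingested gettext file.
PT_BR = {"Apple": "Maçã", "A crunchy fruit.": "Uma fruta crocante."}

PortugueseItemDTO = localized(ItemDTO, PT_BR)
item = PortugueseItemDTO("Apple", "A crunchy fruit.", 10)
```

The DTO's own code never mentions localization: `item.name` comes out as the translated text, non-string fields such as `price` pass through unchanged, and a plain `ItemDTO` still holds the original English.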
To automatically apply localization to all present and future DTO types (assuming that all the DTO types are defined in one project), simply multicast the attribute and target all types that follow our naming convention:
[assembly: Localize(AttributeTargetTypes = "*DTO")]
and voila, we have localized over 95% of the game with one line of code!
and here’s an example of how an almanac page in the game looks in both English and Brazilian Portuguese:
I hope you find this little story of how we localized our MMORPG interesting, and the moral of the story is really that there is much more to AOP than the same old examples you might have heard so many times before – logging, validation, etc.
With a powerful framework like PostSharp, you are able to do meta-programming on the .Net platform in a structured and disciplined way and tackle a whole range of problems that would otherwise be difficult to solve. To name a few that come to mind:
- String interning
- Auto-implement INotifyPropertyChanged
- Auto add DataContract and DataMember attributes
- UI thread dispatching
- Performance monitoring
- Transaction handling
- Backing property with a registry value
- Making an event asynchronous
- Raise event when object is Finalized
- Dynamically introducing an interface
the list goes on, and many of these are available as part of the PostSharp pattern library too so you even get them out of the box.