Exercises in Programming Style–Map Reduce

You can become a serverless blackbelt. Enrol to my 4-week online workshop Production-Ready Serverless and gain hands-on experience building something from scratch using serverless technologies. At the end of the workshop, you should have a broader view of the challenges you will face as your serverless architecture matures and expands. You should also have a firm grasp on when serverless is a good fit for your system as well as common pitfalls you need to avoid. Sign up now and get 15% discount with the code yanprs15!

NOTE : read the rest of the series, or check out the source code.

If you enjoy read­ing these exer­cises then please buy Crista’s book to sup­port her work.

exercises-prog-styles-cover

Following on from the last post, we will look at the Map Reduce style today.

 

Style 30 – Map Reduce

Constraints

  • Input data is divided in chunks, similar to what an inverse multiplexer does to input signals.
  • A map function applies a given worker function to each chunk of data, potentially in parallel.
  • A reduce function takes the results of the many worker functions and recombines them into a coherent output.

 

Map reduce should be a familiar paradigm to most developers nowadays. In today’s style we’ll use a very simple map-reduce flow to solver the term frequency problem.

Style30-flow-3

 

As you can see from the above diagram, we’ll take an input text and split it into X no. of chunks which can be processed in parallel. To do this, we’ll first add a partition function below.

Style30-01

Each chunk of the original input text can then be split and filtered (using the list of stop words) independently.

Two things to note about the splitWords function below:

a) it returns an array of (word, count) tuples

b) it doesn’t perform any local aggregation, so if the same word appear twice in its chunk then it’ll appear as (word, 1) twice in the returned array

Style30-02

Once all the chunks has been processed, we can reduce over them with the countWords function below.

Style30-03

Here, I have taken an imperative approach for efficiency sake, and because that’s what Crista had done in her Python solution. Alternatively, you could have written something along the lines of the following.

Style30-03v2

which is much more concise and functional in style, but about 25% slower (~100ms) than the imperative approach.

Next, we’ll add a sort function that’ll sort the array of (word, count) values in descending order by the count (i.e. the snd element in the tuple).

Style30-04

And finally, we wire everything together using pipes.

Style30-05

Job done!

 

or is it?

So far, I have followed Crista’s approach in her Python solution, and the solution meets the constraints of this style. But, we haven’t taken advantage of the potential for parallelism here.

 

Version 2 – Parallel Processing

To make the processing of the chunks happen in parallel, let’s modify the splitWords function to make it async.

Style30-06

Notice that not much has changed from our non-async version! Except for the minor optimization we added to aggregate the words in the chunk instead of returning duplicated (word, 1) tuples.

The next thing we need to do, is to wrap the original pipeline in an async workflow.

Here, once the partitions have been created, we’ll process them in parallel and wait for all the parallel processing to complete using Async.Parallel before proceeding with the rest of the pipeline as before.

Style30-07

 

You can find the source code for this exercise here (v1) and here (v2 – async).

Liked this article? Support me on Patreon and get direct help from me via a private Slack channel or 1-2-1 mentoring.
Subscribe to my newsletter


Hi, I’m Yan. I’m an AWS Serverless Hero and I help companies go faster for less by adopting serverless technologies successfully.

Are you struggling with serverless or need guidance on best practices? Do you want someone to review your architecture and help you avoid costly mistakes down the line? Whatever the case, I’m here to help.

Hire me.


Skill up your serverless game with this hands-on workshop.

My 4-week Production-Ready Serverless online workshop is back!

This course takes you through building a production-ready serverless web application from testing, deployment, security, all the way through to observability. The motivation for this course is to give you hands-on experience building something with serverless technologies while giving you a broader view of the challenges you will face as the architecture matures and expands.

We will start at the basics and give you a firm introduction to Lambda and all the relevant concepts and service features (including the latest announcements in 2020). And then gradually ramping up and cover a wide array of topics such as API security, testing strategies, CI/CD, secret management, and operational best practices for monitoring and troubleshooting.

If you enrol now you can also get 15% OFF with the promo code “yanprs15”.

Enrol now and SAVE 15%.


Check out my new podcast Real-World Serverless where I talk with engineers who are building amazing things with serverless technologies and discuss the real-world use cases and challenges they face. If you’re interested in what people are actually doing with serverless and what it’s really like to be working with serverless day-to-day, then this is the podcast for you.


Check out my new course, Learn you some Lambda best practice for great good! In this course, you will learn best practices for working with AWS Lambda in terms of performance, cost, security, scalability, resilience and observability. We will also cover latest features from re:Invent 2019 such as Provisioned Concurrency and Lambda Destinations. Enrol now and start learning!


Check out my video course, Complete Guide to AWS Step Functions. In this course, we’ll cover everything you need to know to use AWS Step Functions service effectively. There is something for everyone from beginners to more advanced users looking for design patterns and best practices. Enrol now and start learning!