Exercises in Programming Style–Map Reduce

NOTE: read the rest of the series, or check out the source code.

If you enjoy reading these exercises then please buy Crista’s book to support her work.


Following on from the last post, we will look at the Map Reduce style today.


Style 30 – Map Reduce


  • Input data is divided in chunks, similar to what an inverse multiplexer does to input signals.
  • A map function applies a given worker function to each chunk of data, potentially in parallel.
  • A reduce function takes the results of the many worker functions and recombines them into a coherent output.


Map reduce should be a familiar paradigm to most developers nowadays. In today’s style we’ll use a very simple map-reduce flow to solve the term frequency problem.



As you can see from the above diagram, we’ll take an input text and split it into X chunks which can be processed in parallel. To do this, we’ll first add a partition function below.
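The original listing was an image; a minimal F# sketch of what such a partition function could look like (line-based chunking and the chunk size are my assumptions) is:

```fsharp
// Sketch only: partition the input text into chunks of at most
// chunkSize lines each, so the chunks can be processed independently.
let partition chunkSize (text : string) =
    text.Split('\n')
    |> Array.chunkBySize chunkSize
```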


Each chunk of the original input text can then be split and filtered (using the list of stop words) independently.

Two things to note about the splitWords function below:

a) it returns an array of (word, count) tuples

b) it doesn’t perform any local aggregation, so if the same word appears twice in its chunk then it’ll appear as (word, 1) twice in the returned array
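A sketch of what splitWords might look like, assuming a `stopWords : Set<string>` is in scope (the tokenisation details are my own, not necessarily the original’s):

```fsharp
// Sketch of splitWords: assumes a `stopWords : Set<string>` in scope.
// Splits each line into lower-case words, drops stop words, and pairs
// every occurrence with a count of 1 (no local aggregation).
let splitWords (lines : string[]) =
    lines
    |> Array.collect (fun line ->
        line.ToLower().Split([| ' '; '.'; ','; ';'; ':'; '!'; '?' |],
                             System.StringSplitOptions.RemoveEmptyEntries))
    |> Array.filter (stopWords.Contains >> not)
    |> Array.map (fun word -> word, 1)
```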


Once all the chunks have been processed, we can reduce over them with the countWords function below.
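The listing isn’t reproduced here; an imperative sketch along these lines would be:

```fsharp
open System.Collections.Generic

// Imperative sketch of countWords: merge the (word, count) tuples from
// all the chunks into one dictionary of running totals.
let countWords (chunks : (string * int)[][]) =
    let counts = Dictionary<string, int>()
    for chunk in chunks do
        for (word, n) in chunk do
            match counts.TryGetValue word with
            | true, total -> counts.[word] <- total + n
            | _           -> counts.[word] <- n
    [| for kvp in counts -> kvp.Key, kvp.Value |]
```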


Here, I have taken an imperative approach for efficiency’s sake, and because that’s what Crista had done in her Python solution. Alternatively, you could have written something along the lines of the following.
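One possible functional version (my own sketch, flattening the chunks and grouping by word):

```fsharp
// Functional sketch: flatten the chunks, group the tuples by word,
// then sum the counts within each group.
let countWords (chunks : (string * int)[][]) =
    chunks
    |> Array.collect id
    |> Array.groupBy fst
    |> Array.map (fun (word, pairs) -> word, Array.sumBy snd pairs)
```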


which is much more concise and functional in style, but about 25% slower (~100ms) than the imperative approach.

Next, we’ll add a sort function that’ll sort the array of (word, count) values in descending order by the count (i.e. the snd element in the tuple).
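Such a sort function might be as simple as:

```fsharp
// Sort the (word, count) pairs in descending order of count (snd).
let sort (counts : (string * int)[]) =
    counts |> Array.sortByDescending snd
```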


And finally, we wire everything together using pipes.
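The wiring could look roughly like this (the file name, chunk size and top-25 cut-off are illustrative assumptions on my part):

```fsharp
open System.IO

// Illustrative pipeline: read, partition, map, reduce, sort, print.
File.ReadAllText "input.txt"
|> partition 100
|> Array.map splitWords
|> countWords
|> sort
|> Array.truncate 25
|> Array.iter (fun (word, count) -> printfn "%s - %d" word count)
```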


Job done!


or is it?

So far, I have followed Crista’s approach in her Python solution, and the solution meets the constraints of this style. But we haven’t taken advantage of the potential for parallelism here.


Version 2 — Parallel Processing

To make the processing of the chunks happen in parallel, let’s modify the splitWords function to make it async.
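An async sketch of splitWords (again assuming a `stopWords : Set<string>` in scope; note the Array.countBy at the end, which aggregates duplicates locally):

```fsharp
// Async sketch of splitWords: same tokenisation as before, but wrapped
// in an async block, and with local aggregation via Array.countBy.
let splitWords (lines : string[]) = async {
    return
        lines
        |> Array.collect (fun line ->
            line.ToLower().Split([| ' '; '.'; ','; ';'; ':'; '!'; '?' |],
                                 System.StringSplitOptions.RemoveEmptyEntries))
        |> Array.filter (stopWords.Contains >> not)
        |> Array.countBy id
}
```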


Notice that not much has changed from our non-async version, except for one minor optimization: we aggregate the words within the chunk instead of returning duplicated (word, 1) tuples.

The next thing we need to do is to wrap the original pipeline in an async workflow.
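A sketch of that async pipeline (same illustrative file name and cut-off as before):

```fsharp
open System.IO

// Fan out the chunks with Async.Parallel, then reduce, sort and
// print as in the sequential version.
async {
    let! results =
        File.ReadAllText "input.txt"
        |> partition 100
        |> Array.map splitWords     // each chunk becomes an Async<_>
        |> Async.Parallel
    results
    |> countWords
    |> sort
    |> Array.truncate 25
    |> Array.iter (fun (word, count) -> printfn "%s - %d" word count)
}
|> Async.RunSynchronously
```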

Here, once the partitions have been created, we’ll process them in parallel and wait for all the parallel processing to complete using Async.Parallel before proceeding with the rest of the pipeline as before.



You can find the source code for this exercise here (v1) and here (v2 — async).