Exercises in Programming Style – Map Reduce

NOTE: read the rest of the series, or check out the source code.

If you enjoy reading these exercises then please buy Crista's book to support her work.

[Image: exercises-prog-styles-cover – book cover]

Following on from the last post, we will look at the Map Reduce style today.


Style 30 – Map Reduce

Constraints

  • Input data is divided into chunks, similar to what an inverse multiplexer does to input signals.
  • A map function applies a given worker function to each chunk of data, potentially in parallel.
  • A reduce function takes the results of the many worker functions and recombines them into a coherent output.


Map reduce should be a familiar paradigm to most developers nowadays. In today's style we'll use a very simple map-reduce flow to solve the term frequency problem.

[Diagram: Style30-flow-3 – the map-reduce flow for the term frequency problem]


As you can see from the diagram above, we'll take an input text and split it into a number of chunks that can be processed in parallel. To do this, we'll first add a partition function below.

[Code screenshot: Style30-01 – the partition function]
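
Roughly speaking, such a partition function might look something like this (the exact signature and the use of Array.chunkBySize are assumptions on my part):

```fsharp
// A sketch of a partition function: split the input text into chunks of at
// most nLines lines each, so that the chunks can be processed independently.
// (The signature and the use of Array.chunkBySize are assumptions.)
let partition (text : string) nLines =
    text.Split('\n')
    |> Array.chunkBySize nLines
    |> Array.map (String.concat "\n")
```

Anything that yields roughly even chunks would do here; the point is simply that each chunk can be handed off to a worker independently.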

Each chunk of the original input text can then be split and filtered (using the list of stop words) independently.

Two things to note about the splitWords function below:

a) it returns an array of (word, count) tuples

b) it doesn't perform any local aggregation, so if the same word appears twice in its chunk then it'll appear as (word, 1) twice in the returned array

[Code screenshot: Style30-02 – the splitWords function]
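
As a rough sketch, a splitWords with this behaviour might look something like the following (the tokenization details and the Set<string> of stop words are assumptions):

```fsharp
// A sketch of splitWords: split a chunk into words, filter out stop words,
// and pair every remaining word with a count of 1 (no local aggregation).
// stopWords is assumed to be a Set<string> loaded elsewhere.
let splitWords (stopWords : Set<string>) (chunk : string) =
    chunk.Split([| ' '; '\n'; '\r'; '\t'; ','; '.'; ';'; ':'; '!'; '?' |],
                System.StringSplitOptions.RemoveEmptyEntries)
    |> Array.map (fun word -> word.ToLower())
    |> Array.filter (fun word -> not (stopWords.Contains word))
    |> Array.map (fun word -> word, 1)
```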

Once all the chunks have been processed, we can reduce over them with the countWords function below.

[Code screenshot: Style30-03 – the countWords function]
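
As a sketch, an imperative countWords could fold the tuples into a mutable dictionary, something along these lines (the exact signature is an assumption):

```fsharp
open System.Collections.Generic

// A sketch of an imperative countWords: fold the (word, count) tuples from
// every chunk into a mutable dictionary of running totals.
let countWords (wordCounts : (string * int) [] []) =
    let counts = Dictionary<string, int>()
    for chunk in wordCounts do
        for (word, count) in chunk do
            if counts.ContainsKey word
            then counts.[word] <- counts.[word] + count
            else counts.[word] <- count
    counts |> Seq.map (fun kvp -> kvp.Key, kvp.Value) |> Seq.toArray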

Here, I have taken an imperative approach for efficiency's sake, and because that's what Crista had done in her Python solution. Alternatively, you could have written something along the lines of the following.

[Code screenshot: Style30-03v2 – the functional alternative]
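
For example, something roughly like this (a sketch, not the exact code):

```fsharp
// A more functional countWords: flatten all the chunks, group by word, then
// sum the counts within each group.
let countWords (wordCounts : (string * int) [] []) =
    wordCounts
    |> Array.collect id
    |> Array.groupBy fst
    |> Array.map (fun (word, pairs) -> word, Array.sumBy snd pairs)
```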

which is much more concise and functional in style, but about 25% slower (~100ms) than the imperative approach.

Next, we'll add a sort function that'll sort the array of (word, count) values in descending order by the count (i.e. the snd element in the tuple).

[Code screenshot: Style30-04 – the sort function]
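
For example, a sketch of such a sort function:

```fsharp
// A sketch of the sort step: order the (word, count) pairs by count, descending.
let sortWords (wordCounts : (string * int) []) =
    wordCounts |> Array.sortByDescending snd
```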

And finally, we wire everything together using pipes.

[Code screenshot: Style30-05 – the full pipeline]
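
Roughly, the wiring might look something like this (the file names, the chunk size of 200 lines and the top-25 cut-off are assumptions on my part):

```fsharp
// A sketch of the full pipeline. The file names, the chunk size (200 lines)
// and the top-25 cut-off are assumptions.
let stopWords =
    (System.IO.File.ReadAllText "stop_words.txt").Split(',') |> Set.ofArray

System.IO.File.ReadAllText "pride-and-prejudice.txt"
|> fun text -> partition text 200
|> Array.map (splitWords stopWords)
|> countWords
|> sortWords
|> Array.truncate 25
|> Array.iter (fun (word, count) -> printfn "%s - %d" word count)
```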

Job done!


Or is it?

So far I have followed Crista's approach from her Python solution, and the solution meets the constraints of this style. But we haven't taken advantage of the potential for parallelism here.


Version 2 – Parallel Processing

To make the processing of the chunks happen in parallel, let's modify the splitWords function to make it async.

[Code screenshot: Style30-06 – the async splitWords]
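
As a sketch (named splitWordsAsync here just to keep the two versions apart), it might look something like this:

```fsharp
// A sketch of the async version: same splitting and filtering as before, now
// inside an async workflow, plus local aggregation via Array.countBy so each
// word appears only once per chunk.
let splitWordsAsync (stopWords : Set<string>) (chunk : string) = async {
    return
        chunk.Split([| ' '; '\n'; '\r'; '\t'; ','; '.'; ';'; ':'; '!'; '?' |],
                    System.StringSplitOptions.RemoveEmptyEntries)
        |> Array.map (fun word -> word.ToLower())
        |> Array.filter (fun word -> not (stopWords.Contains word))
        |> Array.countBy id
}
```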

Notice that not much has changed from our non-async version, except for the minor optimization of aggregating the words within the chunk instead of returning duplicated (word, 1) tuples.

The next thing we need to do is to wrap the original pipeline in an async workflow.

Here, once the partitions have been created, we'll process them in parallel and wait for all the parallel processing to complete using Async.Parallel, before proceeding with the rest of the pipeline as before.

[Code screenshot: Style30-07 – the async pipeline]
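
Putting it all together, the async version of the pipeline might look roughly like this (names and chunk size are assumptions, as before):

```fsharp
// A sketch of the async pipeline: process the chunks in parallel with
// Async.Parallel, then reduce, sort and print as before.
let termFrequency path = async {
    let text = System.IO.File.ReadAllText path
    let! wordCounts =
        partition text 200
        |> Array.map (splitWordsAsync stopWords)
        |> Async.Parallel
    return
        wordCounts
        |> countWords
        |> sortWords
        |> Array.truncate 25
}

termFrequency "pride-and-prejudice.txt"
|> Async.RunSynchronously
|> Array.iter (fun (word, count) -> printfn "%s - %d" word count)
```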


You can find the source code for this exercise here (v1) and here (v2, async).