Exercises in Programming Style–Lazy Rivers

NOTE : read the rest of the series, or check out the source code.

If you enjoy read­ing these exer­cises then please buy Crista’s book to sup­port her work.

exercises-prog-styles-cover

Fol­low­ing on from the last post, we will look at the Lazy Rivers style today.

 

Style 27 – Lazy Rivers

Constraints

  • Data comes to func­tions in streams, rather than as a com­plete whole all at at once.
  • Func­tions are fil­ters / trans­form­ers from one kind of data stream to anoth­er.

 

Giv­en the con­straint that data is to come in as streams, the eas­i­est way to mod­el that in F# is using sequences.

First, let’s add a func­tion to read the text from an input file as a seq<char>.

Style27_01

Then, we’ll add anoth­er func­tion to trans­form this sequence of char­ac­ters into a sequence of words.

Style27_02

Next, we’ll fil­ter this sequence to remove all the stop words and return the non-stop words as anoth­er seq<string>.

Style27_03

So far everything’s pret­ty straight­for­ward, but things get a bit tricky from here on.

To count and sort the non-stop words, we can return the run­ning count as a sequence after each word, but that’s ter­ri­bly inef­fi­cient.

Instead, we can batch the input stream into groups of 5000 words. We’ll update the word fre­quen­cies with each batch and pro­duce a new sort­ed array of word counts for the out­put sequence.

Style27_04

In the snip­pet above, focus on the Seq.scan sec­tion (ps. Seq.scan is sim­i­lar to Seq.fold, except it returns all the inter­me­di­ate val­ues for the accu­mu­la­tor, not just the final accu­mu­lat­ed val­ue).

Here, giv­en the cur­rent word fre­quen­cies map and a batch of words, we’ll return a new map (remem­ber, a F# Map is immutable) whose counts have been updat­ed by the lat­est batch.

Because Seq.scan returns results for all inter­me­di­ate steps includ­ing the ini­tial val­ue (the emp­ty map in this case), we have to fol­low up with Seq.skip 1 to exclude the emp­ty map from our out­put.

Final­ly, to string every­thing togeth­er, we’ll print the top 25 words for each of the out­puts from coun­tAnd­Sort. It takes quite a few iter­a­tions, but you’ll see the result slow­ly emerg­ing in the process.

Style27_05

Unlike the oth­er styles, we’ll get a few sets of out­puts — one for each batch of words processed.

 

You can find the source code for this exer­cise here.