CSCI-4430/6969 Programming Languages Spring 2013
Programming Assignment #2

This assignment is to be done either individually or in pairs. Do not show your code to any other group and do not look at any other group's code. Do not put your code in a public directory or otherwise make it public. However, you may get help from the TAs or the instructor. You are encouraged to use the RPILMS Discussions page to post problems so that other students can also answer/see the answers.

MapReduce

MapReduce is a programming model for processing large data sets using a large number of computers (nodes). It has two steps:

  Map: The master node divides the problem into smaller sub-problems and distributes them to worker nodes. A worker node processes the smaller problem and passes the answer back to the master node.

  Reduce: The master node collects the answers to all the sub-problems and combines them in some way to form the output - the answer to the original problem.

The user specifies a map function that takes a (key, value) pair to generate zero or more intermediate (key, value) pairs, and a reduce function that takes all intermediate values associated with the same key to generate zero or more outputs:

  map(key1, value1) -> list(key2, value2)

  reduce(key2, list(value2)) -> list(value3)

MapReduce is useful in a wide range of applications and has been adapted to several computing environments. At Google, MapReduce was used to completely regenerate Google's index of the World Wide Web. Read http://en.wikipedia.org/wiki/MapReduce and http://wiki.apache.org/hadoop/MapReduce for more information. If you are interested, read http://research.google.com/archive/mapreduce.html.

The assignment is to implement a simple sequential MapReduce engine in Oz and use it to solve several problems.

Part 1 (60%). Write a simple MapReduce engine.

Your MapReduce engine is a function {MapReduce Map Compare Reduce Input}, which has four arguments: Map is the map function specified by the user, Compare is a comparison function specified by the user, Reduce is the reduce function specified by the user, and Input is a list of (key, value) pairs in the form [[k1 v1] [k2 v2] [k3 v3] ...], so that a (key, value) pair is represented as a list [k v]. It is also required that function {Map Key Value} has two arguments and returns a list of (key, value) pairs in the form [[k1 v1] [k2 v2] [k3 v3] ...], function {Compare Key1 Key2} has two arguments and returns true if Key1 <= Key2 or false if Key1 > Key2, and function {Reduce Key Values} has two arguments and returns a list of values. The functionality of MapReduce can be summarized as follows:

Example 1. If Map, Compare and Reduce are defined as:

declare
fun {Map Key Value}
   [[Key Value]]
end
fun {Compare Key1 Key2}
   Key1=<Key2
end
fun {Reduce Key Values}
   [[Key Values]]
end

Then {MapReduce Map Compare Reduce [[2 a] [3 b] [1 c] [3 d] [1 a] [4 b] [2 c] [1 d]]} returns [[1 [c a d]] [2 [a c]] [3 [b d]] [4 [b]]], a sorted list of keys and their associated lists of values.

Example 2. If Map, Compare and Reduce are defined as:

declare
fun {Map Key Value}
   case Value
   of [X] then [[X 1]]
   [] X|Xr then
      [X 1]|{Map Key Xr}
   end
end
fun {Compare Key1 Key2}
   Key1=<Key2
end
fun {Reduce Key Values}
   case Values
   of [X] then [[Key X]]
   [] X|Xr then
      [[Key X+{Reduce Key Xr}.1.2.1]]
   end
end

Then {MapReduce Map Compare Reduce [[a [3 5 2 1 4 6 5 2]] [b [2 5 2 3]]]} returns [[1 1] [2 4] [3 2] [4 1] [5 3] [6 1]], the number of occurrences of each unique value in the lists a and b.

Part 2 (40%). Solve problems using the MapReduce engine.

Below is the sonnet On His Blindness by Milton:

WHEN I consider how my light is spent
  E're half my days, in this dark world and wide,
  And that one Talent which is death to hide,
  Lodg'd with me useless, though my Soul more bent
To serve therewith my Maker, and present
  My true account, least he returning chide,
  Doth God exact day-labour, light deny'd,
  I fondly ask; But patience to prevent
That murmur, soon replies, God doth not need
  Either man's work or his own gifts, who best
  Bear his milde yoak, they serve him best, his State
Is Kingly. Thousands at his bidding speed
  And post o're Land and Ocean without rest:
  They also serve who only stand and waite.

Which can be represented as a list of line number/string pairs:

declare
Input = [[1 "WHEN I consider how my light is spent"]
         [2 "  E're half my days, in this dark world and wide,"]
         [3 "  And that one Talent which is death to hide,"]
         [4 "  Lodg'd with me useless, though my Soul more bent"]
         [5 "To serve therewith my Maker, and present"]
         [6 "  My true account, least he returning chide,"]
         [7 "  Doth God exact day-labour, light deny'd,"]
         [8 "  I fondly ask; But patience to prevent"]
         [9 "That murmur, soon replies, God doth not need"]
         [10 "  Either man's work or his own gifts, who best"]
         [11 "  Bear his milde yoak, they serve him best, his State"]
         [12 "Is Kingly. Thousands at his bidding speed"]
         [13 "  And post o're Land and Ocean without rest:"]
         [14 "  They also serve who only stand and waite."]]

Problem 1. One useful text processing function is to search for a substring. Write a set of Map, Compare and Reduce functions to search for the words and associated line numbers containing a specified substring (without spaces). You can use a global variable to store the search string, for example, declare SearchString = "is". The return value of {MapReduce Map Compare Reduce Input} is like [[is [1 3]] [his [10 11 11 12]] [this [2]]], with the words sorted by increasing length and atomized for better display (StringToAtom). Note that punctuation marks (, ; . :) are not part of a word.

Problem 2. Various statistics are related to the readability of a document. Write a set of Map, Compare and Reduce functions to calculate the number of characters (with spaces), characters per word, number of words, and words per line for the input. The return value of {MapReduce Map Compare Reduce Input} is [['Characters' ??] ['Characters per Word' ??] ['Words' ??] ['Words per Line' ??]], where ?? is an integer or float. For the sonnet, the result is [['Characters' 617] ['Characters per Word' 4.2478] ['Words' 113] ['Words per Line' 8.0714]].

Extra Credit (up to 25% bonus).

(a) Find a nontrivial problem suitable for the MapReduce model and solve it using the simple MapReduce engine.

(b) Implement a parallel MapReduce engine in Oz.

Due Date:

Received Time Grade Modification
before Monday, 03/04, 11:59PM +10%
before Tuesday, 03/05, 11:59PM no modification (on time)
before Wednesday, 03/06, 11:59PM -10%
before Friday, 03/08, 11:59PM -25%
after Saturday, 03/09, 12:00AM not accepted

Grading: The assignment will be graded mostly on correctness, but code clarity / readability will also be a factor (comment, comment, comment!). See the professor or TAs, if you have ideas for other extensions for this assignment and would like extra credit for implementing them.

Submission Requirements: Please submit a ZIP file with your code, including a README file. In the README file, place the names of each group member (up to two). Your README file should also have a list of specific features / bugs in your solution. Your ZIP file should be named with your RPILMS user name(s) as the filename, either userid1.zip or userid1_userid2.zip. Only submit one assignment per pair via RPILMS.