This assignment is to be done either individually or in pairs. Do not show your code to any other group and do not look at any other group's code. Do not put your code in a public directory or otherwise make it public. However, you may get help from the TAs or the instructor. You are encouraged to use the RPILMS Discussions page to post problems so that other students can also answer/see the answers.
MapReduce is a programming model for processing large data sets using a large number of computers (nodes). It has two steps:
Map: The master node divides the problem into smaller sub-problems and distributes them to worker nodes. A worker node processes the smaller problem and passes the answer back to the master node.
Reduce: The master node collects the answers to all the sub-problems and combines them in some way to form the output - the answer to the original problem.
The user specifies a map function that takes a (key, value) pair to generate zero or more intermediate (key, value) pairs, and a reduce function that takes all intermediate values associated with the same key to generate zero or more outputs:
map(key1, value1) -> list(key2, value2)
reduce(key2, list(value2)) -> list(value3)
MapReduce is useful in a wide range of applications and has been adapted to several computing environments. At Google, MapReduce was used to completely regenerate Google's index of the World Wide Web. Read http://en.wikipedia.org/wiki/MapReduce and http://wiki.apache.org/hadoop/MapReduce for more information. If you are interested, read http://research.google.com/archive/mapreduce.html.
The assignment is to implement a simple sequential MapReduce engine in Oz and use it to solve several problems.
Your MapReduce engine is a function {MapReduce Map Compare Reduce Input}
, which has four arguments: Map
is the map function specified by the user, Compare
is a comparison function specified by the user, Reduce
is the reduce function specified by the user, and Input
is a list of (key, value) pairs in the form [[k1 v1] [k2 v2] [k3 v3] ...]
, so that a (key, value) pair is represented as a list [k v]
. It is also required that function {Map Key Value}
has two arguments and returns a list of (key, value) pairs in the form [[k1 v1] [k2 v2] [k3 v3] ...]
, function {Compare Key1 Key2}
has two arguments and returns true
if Key1 <= Key2
or false
if Key1 > Key2
, and function {Reduce Key Values}
has two arguments and returns a list of values. The functionality of MapReduce
can be summarized as follows:
Map
with each (key, value) pair in Input
.Map
calls.Compare
.
[[2 1] [1 5] [2 3] [3 4] [1 2]]
may become [[1 [5 2]] [2 [1 3]] [3 [4]]]
after sorting and grouping.Reduce
with each key and its associated list of values in the sorted order.Reduce
calls into a single list, as the return value of MapReduce
.Example 1. If Map
, Compare
and Reduce
are defined as:
declare fun {Map Key Value} [[Key Value]] end fun {Compare Key1 Key2} Key1=<Key2 end fun {Reduce Key Values} [[Key Values]] end
Then {MapReduce Map Compare Reduce [[2 a] [3 b] [1 c] [3 d] [1 a] [4 b] [2 c] [1 d]]}
returns [[1 [c a d]] [2 [a c]] [3 [b d]] [4 [b]]]
, a sorted list of keys and their associated lists of values.
Example 2. If Map
, Compare
and Reduce
are defined as:
declare fun {Map Key Value} case Value of [X] then [[X 1]] [] X|Xr then [X 1]|{Map Key Xr} end end fun {Compare Key1 Key2} Key1=<Key2 end fun {Reduce Key Values} case Values of [X] then [[Key X]] [] X|Xr then [[Key X+{Reduce Key Xr}.1.2.1]] end end
Then {MapReduce Map Compare Reduce [[a [3 5 2 1 4 6 5 2]] [b [2 5 2 3]]]}
returns [[1 1] [2 4] [3 2] [4 1] [5 3] [6 1]]
, the number of occurrences of each unique value in the lists a
and b
.
Below is the sonnet On His Blindness by Milton:
WHEN I consider how my light is spent E're half my days, in this dark world and wide, And that one Talent which is death to hide, Lodg'd with me useless, though my Soul more bent To serve therewith my Maker, and present My true account, least he returning chide, Doth God exact day-labour, light deny'd, I fondly ask; But patience to prevent That murmur, soon replies, God doth not need Either man's work or his own gifts, who best Bear his milde yoak, they serve him best, his State Is Kingly. Thousands at his bidding speed And post o're Land and Ocean without rest: They also serve who only stand and waite.
Which can be represented as a list of line number/string pairs:
declare Input = [[1 "WHEN I consider how my light is spent"] [2 " E're half my days, in this dark world and wide,"] [3 " And that one Talent which is death to hide,"] [4 " Lodg'd with me useless, though my Soul more bent"] [5 "To serve therewith my Maker, and present"] [6 " My true account, least he returning chide,"] [7 " Doth God exact day-labour, light deny'd,"] [8 " I fondly ask; But patience to prevent"] [9 "That murmur, soon replies, God doth not need"] [10 " Either man's work or his own gifts, who best"] [11 " Bear his milde yoak, they serve him best, his State"] [12 "Is Kingly. Thousands at his bidding speed"] [13 " And post o're Land and Ocean without rest:"] [14 " They also serve who only stand and waite."]]
Problem 1. One useful text processing function is to search for a substring. Write a set of Map
, Compare
and Reduce
functions to search for the words and associated line numbers containing a specified substring (without spaces). You can use a global variable to store the search string, for example, declare SearchString = "is"
. The return value of {MapReduce Map Compare Reduce Input}
is like [[is [1 3]] [his [10 11 11 12]] [this [2]]]
, with the words sorted by increasing length and atomized for better display (StringToAtom
). Note that punctuation marks (, ; . :) are not part of a word.
Problem 2. Various statistics are related to the readability of a document. Write a set of Map
, Compare
and Reduce
functions to calculate the number of characters (with spaces), characters per word, number of words, and words per line for the input. The return value of {MapReduce Map Compare Reduce Input}
is [['Characters' ??] ['Characters per Word' ??] ['Words' ??] ['Words per Line' ??]]
, where ??
is an integer or float. For the sonnet, the result is [['Characters' 617] ['Characters per Word' 4.2478] ['Words' 113] ['Words per Line' 8.0714]]
.
(a) Find a nontrivial problem suitable for the MapReduce model and solve it using the simple MapReduce engine.
(b) Implement a parallel MapReduce engine in Oz.
Received Time | Grade Modification |
before Monday, 03/04, 11:59PM | +10% |
before Tuesday, 03/05, 11:59PM | no modification (on time) |
before Wednesday, 03/06, 11:59PM | -10% |
before Friday, 03/08, 11:59PM | -25% |
after Saturday, 03/09, 12:00AM | not accepted |
Grading: The assignment will be graded mostly on correctness, but code clarity / readability will also be a factor (comment, comment, comment!). See the professor or TAs, if you have ideas for other extensions for this assignment and would like extra credit for implementing them.
Submission Requirements: Please submit a ZIP file with your code, including a README file. In the README file, place the names of each group member (up to two). Your README file should also have a list of specific features / bugs in your solution. Your ZIP file should be named with your RPILMS user name(s) as the filename, either userid1.zip or userid1_userid2.zip. Only submit one assignment per pair via RPILMS.