CSCI.4430/6430 Programming Languages Fall 2015
Programming Assignment #2

This assignment is to be done either individually or in pairs. Do not show your code to any other group and do not look at any other group's code. Do not put your code in a public directory or otherwise make it public. However, you may get help from the TAs or the instructor. You are encouraged to use the LMS Discussions page to post problems so that other students can also answer/see the answers.

Distributed Map Reduce Framework

The goal of this assignment is to implement a Map Reduce framework using the actor model. You will then use your Map Reduce framework to calculate document word frequencies. For this assignment, you must use an actor language such as SALSA or Erlang.

Hint: First solve the problem using local concurrency (all actors in the same machine), and when your concurrent program is working, then convert your program to run in a distributed environment (actors in different machines.)

The Map Reduce Framework

The Map Reduce framework expects a user-defined map and reduce function. The map function must take a key-value pair as its argument and must return a list of key-value pairs. The reduce function must take as its argument a pair whose first item is a key and second item a list of values. The reduce function must return a pair consisting of a key followed by a list of values.

The Map Reduce framework takes the user-defined map and reduce functions and goes through three phases: a Map phase, a Shuffle phase, and a Reduce phase. In the Map phase, an input file is split into multiple key-value pairs, which are given to Mapper actors who call the user-defined map function on the provided input. In the Shuffle phase, the Mappers sends the map results to Shuffler actors, whose job is to aggregate the given list of key-value pairs into a key along with a list of values that correspond to the key. Finally, in the Reduce phase, each Reducer actor is sent a key along with a list of values and performs reduce to generate another list of values. In the end, the reducer results are written to an output file.

Here is a figure illustrating a DNA sequence character counting problem solved with the Map Reduce framework:

Figure source: Developing Elastic Software for the Cloud, Encyclopedia on Cloud Computing. Wiley, 2015.

I/O Specification

For this assignment, we will assume that the input and output of the Map-Reduce framework are both text files. The input file contains a list of key-value pairs, each on its own line with the key and the value(s) separated with a single tab ('\t'). The output file is to contain a list of reduced results, each on its own line consisting of the key, then a tab, then the list of values separated with single tabs. See a sample input file and a sample output file for the illustrated DNA sequence character counting problem.

Document Word Frequency Calculation using Map Reduce

After implementing the Map Reduce framework, you will use it to solve the document word frequency calculation problem. The input file contains a list of key-value pairs, each of which contains a key that is the name of the document and values which are the words. Words exclusively consist of alphabetical characters. The output file should contain a list of word frequency lines in the following format:

[word][tab]([doc1],[wf1])[tab]([doc2],[wf2]) ...
that is, the word, followed by a tab ('\t'), followed by a list of document-word frequency pairs separated with single tabs, each pair is put in a pair of parentheses and has a comma in between. See a sample input and a sample output file.

Please clearly specify in your README file how to run your distributed framework, including how to specify map and reduce functions, the machines where actors are to run, and how to change the number of actors in your framework, if deviating from the language-specific instructions below.

Notes for SALSA Programmers

A startup package for the Map Reduce framework is given in pa2start.zip, which contains a sample usage of the framework (once it is implemented) for the sample DNA sequence character counting problem. Make sure your Map Reduce framework works with this problem before trying the word frequency problem.

Your concurrent program should be run in the following manner:

$ salsac mapreduce/*
$ salsa mapreduce.MapReduceProblem [input] [output] [mappers] [shufflers] [reducers]
and your distributed program should be run in the following manner:
$ salsac mapreduce/*
$ salsa mapreduce.MapReduceProblemD [input] [output] [mappers] [shufflers] [reducers] [theaters]
where input specifies an input file name, output specifies an output file name, mappers, shufflers, and reducers specify the number of actors to use for each map reduce phase, and theaters is a name server and theater description file. See a sample name server and theaters description file, the first line of which specifies the name server location and each of the remaining lines specifies a theater location.

salsac and salsa are UNIX aliases or Windows batch scripts that run java and javac with the expected arguments: See .cshrc for UNIX, and salsac.bat salsa.bat for Windows.

Time Saving Hints

  1. For reference, please see the SALSA webpage, including its FAQ. Read the tutorial and a comprehensive example illustrating distributed programming in SALSA.
  2. To run the distributed program, first, run the name server and the theaters:
    [host0:dir0]$ wwcns [port number 0]
    [host1:dir1]$ wwctheater [port number 1]
    [host2:dir2]$ wwctheater [port number 2]
    ...
    
    where wwcns and wwctheater are UNIX aliases or Windows batch scripts: See .cshrc for UNIX, and wwcns.bat wwctheater.bat for Windows.
  3. Make sure that the theaters are run where the actor behavior code is available, that is, the mapreduce directory should be visible in directories: host1:dir1 and host2:dir2. Then, run the distributed program as mentioned above.
  4. The theaters all cache actor behaviors. Restart all the theaters each time changes are made to the code.
  5. The module/behavior names in SALSA must match the directory/file hierarchical structure in the file system. e.g., the MapReduceProblem behavior should be in a relative path mapreduce/MapReduceProblem.salsa, and should start with the line module mapreduce;.
  6. Messaging is asynchronous. m1(...);m2(...); does not imply m1 occurs before m2.
  7. Notice that in the code m(...)@n(...);, n is processed after m is executed, but not necessarily after messages sent inside m are executed. For example, if inside m, messages m1 and m2 are sent, in general, n could happen before m1 and m2.
  8. (Named) tokens can only be used as arguments to messages.

Notes for Erlang Programmers

Your implementation must have a start function that takes the following arguments:

  1. Input Filename
  2. Output Filename
  3. Map function
  4. Reduce function
  5. Number of Mappers
  6. Number of Shufflers
  7. Number of Reducers
  8. List of Node names
This start function should initiate the Map Reduce process.

Your start function should be called, for example, as follows:

start("input_file.txt","output_file.txt",fun dna_char_count:map/1,
      fun dna_char_count:reduce/1,4,2,2,
      ['workplace1@127.0.0.1','workplace2@127.0.0.1','workplace3@127.0.0.1'])

In this example, dna_char_count is the problem-specific module that implements map and reduce and the 1 represents the number of input arguments for both map and reduce. Your problem-specific map and reduce functions should be implemented in a separate module from the map-reduce framework.

Below is a sample implementation of map and reduce for counting characters in a DNA sequence:
-module(dna_char_count).
-export([map/1,reduce/1]).
map({_,Value}) -> lists:map(fun(X) -> {X,1} end, Value).
reduce({Key,Values}) -> {Key,[lists:foldl(fun(V,Sum) -> Sum + V end, 0, Values)]}.

The above example can be found here.

Time Saving Hints

List of useful modules and functions:

When writing distributed Erlang code, you will need to create nodes with a shared cookie and establish communication between nodes. The shared cookie is just an atom that is used to restrict communication to nodes having the same cookie.net_kernel:connect_node/1 can be used to establish communication between nodes. It is possible to use spawn/4 to create new actors on a remote node.

To start a node named workplace1@127.0.0.1 with cookie mapreduce, you may use the command:

$ erl -noshell name workplace1@127.0.0.1 -setcookie mapreduce


Due Date: Thursday, 10/29, 7:00PM

Grading: The assignment will be graded mostly on correctness, but code clarity / readability will also be a factor (comment, comment, comment!).

Submission Requirements: Please submit a ZIP file with your code, including a README file. Your ZIP file should be named with your LMS user name(s) as the filename, either userid1.zip or userid1_userid2.zip. Only submit one assignment per pair via LMS. In the README file, place the names of each group member (up to two). Your README file should also have a list of specific features / bugs in your solution.