Assignment 6 information

Assignment 6 information
Notes on the assignment
Documentation & other information
Discretization
Submit Assignment 6 to the web testers
Handling missing attributes example (updated 11/18 5pm)

Assignment 6 information

Errata & announcements

[11/19] Web testers are up!
[11/19] New version of a6code.com released as described below.
[11/18] I added an item at the bottom of the "Handling missing attributes example" that describes a special case you need to deal with in your missing-learn-dtree procedure. It describes two specific modifications you can make to handle this case.
[11/18] Note that in the class survey data set, there are some missing values for continuous-valued attributes. Your create-X-discretization procedures should ignore these values. (I will need to fix the support code procedure discretize-tdata to handle this properly.)
[11/18] The discretize-tdata procedure uses the < (less than) comparison to compare a value to the threshold. Therefore, if the division should be between integer values 5 and 8, you should not return a threshold of 5 because it will improperly classify examples with value 5.
[11/18] For the create-multi-discretization, you should come up with thresholds that do better than the single threshold for create-binary-discretization.
In problem 5, the assignment should read "...discretizes a single continuous attribute into an attribute with three values...". It should not give you the option of discretizing it into more than three values.
I am making a change to problems 4 and 5 that will actually make the problems somewhat easier for you.
Instead of returning a procedure, they will just return threshold values. create-binary-discretization should return a number, and create-multi-discretization should return a list of two numbers (with the lower threshold first).
One other change: these procedures will no longer take a values argument. I have changed the assign6.scm stubs file above to reflect this.
See the Discretization section for more details and examples.

Support code

Stubs file: assign6.scm (revised 11/13 b/c of change described above)
Support code: a6code.com (version 1.4.2, released 11/19) This version corrects the discretize-tdata procedure so it will not change continuous-valued attributes that are missing. It also now prints information on the threshold(s) that your procedure selects.
version 1.4.1, released 11/13: This version includes the procedure discretize-tdata, described in the discretization section below. There are no other changes to the support code.

Data sets

Example data sets: a6data-1.scm
Class survey data set: survey.scm
Mushroom data set: mushroom-data.scm
Voting data set: voting-data.scm
Breast cancer data set bc-data.scm

Note that the last three data sets actually define different data sets for that domain. To properly gauge the accuracy of the decision tree, you should learn the decision tree using one data set and test it using a different data set.

Problem 6 Writeup

There are several parts to the writeup for Problem 6:

Run your learn-dtree procedure using the mushroom data set. Learn a decision tree using mushroom-data1 and test it using mushroom-data2. How well does your decision tree do on this data set? (Report the number and percent of examples correctly classified when you test your decision tree.)
Run your missing-learn-dtree procedure using the voting data set. Learn a decision tree using voting-data1 and test it using voting-data2. Report the number and percent of examples correctly classified by your decision tree.
Take the class survey data set, and turn it into training data with "Color" as the goal predicate. Then use your create-binary-discretization procedure to discretize the continuous-valued attributes of the data set. Report the threshold used for discretizing each attribute. See the Discretization section below for some hints on doing this.
Learn a decision tree from the discretized version of the data set using your learn-dtree procedure. Include the resulting decision tree in your writeup.
Describe how your create-multi-discretization procedure chooses the two thresholds it returns.
Use your create-multi-discretization procedure to do the same as above (for your create-binary-discretization procedure in part 3) on the class survey data set.
Use other attributes (besides "Color") from the class survey data set as goal predicates, discretize the continuous-valued attributes (using either binary- or multi- discretization), and learn a decision tree. Find some interesting results on how to predict certain attributes (as a goal predicate) from other attribute values. Be sure to report the thresholds used in your discretizations and show the resulting decision tree you learned.

Notes on the assignment

Decision trees are covered in Chapter 18 of our text
Missing values will always be indicated with a ?.
To handle examples with a missing attribute value, we will "split" those examples among all the attribute values in the data at that node. The example I gave in class was something like the following. Initially (at the top level) all examples have a weight of 1. Suppose we have the following data set:
```
Weight | Color | Goal
---------------------
  1    | Red   | No
  1    | Green | No
  1    | ?     | Yes
  1    | Green | Yes
  1    | Blue  | Yes
```
Note that the color "red" is 1/4 of the attribute values, "green" is 1/2, and "blue" is 1/4.
If we were to split on the attribute "color", then the examples recursively propogated to the three subtrees are (respectively):
```
Weight | Color | Goal     Weight | Color | Goal     Weight | Color | Goal   
---------------------     ---------------------     ---------------------   
  1    | Red   | No         1    | Green | No         1    | Blue  | Yes    
  0.25 | ?     | Yes        0.5  | ?     | Yes        0.25 | ?     | Yes    
                            1    | Green | Yes        
```
For discretizing a continuous-valued attribute, the approach we discussed in class was to sort the training data by that attribute. Setting a threshold in a place where the goal classification changes will maximize the information gain from that attribute. You should pick one of those dividing lines; however, there is nothing to do but to try all those candidate thresholds.
For Problem 5, I am asking you to come up with your own way of discretizing an attribute into three discrete values instead of two.

Documentation & other information

Basic support code — also documented in the assignment handout
Weighted training example — you'll need these for problem 3
Useful built-in Scheme procedures

Basic support code

These are procedures that are described in the assignment handout, perhaps with more examples.

(split-tdata training-data attribute-names attribute)

Divide training data into groups according to the specified attribute. For example:

> loan-names
;Value: (house bills income credit)

> loan-data-small
;Value: ((no  (own late high good))
;        (yes (own on-time medium good))
;        (no  (own on-time high good))
;        (no  (own on-time high good)))

> (split-tdata loan-data-small loan-names 'bills)
;Value: ((on-time ((no (own on-time high good))
;                  (no (own on-time high good))
;                  (yes (own on-time medium good)))) 
;        (late    ((no (own late high good)))))

The order of the attribute values is not guaranteed.

Note that this procedure will also work with weighted training data.

(tally-tdata training-data)
Given a list of training data, it counts how many occurrences there are for each goal predicate value. For example:
```
> (tally-tdata loan-data-small)
Value: ((yes 1) (no 3))
```
The order of the goal predicate values is not guaranteed.
Note that this procedure will also work with weighted training data

(classify example decision-tree attribute-names default-value)

Returns a classification of the example using the given decision-tree. If the example has an attribute value not found in the decision tree, the default-value is returned.

For example:

(define loan-dtree-example
  '(income (high yes)
           (low (house (rent  no)
                       (own yes)))))

> (classify '(rent late high good) loan-dtree-example loan-names 'No)
;Value: yes

> (classify '(rent late medium good) loan-dtree-example loan-names 'No)
;Value: no

(test-dtree decision-tree training-data attribute-names)

Takes a training data set, strips the goal attribute out to create a list of examples which it then classifies using the given decision tree. The classifications returned by the decision tree are compared against the given goal attribute for each example.

For example:

> (test-dtree loan-dtree-example loan-data-small loan-names)
The example: (own late high good)
was classified as: yes
WRONG: the correct classification is no.

The example: (own on-time medium good)
was classified as: default-answer
WRONG: the correct classification is yes.

The example: (own on-time high good)
was classified as: yes
WRONG: the correct classification is no.

The example: (own on-time high good)
was classified as: yes
WRONG: the correct classification is no.


Out of 4 examples, 0 were correctly classified.

(log2 x)
Takes the logarithm of x to base 2. Signals an error if you try to take the logarithm of a nonpositive number.
For example:
```
(log2 8)
;Value: 3.
```
(print . args)
Takes zero or more arguments and prints them to the screen.

Weighted training data

To handle training data with missing attribute values, we will split an example into "fractional parts", in proportion to the distribution of examples with that attribute value.

The support code will allow you to weight training examples to accomplish this. Whereas a regular training example is of the form:

    (Yes (Own  Late    Low    Bad))

A weighted training example has a number inserted as the first element of this list:

    (1 Yes (Own  Late    Low    Bad))

The support code procedures split-tdata and tally-tdata will handle weighted training data.

(define weighted-loan-data-small
  '((0.5 No  (Own  Late    High   Good))
    (0.3 Yes (Own  On-time Medium Good))
    (0.3 No  (Own  On-time High   Good))
    (1   No  (Own  On-time High   Good))))

(split-tdata weighted-loan-data-small loan-names 'bills)
;Value: ((on-time ((1 no (own on-time high good))
;                  (.3 no (own on-time high good))
;                  (.3 yes (own on-time medium good))))
;        (late ((.5 no (own late high good)))))

(tally-tdata weighted-loan-data-small)
;Value: ((yes .3) (no 1.8))

Please note that all training data must be the same, i.e. all unweighted training examples or all weighted training examples.

Useful built-in Scheme procedures

There are a number of useful procedures, some part of standard Scheme, some extensions that are specific to MIT/GNU Scheme.

(sort Lst proc)
Sorts the given list. The second argument must be a procedure of two arguments that returns true if the first argument should come before the second argument. This procedure should not return #t if the arguments are "equal". (See the MIT/GNU Scheme reference manual (Section 7.9) for more information.)
For example:
```
(sort '(43 8 2 88 9) <)
;Value: (2 8 9 43 88)
```
(assoc key a-list)
The second argument a-list must be an association list. This is a list where each element is an association. Each association is a list where the first element is the key. The assoc procedure takes the given key and finds the association in a-list that matches (as per equal?) the key.
Note that the splits returned by split-tdata have the form of an associaton list.
For example:
```
(assoc 'red '((blue 7) (red 6) (green 5)))
;Value: (red 6)
```
(delete el Lst)
Deletes all copies of the given element el from the list lst. The equal? predicate is used to determine if the element matches el. Actually, this doesn't really delete elements from the list — it makes a copy of the list without any element equal? to el.
For example:
```
(delete 4 '(1 3 5 4 2.5 4.0 9 4))
;Value: (1 3 5 2.5 4. 9)
```
Note that the value 4.0 was not removed because it is not equal? to 4.

Discretization

For problems 4 and 5, you are to write the procedures:

(create-binary-discretization c-attribute training-data attribute-names)

and

(create-multi-discretization c-attribute training-data attribute-names)

which will decide upon threshold value(s) for the continuous valued attribute c-attribute.

For example, suppose we have the following data set:

(define continuous-names
  '(age weight height))
(define continuous-tdata
  '((republican (23 135 177))
    (democrat   (42 225 190))
    (democrat   (67 170 150))
    (republican (18 155 187))
    (democrat   (32 190 195))))

I wrote a "fake" version of create-binary-discretization that doesn't do the proper calculation, but it does return what it should — a number. For example:

(create-binary-discretization 'age continuous-tdata continuous-names)
For the attribute "age", max=67, min=18, so threshold is 42.5
;Value: 42.5

I have my procedure print out a little information about the threshold, as you can see.

The create-multi-discretization procedure is similar, but it must return two thresholds so that the continuous-valued attribute is turned into a discrete attribute with three values. These two thresholds must be returned in a list of two numbers with the lower number first.

For example, my "fake" version of this procedure does the following:

(create-multi-discretization 'height continuous-tdata continuous-names)
For the attribute "height", max=195, min=150, so thresholds are 165. and 180.
;Value: (165. 180.)

Again, I have my procedure print out a little information on the threshold.

Now, you usually don't want to just discretize one continuous-valued attribute — you want to discretize all of them! I have therefore provided the procedure discretize-tdata (available in the new version of a6code.com) which will apply the create-X-discretization procedure to all the continuous valued attributes. Of course, you have to specify which attributes are continuous-valued, and you must also specify what symbols you want to use for the discrete values.

For example:

(discretize-tdata create-binary-discretization
                  '(weight age height) 
                  '((light heavy)   ; symbols for weight
                    (young old)     ; symbols for age
                    (short tall))   ; symbols for height
                  continuous-tdata
                  continuous-names)
For the attribute "weight", max=225, min=135, so threshold is 180.
For the attribute "age", max=67, min=18, so threshold is 42.5
For the attribute "height", max=195, min=150, so threshold is 172.5
;Value: ((republican (young light tall))
;        (democrat   (young heavy tall))
;        (democrat   (old   light short))
;        (republican (young light tall))
;        (democrat   (young heavy tall)))

And:

(discretize-tdata create-multi-discretization
                  '(weight age height) 
                  '((light medium heavy)    ; symbols for weight
                    (young middle-aged old) ; symbols for age
                    (short average tall))   ; symbols for height
                  continuous-tdata
                  continuous-names)
For the attribute "weight", max=225, min=135, so thresholds are 165. and 195.
For the attribute "age", max=67, min=18, so thresholds are 34.3 and 50.7.
For the attribute "height", max=195, min=150, so thresholds are 165. and 180.
;Value: ((republican (young       light  average))
;        (democrat   (middle-aged heavy  tall))
;        (democrat   (old         medium short))
;        (republican (young       light  tall))
;        (democrat   (young       medium tall)))

Here are the "fake" discretization procedures that I used above:

(define (create-binary-discretization c-attribute 
                                      training-data attribute-names)
  ; here's my fake example --- it finds the max and the min values
  ; and takes the average.
  (let* ((data-pos (- (length attribute-names)
                      (length (member c-attribute attribute-names))))
         (attr-values (map (lambda (x)
                             (list-ref (second x) data-pos))
                           training-data))
         (minval (apply min attr-values))
         (maxval (apply max attr-values))
         (t (/ (+ minval maxval) 2.0)))
    (print "For the attribute \"" c-attribute "\", max=" maxval
           ", min=" minval ", so threshold is " t "\n")
    t))

(define (create-multi-discretization c-attribute 
                                     training-data attribute-names)
  ; here's my fake example --- it finds the max and the min values
  ; and picks thresholds 1/3 and 2/3 in between
  (let* ((data-pos (- (length attribute-names)
                      (length (member c-attribute attribute-names))))
         (attr-values (map (lambda (x)
                             (list-ref (second x) data-pos))
                           training-data))
         (minval (apply min attr-values))
         (maxval (apply max attr-values))
         (diff (- maxval minval))
         (t1 (+ (/ diff 3.0) minval))
         (t2 (+ (/ diff 3.0) t1)))
    (print "For the attribute \"" c-attribute "\", max=" maxval
           ", min=" minval ", so thresholds are " t1 " and " t2 "\n")
    (list t1 t2)))

Submit your code to the web testers

This assignment is set up with three separate web testers:

Problem 2 (learn-dtree) Web-tester
Problem 3 (missing-learn-dtree) Web-tester
Problem 4 (create-binary-discretization) Web-tester
This web tester runs a number of tests on your binary-discretization procedure, some with and some without missing continuous-valued attributes.
Problem 5 (create-multi-discretization) Web-tester
This web tester only runs tests for 10 (out of 20) points for this problem. The remaining points will be assigned after we read your writeup on what you did for this procedure. It makes sure that the thresholds you return actually discretize the attribute into three values (not just one or two), and it tests whether your discretization provides at least as much information gain as a binary discretization.

The Policy for Electronic Submission is the same as for previous assignments. The syntax-checker is the same as for the last assignment.