CSCI 4150: Introduction to Artificial Intelligence, Fall 2004

Assignment 6 information

Errata & announcements

Support code

Data sets

Note that the last three data sets actually define different data sets for that domain. To properly gauge the accuracy of the decision tree, you should learn the decision tree using one data set and test it using a different data set.

Problem 6 Writeup

There are several parts to the writeup for Problem 6:
  1. Run your learn-dtree procedure using the mushroom data set. Learn a decision tree using mushroom-data1 and test it using mushroom-data2. How well does your decision tree do on this data set? (Report the number and percent of examples correctly classified when you test your decision tree.)

  2. Run your missing-learn-dtree procedure using the voting data set. Learn a decision tree using voting-data1 and test it using voting-data2. Report the number and percent of examples correctly classified by your decision tree.

  3. Take the class survey data set, and turn it into training data with "Color" as the goal predicate. Then use your create-binary-discretization procedure to discretize the continuous-valued attributes of the data set. Report the threshold used for discretizing each attribute. See the Discretization section below for some hints on doing this.

    Learn a decision tree from the discretized version of the data set using your learn-dtree procedure. Include the resulting decision tree in your writeup.

  4. Describe how your create-multi-discretization procedure chooses the two thresholds it returns.

  5. Use your create-multi-discretization procedure to do the same as above (for your create-binary-discretization procedure in part 3) on the class survey data set.

  6. Use other attributes (besides "Color") from the class survey data set as goal predicates, discretize the continuous-valued attributes (using either binary- or multi- discretization), and learn a decision tree. Find some interesting results on how to predict certain attributes (as a goal predicate) from other attribute values. Be sure to report the thresholds used in your discretizations and show the resulting decision tree you learned.

Notes on the assignment

Documentation & other information

Basic support code

These are procedures that are described in the assignment handout, perhaps with more examples.

Weighted training data

To handle training data with missing attribute values, we will split an example into "fractional parts", in proportion to the distribution of examples with that attribute value.

The support code will allow you to weight training examples to accomplish this. Whereas a regular training example is of the form:

    (Yes (Own  Late    Low    Bad))
A weighted training example has a number inserted as the first element of this list:
    (1 Yes (Own  Late    Low    Bad))
The support code procedures split-tdata and tally-tdata will handle weighted training data.
(define weighted-loan-data-small
  '((0.5 No  (Own  Late    High   Good))
    (0.3 Yes (Own  On-time Medium Good))
    (0.3 No  (Own  On-time High   Good))
    (1   No  (Own  On-time High   Good))))

(split-tdata weighted-loan-data-small loan-names 'bills)
;Value: ((on-time ((1 no (own on-time high good))
;                  (.3 no (own on-time high good))
;                  (.3 yes (own on-time medium good))))
;        (late ((.5 no (own late high good)))))

(tally-tdata weighted-loan-data-small)
;Value: ((yes .3) (no 1.8))
Please note that all training data must be the same, i.e. all unweighted training examples or all weighted training examples.

Useful built-in Scheme procedures

There are a number of useful procedures, some part of standard Scheme, some extensions that are specific to MIT/GNU Scheme.

Discretization

For problems 4 and 5, you are to write the procedures:
(create-binary-discretization c-attribute training-data attribute-names)
and
(create-multi-discretization c-attribute training-data attribute-names)
which will decide upon threshold value(s) for the continuous valued attribute c-attribute.

For example, suppose we have the following data set:

(define continuous-names
  '(age weight height))
(define continuous-tdata
  '((republican (23 135 177))
    (democrat   (42 225 190))
    (democrat   (67 170 150))
    (republican (18 155 187))
    (democrat   (32 190 195))))
I wrote a "fake" version of create-binary-discretization that doesn't do the proper calculation, but it does return what it should — a number. For example:
(create-binary-discretization 'age continuous-tdata continuous-names)
For the attribute "age", max=67, min=18, so threshold is 42.5
;Value: 42.5
I have my procedure print out a little information about the threshold, as you can see.

The create-multi-discretization procedure is similar, but it must return two thresholds so that the continuous-valued attribute is turned into a discrete attribute with three values. These two thresholds must be returned in a list of two numbers with the lower number first.

For example, my "fake" version of this procedure does the following:

(create-multi-discretization 'height continuous-tdata continuous-names)
For the attribute "height", max=195, min=150, so thresholds are 165. and 180.
;Value: (165. 180.)
Again, I have my procedure print out a little information on the threshold.


Now, you usually don't want to just discretize one continuous-valued attribute — you want to discretize all of them! I have therefore provided the procedure discretize-tdata (available in the new version of a6code.com) which will apply the create-X-discretization procedure to all the continuous valued attributes. Of course, you have to specify which attributes are continuous-valued, and you must also specify what symbols you want to use for the discrete values.

For example:

(discretize-tdata create-binary-discretization
                  '(weight age height) 
                  '((light heavy)   ; symbols for weight
                    (young old)     ; symbols for age
                    (short tall))   ; symbols for height
                  continuous-tdata
                  continuous-names)
For the attribute "weight", max=225, min=135, so threshold is 180.
For the attribute "age", max=67, min=18, so threshold is 42.5
For the attribute "height", max=195, min=150, so threshold is 172.5
;Value: ((republican (young light tall))
;        (democrat   (young heavy tall))
;        (democrat   (old   light short))
;        (republican (young light tall))
;        (democrat   (young heavy tall)))
And:
(discretize-tdata create-multi-discretization
                  '(weight age height) 
                  '((light medium heavy)    ; symbols for weight
                    (young middle-aged old) ; symbols for age
                    (short average tall))   ; symbols for height
                  continuous-tdata
                  continuous-names)
For the attribute "weight", max=225, min=135, so thresholds are 165. and 195.
For the attribute "age", max=67, min=18, so thresholds are 34.3 and 50.7.
For the attribute "height", max=195, min=150, so thresholds are 165. and 180.
;Value: ((republican (young       light  average))
;        (democrat   (middle-aged heavy  tall))
;        (democrat   (old         medium short))
;        (republican (young       light  tall))
;        (democrat   (young       medium tall)))


Here are the "fake" discretization procedures that I used above:

(define (create-binary-discretization c-attribute 
                                      training-data attribute-names)
  ; here's my fake example --- it finds the max and the min values
  ; and takes the average.
  (let* ((data-pos (- (length attribute-names)
                      (length (member c-attribute attribute-names))))
         (attr-values (map (lambda (x)
                             (list-ref (second x) data-pos))
                           training-data))
         (minval (apply min attr-values))
         (maxval (apply max attr-values))
         (t (/ (+ minval maxval) 2.0)))
    (print "For the attribute \"" c-attribute "\", max=" maxval
           ", min=" minval ", so threshold is " t "\n")
    t))

(define (create-multi-discretization c-attribute 
                                     training-data attribute-names)
  ; here's my fake example --- it finds the max and the min values
  ; and picks thresholds 1/3 and 2/3 in between
  (let* ((data-pos (- (length attribute-names)
                      (length (member c-attribute attribute-names))))
         (attr-values (map (lambda (x)
                             (list-ref (second x) data-pos))
                           training-data))
         (minval (apply min attr-values))
         (maxval (apply max attr-values))
         (diff (- maxval minval))
         (t1 (+ (/ diff 3.0) minval))
         (t2 (+ (/ diff 3.0) t1)))
    (print "For the attribute \"" c-attribute "\", max=" maxval
           ", min=" minval ", so thresholds are " t1 " and " t2 "\n")
    (list t1 t2)))

Submit your code to the web testers

This assignment is set up with three separate web testers:

The Policy for Electronic Submission is the same as for previous assignments. The syntax-checker is the same as for the last assignment.