Assignment 6 information

Problem 2 information

Here's an example of how you would use the support code to help test your learn-dtree procedure:

(define dt (learn-dtree mushroom-data1 mushroom-names))
;Value: dt

dt
;Value: (odor (c p)
;             (p p)
;             ...)

; take the 10th example from the second data set
(define t (list-ref mushroom-data2 9))
;Value: t

(car t) ; the correct classification
;Value: p

(second t) ; the attribute values
;Value: (x s n t p f c n w e e s s w w p w o p n s g)

(classify (second t) dt mushroom-names 'attribute-not-found)
;Value: p

; it was right!

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; test the learned decision tree on the second data set
;
(test-dtree dt mushroom-data2 mushroom-names)
The example: (f s w t f f c b w t b f s w w p w o p h v u)
was classified as: p
CORRECT

...

Out of 100 examples, 96 were correctly classified.
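
For reference, the printed value of dt above suggests the tree representation in play: a leaf is just a classification symbol, while an interior node is a list whose car is the splitting attribute's name and whose remaining elements are (attribute-value subtree) pairs. The following is only a rough sketch of a classifier over that representation, not the provided classify procedure; it assumes mushroom-names lists the attributes in the same order as the values in each example, and the names position and classify-sketch are made up here for illustration.

(define (position x lst)                ; index of x in lst, or #f
  (let loop ((rest lst) (i 0))
    (cond ((null? rest) #f)
          ((eq? x (car rest)) i)
          (else (loop (cdr rest) (+ i 1))))))

(define (classify-sketch values tree names default)
  (if (not (pair? tree))
      tree                              ; leaf: the classification symbol
      (let ((pos (position (car tree) names)))
        (if (not pos)
            default                     ; attribute not found in names
            (let ((branch (assq (list-ref values pos) (cdr tree))))
              (if branch
                  (classify-sketch values (cadr branch) names default)
                  default))))))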

Problem 3

For this problem, you must do two things: write a discretize procedure (described below), and choose your discretization in some principled way. Obviously, "I just made up some values for my discretize procedure" is not as good as a more principled approach to the problem; we discussed an approach to discretizing continuous-valued attributes in class.

The training data sets for this problem are from the "Wisconsin Breast Cancer database". The examples in this database consist of measurements of a tissue sample; the classification is whether the sample is malignant or benign.

The measurements were done automatically from digitized images. You can see some of the original images at http://dollar.biz.uiowa.edu/~street/xcyt/images/.

There are 30 continuous-valued attributes for each example. You must write a procedure:

  (discretize attribute-values)
which will take a list of the 30 attribute values for an example from this database and return a list of attribute values that can be used by my learn-dtree procedure for Problem 2. Documentation for the data set is in the bc-data.txt file. The data set, along with some support code to help you test your procedure, is in the file bc-data.scm.

Here's an example discretize function (that doesn't purport to be any good since I just made up these threshold values):

(define (discretize attribute-values)
  (list (if (> (list-ref attribute-values 3) 1000) ; check fourth element
            'area>1000
            'area<=1000)
        (if (> (list-ref attribute-values 9) 0.06) ; check tenth element
            'fracdim>0.06
            'fracdim<=0.06)))
Note that this example takes the list of 30 attribute values and returns a list of two attribute values.
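
A more principled alternative to making up thresholds is to compute them from the training data. For example (and this is only an illustration; the approach we discussed in class may well differ), you could take the threshold for one attribute to be the midpoint between that attribute's mean over the malignant examples and its mean over the benign examples. The sketch below assumes, as the call (second (first bc-data-m2)) further down suggests, that each example is a (classification attribute-values) pair with the classes tagged m and b; attribute-mean and midpoint-threshold are names made up here, not part of the support code.

(define (attribute-mean data class index)
  ;; mean of the index-th attribute value over the examples in data
  ;; whose classification is class
  (let ((vals (map (lambda (example) (list-ref (second example) index))
                   (filter (lambda (example) (eq? (first example) class))
                           data))))
    (/ (apply + vals) (length vals))))

(define (midpoint-threshold data index)
  ;; midpoint between the malignant and benign means of one attribute
  (/ (+ (attribute-mean data 'm index)
        (attribute-mean data 'b index))
     2))

; e.g. (midpoint-threshold bc-data-m1 3) computes a threshold for the
; attribute tested first in the example above (index 3), derived from the
; training data rather than made up.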

Here's how you could test your discretize procedure using the procedures I've provided in the bc-data.scm file:

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; learn the decision tree from discretized data
;
(define dt (ldt-disc discretize bc-data-m1))
;Value: dt

dt
;Value: (1 (area>1000 m)
;	   (area<=1000 (2 (fracdim<=0.06 b)
;			  (fracdim>0.06 b))))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; classify a single example with the learned decision tree
;
; attribute values only of the first example from another data set
(define ex (second (first bc-data-m2)))
;Value: ex

ex
;Value 17: (11.22 33.81 70.79 386.8 .0778 .03574 ... )

(discretize ex)
;Value 18: (area<=1000 fracdim<=0.06)


(classify-disc discretize ex dt 'm)
;Value: b

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; test the learned decision tree against another data set
;
(testdt-disc discretize dt bc-data-s)
The example: (area<=1000 fracdim<=0.06)
was classified as: b
CORRECT

The example: (area>1000 fracdim>0.06)
was classified as: m
CORRECT

The example: (area<=1000 fracdim>0.06)
was classified as: b
WRONG: the correct classification is m.

The example: (area<=1000 fracdim>0.06)
was classified as: b
WRONG: the correct classification is m.

...

Out of 20 examples, 11 were correctly classified.
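
Once you have some way of computing thresholds from the training data, a natural next step is to discretize all 30 attributes rather than hand-picking two. The sketch below is only an illustration built on the midpoint-threshold helper made up above, not anything in the support code; since the learned tree branches on attribute positions (1, 2, and so on, as in the tree shown earlier), the same high/low value symbols can be reused across attributes.

(define thresholds
  ;; one data-derived threshold per attribute, computed from bc-data-m1
  (map (lambda (index) (midpoint-threshold bc-data-m1 index))
       (iota 30)))

(define (discretize attribute-values)
  ;; compare each of the 30 values against its attribute's threshold
  (map (lambda (value threshold)
         (if (> value threshold) 'high 'low))
       attribute-values
       thresholds))

Whether this particular discretization improves on 11 out of 20 is something you would have to check with testdt-disc, but at least its thresholds are justified by the data.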