Assignment 6 information

Problem 2 information

Here's an example of how you would use the support code to help test your learn-dtree procedure:

(define dt (learn-dtree mushroom-data1 mushroom-names))
;Value: dt

dt
;Value: (odor (c p)
;             (p p)
;             ...)

; take the 10th example from the second data set
(define t (list-ref mushroom-data2 9))
;Value: t

(car t) ; the correct classification
;Value: p

(second t) ; the attribute values
;Value: (x s n t p f c n w e e s s w w p w o p n s g)

(classify (second t) dt mushroom-names 'attribute-not-found)
;Value: p

; it was right!

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; test the learned decision tree on the second data set
;
(test-dtree dt mushroom-data2 mushroom-names)
The example: (f s w t f f c b w t b f s w w p w o p h v u)
was classified as: p
CORRECT

...

Out of 100 examples, 96 were correctly classified.
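
For reference, the printed value of dt above suggests the tree representation in play: a leaf is just a classification symbol, while an interior node is a list whose car is the splitting attribute's name and whose remaining elements are (attribute-value subtree) pairs. The following is only a rough sketch of a classifier over that representation, not the provided classify procedure; it assumes mushroom-names lists the attributes in the same order as the values in each example, and the names position and classify-sketch are made up here for illustration.

(define (position x lst)                ; index of x in lst, or #f
  (let loop ((rest lst) (i 0))
    (cond ((null? rest) #f)
          ((eq? x (car rest)) i)
          (else (loop (cdr rest) (+ i 1))))))

(define (classify-sketch values tree names default)
  (if (not (pair? tree))
      tree                              ; leaf: the classification symbol
      (let ((pos (position (car tree) names)))
        (if (not pos)
            default                     ; attribute not found in names
            (let ((branch (assq (list-ref values pos) (cdr tree))))
              (if branch
                  (classify-sketch values (cadr branch) names default)
                  default))))))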

Problem 3

For this problem, you must do two things: write a discretize procedure (described below), and choose your discretization in some principled way. Obviously, "I just made up some values for my discretize procedure" is not as good as a more principled approach to the problem; we discussed an approach to discretizing continuous-valued attributes in class.

The training data sets for this problem are from the "Wisconsin Breast Cancer database". The examples in this database consist of measurements of a tissue sample; the classification is whether the sample is malignant or benign.

The measurements were done automatically from digitized images. You can see some of the original images at http://dollar.biz.uiowa.edu/~street/xcyt/images/.

There are 30 continuous-valued attributes for each example. You must write a procedure:

  (discretize attribute-values)
which will take a list of the 30 attribute values for an example from this database and return a list of attribute values that can be used by my learn-dtree procedure for Problem 2. Documentation for the data set is in the bc-data.txt file. The data set, along with some support code to help you test your procedure, is in the file bc-data.scm.

Here's an example discretize function (that doesn't purport to be any good since I just made up these threshold values):

(define (discretize attribute-values)
  (list (if (> (list-ref attribute-values 3) 1000) ; check fourth element
            'area>1000
            'area<=1000)
        (if (> (list-ref attribute-values 9) 0.06) ; check tenth element
            'fracdim>0.06
            'fracdim<=0.06)))
Note that this example takes the list of 30 attribute values and returns a list of two attribute values.
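
A more principled alternative to making up thresholds is to compute them from the training data. For example (and this is only an illustration; the approach we discussed in class may well differ), you could take the threshold for one attribute to be the midpoint between that attribute's mean over the malignant examples and its mean over the benign examples. The sketch below assumes, as the call (second (first bc-data-m2)) further down suggests, that each example is a (classification attribute-values) pair with the classes tagged m and b; attribute-mean and midpoint-threshold are names made up here, not part of the support code.

(define (attribute-mean data class index)
  ;; mean of the index-th attribute value over the examples in data
  ;; whose classification is class
  (let ((vals (map (lambda (example) (list-ref (second example) index))
                   (filter (lambda (example) (eq? (first example) class))
                           data))))
    (/ (apply + vals) (length vals))))

(define (midpoint-threshold data index)
  ;; midpoint between the malignant and benign means of one attribute
  (/ (+ (attribute-mean data 'm index)
        (attribute-mean data 'b index))
     2))

; e.g. (midpoint-threshold bc-data-m1 3) computes a threshold for the
; attribute tested first in the example above (index 3), derived from the
; training data rather than made up.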

Here's how you could test your discretize procedure using the procedures I've provided in the bc-data.scm file:

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; learn the decision tree from discretized data
;
(define dt (ldt-disc discretize bc-data-m1))
;Value: dt

dt
;Value: (1 (area>1000 m)
;	   (area<=1000 (2 (fracdim<=0.06 b)
;			  (fracdim>0.06 b))))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; classify a single example with the learned decision tree
;
; attribute values only of the first example from another data set
(define ex (second (first bc-data-m2)))
;Value: ex

ex
;Value 17: (11.22 33.81 70.79 386.8 .0778 .03574 ... )

(discretize ex)
;Value 18: (area<=1000 fracdim<=0.06)


(classify-disc discretize ex dt 'm)
;Value: b

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; test the learned decision tree against another data set
;
(testdt-disc discretize dt bc-data-s)
The example: (area<=1000 fracdim<=0.06)
was classified as: b
CORRECT

The example: (area>1000 fracdim>0.06)
was classified as: m
CORRECT

The example: (area<=1000 fracdim>0.06)
was classified as: b
WRONG: the correct classification is m.

The example: (area<=1000 fracdim>0.06)
was classified as: b
WRONG: the correct classification is m.

...

Out of 20 examples, 11 were correctly classified.
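
Once you have some way of computing thresholds from the training data, a natural next step is to discretize all 30 attributes rather than hand-picking two. The sketch below is only an illustration built on the midpoint-threshold helper made up above, not anything in the support code; since the learned tree branches on attribute positions (1, 2, and so on, as in the tree shown earlier), the same high/low value symbols can be reused across attributes.

(define thresholds
  ;; one data-derived threshold per attribute, computed from bc-data-m1
  (map (lambda (index) (midpoint-threshold bc-data-m1 index))
       (iota 30)))

(define (discretize attribute-values)
  ;; compare each of the 30 values against its attribute's threshold
  (map (lambda (value threshold)
         (if (> value threshold) 'high 'low))
       attribute-values
       thresholds))

Whether this particular discretization improves on 11 out of 20 is something you would have to check with testdt-disc, but at least its thresholds are justified by the data.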