CSCI 4150: Introduction to Artificial Intelligence, Fall 2004
(define missing-example-names '(color height size))

(define missing-example-tdata
  '((expensive (red tall large))
    (expensive (blue short ?))
    (cheap (blue tall large))
    (expensive (red tall small))
    (cheap (red short ?))
    (expensive (red short large))))
Computing the information required for each attribute at the root:

Color:
    Red:   expensive 3     I(3/4,1/4) * 4/6 = 0.54
           cheap     1
    Blue:  expensive 1     I(1/2,1/2) * 2/6 = 0.33
           cheap     1
    Information required = 0.87 bits

Height:
    Tall:  expensive 2     I(2/3,1/3) * 3/6 = 0.46
           cheap     1
    Short: expensive 2     I(2/3,1/3) * 3/6 = 0.46
           cheap     1
    Information required = 0.92 bits
Since 3/4 of the examples with known size are large and 1/4 are small, each example with a missing size contributes 3/4 of its weight to large and 1/4 to small:

Size:
    Large: expensive 2.75  I(2.75/4.50,1.75/4.50) * 4.5/6.0 = 0.72
           cheap     1.75
    Small: expensive 1.25  I(1.25/1.50,0.25/1.50) * 1.5/6.0 = 0.16
           cheap     0.25
    Information required = 0.88 bits

Color requires the least information, so it is the attribute to split on at the root.
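As a sanity check, here is a minimal sketch of the information computation; log2 and info are hypothetical helper names, not part of the course code. The weighted sum for Color reproduces the 0.87 bits above:

(define (log2 x) (/ (log x) (log 2)))

;; I(p,q): bits of information in a two-category distribution.
;; A zero probability contributes zero bits.
(define (info p q)
  (define (term p) (if (= p 0) 0 (- (* p (log2 p)))))
  (+ (term p) (term q)))

(+ (* 4/6 (info 3/4 1/4))
   (* 2/6 (info 1/2 1/2)))
;Value: .8741854163060885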
You should weight the training data right from the start rather than trying to deal with multiple data types (weighted and unweighted) and switching between them only when necessary.
First I'll weight all the data:
(define weighted-missing-example-tdata
  (map (lambda (x) (cons 1 x)) missing-example-tdata))
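Each example now carries a leading weight of 1:

weighted-missing-example-tdata
;Value: ((1 expensive (red tall large))
;        (1 expensive (blue short ?))
;        (1 cheap (blue tall large))
;        (1 expensive (red tall small))
;        (1 cheap (red short ?))
;        (1 expensive (red short large)))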
Note that when I split on size, I get this:
(split-tdata weighted-missing-example-tdata missing-example-names 'size)
;Value: ((small ((1 expensive (red tall small))))
;        (? ((1 cheap (red short ?))
;            (1 expensive (blue short ?))))
;        (large ((1 expensive (red short large))
;                (1 cheap (blue tall large))
;                (1 expensive (red tall large)))))
This is because split-tdata doesn't know anything about ? being a special attribute value to indicate missing data.
The simplest thing to do, both for computing the information required and for making the recursive calls to learn subtrees, is to fix this split by dividing up the examples with missing attribute values among the branches with known values.
For example, I would transform the above split into:
((small ((1 expensive (red tall small))
         (0.25 cheap (red short ?))
         (0.25 expensive (blue short ?))))
 (large ((1 expensive (red short large))
         (1 cheap (blue tall large))
         (1 expensive (red tall large))
         (0.75 cheap (red short ?))
         (0.75 expensive (blue short ?)))))
Note that I didn't change the missing attribute value to large or small because my program will never examine this attribute value again!
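Here is one way that fix-up might look; a minimal sketch, assuming the (value examples) split representation returned by split-tdata, with fix-split, example-weight, branch-weight, and scale-example as hypothetical names. (The case where every example is missing the attribute value is handled separately below.)

(define (example-weight ex) (car ex))

;; Scale an example's weight by the given fraction.
(define (scale-example ex fraction)
  (cons (* fraction (example-weight ex)) (cdr ex)))

;; Total weight of the examples in one branch of a split.
(define (branch-weight branch)
  (apply + (map example-weight (cadr branch))))

(define (fix-split split)
  (let ((missing (assq '? split))
        (known (filter (lambda (b) (not (eq? (car b) '?))) split)))
    (if (not missing)
        split
        (let ((total (apply + (map branch-weight known))))
          ;; Append a scaled copy of each missing-value example to every
          ;; known branch, in proportion to that branch's share of the weight.
          (map (lambda (branch)
                 (list (car branch)
                       (append (cadr branch)
                               (map (lambda (ex)
                                      (scale-example ex
                                                     (/ (branch-weight branch)
                                                        total)))
                                    (cadr missing)))))
               known)))))

Applied to the size split above, this yields the transformed split, with exact fractions 1/4 and 3/4 in place of 0.25 and 0.75.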
Once you have fixed the split, you can use tally-tdata on each part of the split, and it will properly count the weighted examples.
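For reference, that counting amounts to summing weights per category; a minimal sketch of such a tally (weighted-tally is a hypothetical stand-in, not the course's tally-tdata):

(define (weighted-tally examples)
  (let loop ((exs examples) (counts '()))
    (if (null? exs)
        counts
        (let* ((weight (caar exs))
               (category (cadar exs))
               (entry (assq category counts)))
          (loop (cdr exs)
                (if entry
                    (begin (set-cdr! entry (+ (cdr entry) weight))
                           counts)
                    (cons (cons category weight) counts)))))))

(weighted-tally '((1 expensive (red short large))
                  (1 cheap (blue tall large))
                  (1 expensive (red tall large))
                  (.75 cheap (red short ?))
                  (.75 expensive (blue short ?))))
;Value: ((cheap . 1.75) (expensive . 2.75))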
Continuing the example, splitting on color at the root gives:

(split-tdata weighted-missing-example-tdata missing-example-names 'color)
;Value: ((blue ((1 cheap (blue tall large))
;               (1 expensive (blue short ?))))
;        (red ((1 expensive (red short large))
;              (1 cheap (red short ?))
;              (1 expensive (red tall small))
;              (1 expensive (red tall large)))))
Within the red branch, the information required for height is:

Height:
    Short: expensive 1     I(1/2,1/2) * 2/4 = 0.50
           cheap     1
    Tall:  expensive 2     I(1,0)     * 2/4 = 0.00
           cheap     0
    Information required = 0.50 bits
Note that the fractional counts below are each missing-value example's original weight multiplied by the respective fraction (among the red examples with known size, 2/3 are large and 1/3 are small).
Size:
    Large: expensive 2.00  I(2/2.67,0.67/2.67) * 2.67/4 = 0.54
           cheap     0.67
    Small: expensive 1.00  I(1/1.33,0.33/1.33) * 1.33/4 = 0.27
           cheap     0.33
    Information required = 0.81 bits
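Height requires less information (0.50 bits versus 0.81), so the red subtree splits on height next. As a check with the info sketch from earlier, using the exact totals 8/3 and 4/3 for the large and small branches:

(+ (* (/ 8/3 4) (info (/ 2 8/3) (/ 2/3 8/3)))
   (* (/ 4/3 4) (info (/ 1 4/3) (/ 1/3 4/3))))
;Value: .8112781244591328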
It is possible for every remaining example to be missing its value for some attribute, in which case the split on that attribute contains only the ? branch. You can detect this condition by checking whether the split for that attribute has length 1 (i.e., only one attribute value) and that attribute value is ?.
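As a sketch, assuming the split representation used above (all-missing? is a hypothetical name):

(define (all-missing? split)
  (and (= (length split) 1)
       (eq? (caar split) '?)))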
Here is what you should do to handle this situation. This should require two minor modifications to your missing-learn-dtree.
Just to be clear, this split will look something like this:
((? ((1 yes (red ?))
     (1 no (blue ?))
     (1 no (red ?)))))
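On such a split, the detection test above fires:

(all-missing? '((? ((1 yes (red ?)) (1 no (blue ?)) (1 no (red ?))))))
;Value: #t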