CSCI 4150: Introduction to Artificial Intelligence, Fall 2004
(define missing-example-names '(color height size))

(define missing-example-tdata
  '((expensive (red tall large))
    (expensive (blue short ?))
    (cheap (blue tall large))
    (expensive (red tall small))
    (cheap (red short ?))
    (expensive (red short large))))
Computing the information required for each attribute at the root:

Color:
    Red:   expensive 3     I(3/4,1/4) * 4/6 = 0.54
           cheap     1
    Blue:  expensive 1     I(1/2,1/2) * 2/6 = 0.33
           cheap     1
    Information required = 0.87 bits

Height:
    Tall:  expensive 2     I(2/3,1/3) * 3/6 = 0.46
           cheap     1
    Short: expensive 2     I(2/3,1/3) * 3/6 = 0.46
           cheap     1
    Information required = 0.92 bits
Since 3/4 of the examples with known size are large and 1/4 are small, each example with a missing size contributes 3/4 of its weight to large and 1/4 to small:

Size:
    Large: expensive 2.75  I(2.75/4.50,1.75/4.50) * 4.5/6.0 = 0.72
           cheap     1.75
    Small: expensive 1.25  I(1.25/1.50,0.25/1.50) * 1.5/6.0 = 0.16
           cheap     0.25
    Information required = 0.88 bits

Color requires the least information, so it is the attribute to split on at the root.
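As a sanity check, here is a minimal sketch of the information computation; log2 and info are hypothetical helper names, not part of the course code. The weighted sum for Color reproduces the 0.87 bits above:

(define (log2 x) (/ (log x) (log 2)))

;; I(p,q): bits of information in a two-category distribution.
;; A zero probability contributes zero bits.
(define (info p q)
  (define (term p) (if (= p 0) 0 (- (* p (log2 p)))))
  (+ (term p) (term q)))

(+ (* 4/6 (info 3/4 1/4))
   (* 2/6 (info 1/2 1/2)))
;Value: .8741854163060885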
You should weight the training data right from the start rather than trying to deal with multiple data types (weighted and unweighted) and switching between them only when necessary.
First I'll weight all the data:
(define weighted-missing-example-tdata
  (map (lambda (x) (cons 1 x)) missing-example-tdata))
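Each example now carries a leading weight of 1:

weighted-missing-example-tdata
;Value: ((1 expensive (red tall large))
;        (1 expensive (blue short ?))
;        (1 cheap (blue tall large))
;        (1 expensive (red tall small))
;        (1 cheap (red short ?))
;        (1 expensive (red short large)))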
Note that when I split on size, I get this:
(split-tdata weighted-missing-example-tdata missing-example-names 'size)
;Value: ((small ((1 expensive (red tall small))))
;        (? ((1 cheap (red short ?))
;            (1 expensive (blue short ?))))
;        (large ((1 expensive (red short large))
;                (1 cheap (blue tall large))
;                (1 expensive (red tall large)))))
This is because split-tdata doesn't know anything about ? being a special attribute value to indicate missing data.
The simplest thing to do, both for computing the information required and for making the recursive calls to learn subtrees, is to fix this split by dividing up the examples with missing attribute values among the branches with known values.
For example, I would transform the above split into:
((small ((1 expensive (red tall small))
         (0.25 cheap (red short ?))
         (0.25 expensive (blue short ?))))
 (large ((1 expensive (red short large))
         (1 cheap (blue tall large))
         (1 expensive (red tall large))
         (0.75 cheap (red short ?))
         (0.75 expensive (blue short ?)))))
Note that I didn't change the missing attribute value to large or small because my program will never examine this attribute value again!
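Here is one way that fix-up might look; a minimal sketch, assuming the (value examples) split representation returned by split-tdata, with fix-split, example-weight, branch-weight, and scale-example as hypothetical names. (The case where every example is missing the attribute value is handled separately below.)

(define (example-weight ex) (car ex))

;; Scale an example's weight by the given fraction.
(define (scale-example ex fraction)
  (cons (* fraction (example-weight ex)) (cdr ex)))

;; Total weight of the examples in one branch of a split.
(define (branch-weight branch)
  (apply + (map example-weight (cadr branch))))

(define (fix-split split)
  (let ((missing (assq '? split))
        (known (filter (lambda (b) (not (eq? (car b) '?))) split)))
    (if (not missing)
        split
        (let ((total (apply + (map branch-weight known))))
          ;; Append a scaled copy of each missing-value example to every
          ;; known branch, in proportion to that branch's share of the weight.
          (map (lambda (branch)
                 (list (car branch)
                       (append (cadr branch)
                               (map (lambda (ex)
                                      (scale-example ex
                                                     (/ (branch-weight branch)
                                                        total)))
                                    (cadr missing)))))
               known)))))

Applied to the size split above, this yields the transformed split, with exact fractions 1/4 and 3/4 in place of 0.25 and 0.75.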
Once you have fixed the split, you can use tally-tdata on each part of the split, and it will properly count the weighted examples.
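For reference, that counting amounts to summing weights per category; a minimal sketch of such a tally (weighted-tally is a hypothetical stand-in, not the course's tally-tdata):

(define (weighted-tally examples)
  (let loop ((exs examples) (counts '()))
    (if (null? exs)
        counts
        (let* ((weight (caar exs))
               (category (cadar exs))
               (entry (assq category counts)))
          (loop (cdr exs)
                (if entry
                    (begin (set-cdr! entry (+ (cdr entry) weight))
                           counts)
                    (cons (cons category weight) counts)))))))

(weighted-tally '((1 expensive (red short large))
                  (1 cheap (blue tall large))
                  (1 expensive (red tall large))
                  (.75 cheap (red short ?))
                  (.75 expensive (blue short ?))))
;Value: ((cheap . 1.75) (expensive . 2.75))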
Continuing the example, splitting on color at the root gives:

(split-tdata weighted-missing-example-tdata missing-example-names 'color)
;Value: ((blue ((1 cheap (blue tall large))
;               (1 expensive (blue short ?))))
;        (red ((1 expensive (red short large))
;              (1 cheap (red short ?))
;              (1 expensive (red tall small))
;              (1 expensive (red tall large)))))
Within the red branch, the information required for height is:

Height:
    Short: expensive 1     I(1/2,1/2) * 2/4 = 0.50
           cheap     1
    Tall:  expensive 2     I(1,0)     * 2/4 = 0.00
           cheap     0
    Information required = 0.50 bits
Note that the fractional counts below are each missing-value example's original weight multiplied by the respective fraction (among the red examples with known size, 2/3 are large and 1/3 are small).
Size:
    Large: expensive 2.00  I(2/2.67,0.67/2.67) * 2.67/4 = 0.54
           cheap     0.67
    Small: expensive 1.00  I(1/1.33,0.33/1.33) * 1.33/4 = 0.27
           cheap     0.33
    Information required = 0.81 bits
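Height requires less information (0.50 bits versus 0.81), so the red subtree splits on height next. As a check with the info sketch from earlier, using the exact totals 8/3 and 4/3 for the large and small branches:

(+ (* (/ 8/3 4) (info (/ 2 8/3) (/ 2/3 8/3)))
   (* (/ 4/3 4) (info (/ 1 4/3) (/ 1/3 4/3))))
;Value: .8112781244591328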
It is possible for every remaining example to be missing its value for some attribute, in which case the split on that attribute contains only the ? branch. You can detect this condition by checking whether the split for that attribute has length 1 (i.e., only one attribute value) and that attribute value is ?.
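As a sketch, assuming the split representation used above (all-missing? is a hypothetical name):

(define (all-missing? split)
  (and (= (length split) 1)
       (eq? (caar split) '?)))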
Here is what you should do to handle this situation. This should require two minor modifications to your missing-learn-dtree.
Just to be clear, this split will look something like this:
((? ((1 yes (red ?))
     (1 no (blue ?))
     (1 no (red ?)))))
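On such a split, the detection test above fires:

(all-missing? '((? ((1 yes (red ?)) (1 no (blue ?)) (1 no (red ?))))))
;Value: #t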