-
Note:
A few students have told me that they got the error:
;Hardware trap SIGSEGV
This turns out to be because my compiled support code is trying to
take the car of a symbol. The reason for this is that the
wrong kind of argument was passed to one of the support code
procedures. (In this case, a "split" was passed as the first argument
to split-tdata instead of a list of training data.) (I'll
make a note that I should add more error checking to my procedures for next
year...)
-
Question:
Why am I not doing better on the "classification" tests
for Problem 2? Can you give us some more information to help debug?
Answer:
Here's what I get with my solutions using the
"quality" data set (which you have). (Of course I am using a
different data set to test your procedure for the web tester.)
(define mdt (learn-dtree quality-data-large quality-names))
;Value: mdt
(test-dtree mdt quality-data-small quality-names)
;Value 4: (40 50)
Here's the decision tree I get (in case you're curious...)
(state (tennessee 4) (michigan 4) (virginia 4) (connecticut 4) (washington 3)
(indiana (enrollment (thous:5-10 4) (thous:0-5 3))) (southdakota 2) (maine 3)
(district-of-columbia (control (state 2) (private 3))) (oregon 2)
(rhodeisland 5) (minnesota 4)
(oklahoma (location (suburban 3) (small-town 4) (urban 4)))
(maryland 3) (louisiana 4) (colorado (location (small-city 5) (suburban 3)))
(california (applicants (thous:7-10 4) (thous:0-4 (academics (5 3) (3 4) (4 5))) (thous:13-17 5) (thous:4-7 (location (urban 3) (suburban 4))) (thous:10-13 3)))
(texas (enrollment (thous:0-5 3) (thous:20+ 3) (thous:10-15 2) (thous:5-10 3)))
(newyork (enrollment (thous:5-10 (location (small-town 4) (urban 3))) (thous:20+ 4) (thous:10-15 2) (thous:15-20 3) (thous:0-5 (m:f-ratio (mf>=5:1 3) (mf>=1:1 (location (suburban 3) (urban 2))) (mf<1:1 4)))))
(ohio 3)
(pennsylvania (enrollment (thous:15-20 2) (thous:20+ 4) (thous:0-5 3) (thous:5-10 (location (small-city 2) (urban 3)))))
(newjersey (expenses (thous$:7-10 4) (thous$:4-7 3) (thous$:10+ 3)))
(illinois 3)
(massachusetts (sat/math (sat>=600 (sat/verbal (sat>=600 3) (sat>=700 4) (sat>=500 4))) (sat>=700 3) (sat>=500 (location (urban 3) (small-town 5) (suburban 3) (small-city 3)))))
(newhampshire 3) (florida (location (urban 4) (small-city 3))))
If you want to see this printed nicely, give it to the "pp" procedure
(which stands for "pretty print").
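For example, with the definitions above you could evaluate:
(pp mdt)
and MIT Scheme will print the tree with indentation that makes the nesting easier to read.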
-
Question:
[For problem 2,] if our procedure
returns a different tree than yours, but it gets the right value from
the (classify) function and it passes all of the tests in the
(test-dtree) function, is this going to decrease our grade?
It seems as if my tree is working and it gets all 12 of the test cases
correct for the restaurant problem, 200 of 200 test cases for the mushroom
problem when training with mushroom-data1 and testing on mushroom-data1, 96
of 100 test cases for the mushroom problem when training with
mushroom-data1 and testing on mushroom-data2. However when I compare my
tree to yours using the restaurant problem, they are slightly different.
Not the order, rather the attributes it chooses further down in the tree.
Is this going to be negatively graded?
Answer:
For the "basic" tests, your decision tree must match the solution more
or less exactly (the order that the attributes appear at any level
doesn't matter). The examples used for testing have been designed so
that there is a unique answer, i.e. there are no ties to break
arbitrarily in these examples.
In the "information" tests, it only checks whether you've picked the
correct top level attribute.
In the "classify" tests, the web tester only looks at the number of
examples you classify correctly; it does not look at your decision tree.
-
Question:
[for Problem 3] ...when I generate the best [threshold] values for all
my attributes, they are the values for that specific training data set. I
get different values for bc-data-s than for bc-data-m1 and so forth. So if I
know what my best threshold value is for each of these training sets, then
how can I get the best for all of them?
Answer:
Well, the ideal situation would be if you could
get a copy of the testing data set that I am using. That way you
could use the values generated from the data set that will actually be
used to measure the performance of a decision tree learned with your
discretize procedure.
However, the point of learning is that you train a system on
training data so it can work on future (unseen) inputs. We make the
assumption that the training data are representative, and that's as
close as we can get to knowing what the future inputs will be.
In general, the strategy for getting the best attribute values is
to use as much training data as you can get. One thing I should
mention is that all the training data sets you have were randomly
selected from a larger set of 379 examples, so the intersection of any
two of your data sets may not be empty. The testing data set that the
web tester uses consists of 190 examples; there is no overlap with the
379 example training data set.
Another thing I'll mention about the web tester is that the
decision tree is learned from bc-data-l1 and
bc-data-l2 (appended together).
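If you want to reproduce that setup locally, the training set the web tester
uses is just the two lists appended; a minimal sketch (the variable name here
is mine, not part of the support code):
(define bc-training-data (append bc-data-l1 bc-data-l2))
You would then run your discretize and learning steps on bc-training-data the
same way you do on the smaller sets.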
-
Question:
The algorithm for learning a decision tree states that you should return
the default if the training-data passed to the function is empty. However,
we don't seem to have a default value. What are we supposed to return for
the default value? Since the default value might change for each tree, I'm
not sure what we'd return in order for the learning procedure to be general.
Answer:
You do not need a
default value when learning a decision tree, because you will never
have a "split" that is empty. The splits are based on the training
data that is actually there; the learning procedure does not need to know
all the possible attribute values when it does a split.
When you test a decision tree (the way I have structured this
assignment), you do need to give it a default value.
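To make that second point concrete, here is a minimal sketch of where the
default comes in at classification time. The tree format follows the printout
above (a leaf classification, or the attribute followed by (value subtree)
branches); I'm assuming, purely for illustration, that an example is an
association list of (attribute . value) pairs -- that is not necessarily the
support code's representation, and this is not the assignment's classify.
(define (classify-sketch tree example default)
  (if (not (pair? tree))
      tree                                  ;; leaf: the classification itself
      (let* ((attribute (car tree))
             (entry (assv attribute example))
             ;; find the branch for this example's value of the attribute
             (branch (and entry (assv (cdr entry) (cdr tree)))))
        (if branch
            (classify-sketch (cadr branch) example default)
            ;; the tree never saw this attribute value during learning,
            ;; so fall back on the default classification
            default))))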
-
Question:
I know that the actual classifications for a goal predicate will not
necessarily be 'yes' or 'no' in the training-data, but can we always count
on there being just two classifications for a goal predicate, or can there
be any number of classifications? (for example, good-snorkeling? = yes, no,
or maybe)
Answer:
There can be more than two goal classifications.
-
Question:
Can we assume there is always at least one example in the original
training-data given to us? (kind of a trivial question yet important)
Answer:
Yes.
-
Question:
Another question I have regarding problem 2 and learning decision trees in
general. First, in order to find the root node, you take the total
information to be learned, find the information learned by a certain
attribute, and subtract that from the total information to be learned. This
result is your information gain. Correct? For example, if the total
information to be learned is 1 bit, and the highest information value for
an attribute is .5, then 1 - .5 = .5 which would be the total information
gain for splitting on that attribute. If that was the highest information
gain of all the attributes then that particular attribute would be your
root node.
Now my question is what happens when you're finding the next node. What is
the total information that you subtract from? The information to be learned
for the entire problem when finding the root node was 1 bit. So we
subtracted the information to be learned from an attribute from this total
to get .5. Now what do we subtract from? When we evaluate further nodes, do
we still subtract that value from 1 in order to get the information gain?
Or do we subtract it from the total we received from the root node, .5? Or
do we subtract it from some other number?
Answer:
there is no "highest information value for an attribute". you split
the examples according to the values of one particular attribute,
calculate the information still required to classify each subset of
examples, and then compute an 'expected value' of the information
required after splitting on the attribute. you would pick an
attribute for the root node of the (sub)tree that had the greatest
information gain, as you otherwise described.
the recursive calls to the algorithm take each subset of the training
data from a split. the information required to classify those
examples would be the "starting information" for the recursive call.
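To make the bookkeeping concrete, here is a minimal sketch of the quantities
described above (these procedure names are mine, not the support code's),
working directly from counts of examples in each goal class:
(define (log2 x) (/ (log x) (log 2)))

;; Bits needed to classify a set, given the number of examples in each
;; goal class, e.g. (6 6) for six positive and six negative examples.
(define (information counts)
  (let ((total (apply + counts)))
    (apply +
           (map (lambda (k)
                  (if (= k 0)
                      0
                      (let ((p (/ k total)))
                        (* (- p) (log2 p)))))
                counts))))

;; Expected bits still needed after a split: weight each subset's
;; information by the fraction of the examples that landed in it.
(define (remaining-information subset-counts)
  (let ((total (apply + (map (lambda (c) (apply + c)) subset-counts))))
    (apply +
           (map (lambda (c)
                  (* (/ (apply + c) total) (information c)))
                subset-counts))))

;; Gain for an attribute = information before the split minus the
;; expected information still required after splitting on it.
(define (gain before-counts subset-counts)
  (- (information before-counts) (remaining-information subset-counts)))
For instance, starting from 6 yes / 6 no examples (1 bit), a split that
produces a pure (4 0) subset and a mixed (2 6) subset leaves about 0.54 bits,
so (gain '(6 6) '((4 0) (2 6))) is about 0.46. In the recursive calls, the
"before" counts are just the counts for that subset of the training data.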
-
Question:
I think my approach to the discretize function is slightly too extensive.
What I have been doing is creating a list of breaks between the
classification of a single attribute. For example: If attribute 1 (called
'ave-radius) had the following values in the training data:
(ave-radius-values (0 4 7 2 5 8 9))
and they were classified as
(ave-radius-class (b m b b b m m))
then I sort the data with their classifications:
((0 b)
(2 b)
(4 m)
(5 b)
(7 b)
(8 m)
(9 m))
then I look for the changes in classification ("breaks") and combine
those lists.
((0 - 2) ;from 0 to 2 all were classified as b
4 ;4 was classified as m
(5 - 7) ;from 5 to 7 all were classified as b
(8 - 9)) ;from 8 to 9 all were classified as m
Now here is where I get confused. You choose one value for the threshold.
I thought we were supposed to choose threshold values depending on where
the changes occur in classification. In this instance I would have 4
threshold values like the list above. But you seem to choose one threshold
value, determined by the first change in classification.
What I am really asking is, which is the method you taught us in class?
Was it
- Find all threshold values where the changes occur in the sorted list,
or was it
- Find one threshold value, determined by the first change
in classification of the sorted list?
Answer:
The method we discussed in class was to find all the values where the
goal attribute changes, and then test each one to see which is the best
to use as a threshold that would decide between two (discrete)
values. Of course, you could extend this even further and have
multiple thresholds to decide among more than two discrete values,
but (at least in this case) that shouldn't be necessary.
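Here is a minimal sketch of the first step (the names are mine, not the
support code's): given (value . classification) pairs already sorted by
value, collect a candidate threshold wherever the goal classification
changes. Each candidate would then be scored, e.g. by information gain,
and the best one kept.
(define (candidate-thresholds sorted-pairs)
  (if (or (null? sorted-pairs) (null? (cdr sorted-pairs)))
      '()
      (let ((this (car sorted-pairs))
            (next (cadr sorted-pairs)))
        (if (eq? (cdr this) (cdr next))
            (candidate-thresholds (cdr sorted-pairs))
            ;; the midpoint between the two values is a common choice
            (cons (/ (+ (car this) (car next)) 2)
                  (candidate-thresholds (cdr sorted-pairs)))))))
Using the data from the question:
(candidate-thresholds '((0 . b) (2 . b) (4 . m) (5 . b) (7 . b) (8 . m) (9 . m)))
;; => (3 9/2 15/2), i.e. candidate thresholds at 3, 4.5, and 7.5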
-
Question:
I have a quick question regarding problem 1 of the current assignment. It
states that we need to start the learning tree. So all you want us to do is
find the attribute that would be set as the root node and stop? You don't
want us to figure out the entire decision tree? I'm just not completely
clear on this, because this is what I thought we were supposed to do on the
last quiz, but when you went over the quiz I saw that apparently
we were supposed to figure out the entire tree. I would hate to make
that mistake again.
Answer:
Yes, you only need to determine the attribute at
the root node of the decision tree for Problem 1 of the assignment.
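In other words, the computation stops once you know which attribute has the
largest information gain. A minimal sketch (the name and the input format are
mine), assuming you have already computed a list of (attribute . gain) pairs:
(define (best-attribute attribute-gains)
  ;; sort by gain, largest first, and return the winning attribute
  (caar (sort attribute-gains
              (lambda (a b) (> (cdr a) (cdr b))))))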