-
Note:
A few students have told me that they got the error:
;Hardware trap SIGSEGV
This turns out to be because my compiled support code is trying to
take the car of a symbol. The reason for this is that the
wrong kind of argument was passed to one of the support code
procedures. (In this case, a "split" was passed as the first argument
to split-tdata instead of a list of training data.) (I'll
make a note that I should add more error checking to my procedures for next
year...)
-
Question:
Why am I not doing better on the "classification" tests
for Problem 2? Can you give us some more information to help debug?
Answer:
Here's what I get with my solutions using the
"quality" data set (which you have). (Of course I am using a
different data set to test your procedure for the web tester.)
(define mdt (learn-dtree quality-data-large quality-names))
;Value: mdt
(test-dtree mdt quality-data-small quality-names)
;Value 4: (40 50)
Here's the decision tree I get (in case you're curious...)
(state (tennessee 4) (michigan 4) (virginia 4) (connecticut 4) (washington 3)
(indiana (enrollment (thous:5-10 4) (thous:0-5 3))) (southdakota 2) (maine 3)
(district-of-columbia (control (state 2) (private 3))) (oregon 2)
(rhodeisland 5) (minnesota 4)
(oklahoma (location (suburban 3) (small-town 4) (urban 4)))
(maryland 3) (louisiana 4) (colorado (location (small-city 5) (suburban 3)))
(california (applicants (thous:7-10 4) (thous:0-4 (academics (5 3) (3 4) (4 5))) (thous:13-17 5) (thous:4-7 (location (urban 3) (suburban 4))) (thous:10-13 3)))
(texas (enrollment (thous:0-5 3) (thous:20+ 3) (thous:10-15 2) (thous:5-10 3)))
(newyork (enrollment (thous:5-10 (location (small-town 4) (urban 3))) (thous:20+ 4) (thous:10-15 2) (thous:15-20 3) (thous:0-5 (m:f-ratio (mf>=5:1 3) (mf>=1:1 (location (suburban 3) (urban 2))) (mf<1:1 4)))))
(ohio 3)
(pennsylvania (enrollment (thous:15-20 2) (thous:20+ 4) (thous:0-5 3) (thous:5-10 (location (small-city 2) (urban 3)))))
(newjersey (expenses (thous$:7-10 4) (thous$:4-7 3) (thous$:10+ 3)))
(illinois 3)
(massachusetts (sat/math (sat>=600 (sat/verbal (sat>=600 3) (sat>=700 4) (sat>=500 4))) (sat>=700 3) (sat>=500 (location (urban 3) (small-town 5) (suburban 3) (small-city 3)))))
(newhampshire 3) (florida (location (urban 4) (small-city 3))))
If you want to see this printed nicely, give it to the "pp" procedure
(which stands for "pretty print").
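For example, with the definitions above you could evaluate:
(pp mdt)
and MIT Scheme will print the tree with indentation that makes the nesting easier to read.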
-
Question:
[For problem 2,] if our procedure
returns a different tree than yours, but it gets the right value from
the (classify) function and it passes all of the tests in the
(test-dtree) function, is this going to decrease our grade?
It seems as if my tree is working and it gets all 12 of the test cases
correct for the restaurant problem, 200 of 200 test cases for the mushroom
problem when training with mushroom-data1 and testing on mushroom-data1, 96
of 100 test cases for the mushroom problem when training with
mushroom-data1 and testing on mushroom-data2. However when I compare my
tree to yours using the restaurant problem, they are slightly different.
Not the order, rather the attributes it chooses further down in the tree.
Is this going to be negatively graded?
Answer:
For the "basic" tests, your decision tree must match the solution more
or less exactly (the order that the attributes appear at any level
doesn't matter). The examples used for testing have been designed so
that there is a unique answer, i.e. there are no ties to break
arbitrarily in these examples.
In the "information" tests, it only checks whether you've picked the
correct top level attribute.
In the "classify" tests, the web tester only looks at the number of
examples you classify correctly; it does not look at your decision tree.
-
Question:
[for Problem 3] ...when I generate the best [threshold] values for all
my attributes, they are the values for that specific training data set. I
get different values for bc-data-s than for bc-data-m1 and so forth. So if I
know what my best threshold value is for each of these training sets, then
how can I get the best for all of them?
Answer:
Well, the ideal situation would be if you could
get a copy of the testing data set that I am using. That way you
could use the values generated from the data set that will actually be
used to measure the performance of a decision tree learned with your
discretize procedure.
However, the point of learning is that you train a system on
training data so it can work on future (unseen) inputs. We make the
assumption that the training data are representative, and that's as
close as we can get to knowing what the future inputs will be.
In general, the strategy for getting the best attribute values is
to use as much training data as you can get. One thing I should
mention is that all the training data sets you have were randomly
selected from a larger set of 379 examples, so the intersection of any
two of your data sets may not be empty. The testing data set that the
web tester uses consists of 190 examples; there is no overlap with the
379 example training data set.
Another thing I'll mention about the web tester is that the
decision tree is learned from bc-data-l1 and
bc-data-l2 (appended together).
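If you want to reproduce that setup locally, the training set the web tester
uses is just the two lists appended; a minimal sketch (the variable name here
is mine, not part of the support code):
(define bc-training-data (append bc-data-l1 bc-data-l2))
You would then run your discretize and learning steps on bc-training-data the
same way you do on the smaller sets.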
-
Question:
The algorithm for learning a decision tree states that you should return
the default if the training-data passed to the function is empty. However,
we don't seem to have a default value. What are we supposed to return for
the default value? Since the default value might change for each tree, I'm
not sure what we'd return in order for the learning procedure to be general.
Answer:
You do not need a
default value when learning a decision tree, because you will never
have a "split" that is empty. The splits are based on the training
data that is actually there; the learning procedure does not need to know
all the possible attribute values when it does a split.
When you test a decision tree (the way I have structured this
assignment), you do need to give it a default value.
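To make that second point concrete, here is a minimal sketch of where the
default comes in at classification time. The tree format follows the printout
above (a leaf classification, or the attribute followed by (value subtree)
branches); I'm assuming, purely for illustration, that an example is an
association list of (attribute . value) pairs -- that is not necessarily the
support code's representation, and this is not the assignment's classify.
(define (classify-sketch tree example default)
  (if (not (pair? tree))
      tree                                  ;; leaf: the classification itself
      (let* ((attribute (car tree))
             (entry (assv attribute example))
             ;; find the branch for this example's value of the attribute
             (branch (and entry (assv (cdr entry) (cdr tree)))))
        (if branch
            (classify-sketch (cadr branch) example default)
            ;; the tree never saw this attribute value during learning,
            ;; so fall back on the default classification
            default))))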
-
Question:
I know that the actual classifications for a goal predicate will not
necessarily be 'yes' or 'no' in the training-data, but can we always count
on there being just two classifications for a goal predicate, or can there
be any number of classifications? (for example, good-snorkeling? = yes, no,
or maybe)
Answer:
There can be more than two goal classifications.
-
Question:
Can we assume there is always at least one example in the original
training-data given to us? (kind of a trivial question yet important)
Answer:
Yes.
-
Question:
Another question I have regarding problem 2 and learning decision trees in
general. First, in order to find the root node, you take the total
information to be learned, find the information learned by a certain
attribute, and subtract that from the total information to be learned. This
result is your information gain. Correct? For example, if the total
information to be learned is 1 bit, and the highest information value for
an attribute is .5, then 1 - .5 = .5 which would be the total information
gain for splitting on that attribute. If that was the highest information
gain of all the attributes then that particular attribute would be your
root node.
Now my question is what happens when you're finding the next node. What is
the total information that you subtract from? The information to be learned
for the entire problem when finding the root node was 1 bit. So we
subtracted the information to be learned from an attribute from this total
to get .5. Now what do we subtract from? When we evaluate further nodes, do
we still subtract that value from 1 in order to get the information gain?
Or do we subtract it from the total we received from the root node, .5? Or
do we subtract it from some other number?
Answer:
there is no "highest information value for an attribute". you split
the examples according to the values of one particular attribute,
calculate the information still required to classify each subset of
examples, and then compute an 'expected value' of the information
required after splitting on the attribute. you would pick an
attribute for the root node of the (sub)tree that had the greatest
information gain, as you otherwise described.
the recursive calls to the algorithm take each subset of the training
data from a split. the information required to classify those
examples would be the "starting information" for the recursive call.
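To make the bookkeeping concrete, here is a minimal sketch of the quantities
described above (these procedure names are mine, not the support code's),
working directly from counts of examples in each goal class:
(define (log2 x) (/ (log x) (log 2)))

;; Bits needed to classify a set, given the number of examples in each
;; goal class, e.g. (6 6) for six positive and six negative examples.
(define (information counts)
  (let ((total (apply + counts)))
    (apply +
           (map (lambda (k)
                  (if (= k 0)
                      0
                      (let ((p (/ k total)))
                        (* (- p) (log2 p)))))
                counts))))

;; Expected bits still needed after a split: weight each subset's
;; information by the fraction of the examples that landed in it.
(define (remaining-information subset-counts)
  (let ((total (apply + (map (lambda (c) (apply + c)) subset-counts))))
    (apply +
           (map (lambda (c)
                  (* (/ (apply + c) total) (information c)))
                subset-counts))))

;; Gain for an attribute = information before the split minus the
;; expected information still required after splitting on it.
(define (gain before-counts subset-counts)
  (- (information before-counts) (remaining-information subset-counts)))
For instance, starting from 6 yes / 6 no examples (1 bit), a split that
produces a pure (4 0) subset and a mixed (2 6) subset leaves about 0.54 bits,
so (gain '(6 6) '((4 0) (2 6))) is about 0.46. In the recursive calls, the
"before" counts are just the counts for that subset of the training data.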
-
Question:
I think my approach to the discretize function is slightly too extensive.
What I have been doing is creating a list of breaks between the
classification of a single attribute. For example: If attribute 1 (called
'ave-radius) had the following values in the training data:
(ave-radius-values (0 4 7 2 5 8 9))
and they were classified as
(ave-radius-class (b m b b b m m))
then I sort the data with their classifications:
((0 b)
(2 b)
(4 m)
(5 b)
(7 b)
(8 m)
(9 m))
then I look for the changes in classification ("breaks") and combine
those lists.
((0 - 2) ;from 0 to 2 all were classified as b
4 ;4 was classified as m
(5 - 7) ;from 5 to 7 all were classified as b
(8 - 9)) ;from 8 to 9 all were classified as m
Now here is where I get confused. You choose one value for the threshold.
I thought we were supposed to choose threshold values depending on where
the changes occur in classification. In this instance I would have 4
threshold values like the list above. But you seem to choose one threshold
value, determined by the first change in classification.
What I am really asking is, which is the method you taught us in class?
Was it
- Find all threshold values where the changes occur in the sorted list,
or was it
- Find one threshold value, determined by the first change
in classification of the sorted list?
Answer:
The method we discussed in class was to find all the values where the
goal attribute changes, and then test each one to see which is the best
to use as a threshold that would decide between two (discrete)
values. Of course, you could extend this even further and have
multiple thresholds to decide among more than two discrete values,
but (at least in this case) that shouldn't be necessary.
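Here is a minimal sketch of the first step (the names are mine, not the
support code's): given (value . classification) pairs already sorted by
value, collect a candidate threshold wherever the goal classification
changes. Each candidate would then be scored, e.g. by information gain,
and the best one kept.
(define (candidate-thresholds sorted-pairs)
  (if (or (null? sorted-pairs) (null? (cdr sorted-pairs)))
      '()
      (let ((this (car sorted-pairs))
            (next (cadr sorted-pairs)))
        (if (eq? (cdr this) (cdr next))
            (candidate-thresholds (cdr sorted-pairs))
            ;; the midpoint between the two values is a common choice
            (cons (/ (+ (car this) (car next)) 2)
                  (candidate-thresholds (cdr sorted-pairs)))))))
Using the data from the question:
(candidate-thresholds '((0 . b) (2 . b) (4 . m) (5 . b) (7 . b) (8 . m) (9 . m)))
;; => (3 9/2 15/2), i.e. candidate thresholds at 3, 4.5, and 7.5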
-
Question:
I have a quick question regarding problem 1 of the current assignment. It
states that we need to start the learning tree. So all you want us to do is
find the attribute that would be set as the root node and stop? You don't
want us to figure out the entire decision tree? I'm just not completely
clear on this, because this is what I thought we were supposed to do on the
last quiz, but when you went over the quiz I saw that apparently
we were supposed to figure out the entire tree. I would hate to make
that mistake again.
Answer:
Yes, you only need to determine the attribute at
the root node of the decision tree for Problem 1 of the assignment.
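In other words, the computation stops once you know which attribute has the
largest information gain. A minimal sketch (the name and the input format are
mine), assuming you have already computed a list of (attribute . gain) pairs:
(define (best-attribute attribute-gains)
  ;; sort by gain, largest first, and return the winning attribute
  (caar (sort attribute-gains
              (lambda (a b) (> (cdr a) (cdr b))))))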