CSCI 4150: Introduction to Artificial Intelligence, Fall 2004 |
Instead of returning a procedure, they will just return threshold values. create-binary-discretization should return a number, and create-multi-discretization should return a list of two numbers (with the lower threshold first).
One other change: these procedures will no longer take a values argument. I have changed the assign6.scm stubs file above to reflect this.
See the Discretization section for more details and examples.
version 1.4.1, released 11/13: This version includes the procedure discretize-tdata, described in the discretization section below. There are no other changes to the support code.
Note that the last three data sets actually define different data sets for that domain. To properly gauge the accuracy of the decision tree, you should learn the decision tree using one data set and test it using a different data set.
Learn a decision tree from the discretized version of the data set using your learn-dtree procedure. Include the resulting decision tree in your writeup.
Weight | Color | Goal --------------------- 1 | Red | No 1 | Green | No 1 | ? | Yes 1 | Green | Yes 1 | Blue | YesNote that the color "red" is 1/4 of the attribute values, "green" is 1/2, and "blue" is 1/4.
If we were to split on the attribute "color", then the examples recursively propogated to the three subtrees are (respectively):
Weight | Color | Goal Weight | Color | Goal Weight | Color | Goal --------------------- --------------------- --------------------- 1 | Red | No 1 | Green | No 1 | Blue | Yes 0.25 | ? | Yes 0.5 | ? | Yes 0.25 | ? | Yes 1 | Green | Yes
Divide training data into groups according to the specified attribute. For example:
> loan-names ;Value: (house bills income credit) > loan-data-small ;Value: ((no (own late high good)) ; (yes (own on-time medium good)) ; (no (own on-time high good)) ; (no (own on-time high good))) > (split-tdata loan-data-small loan-names 'bills) ;Value: ((on-time ((no (own on-time high good)) ; (no (own on-time high good)) ; (yes (own on-time medium good)))) ; (late ((no (own late high good)))))
The order of the attribute values is not guaranteed.
Note that this procedure will also work with weighted training data.
Given a list of training data, it counts how many occurrences there are for each goal predicate value. For example:
> (tally-tdata loan-data-small) Value: ((yes 1) (no 3))
The order of the goal predicate values is not guaranteed.
Note that this procedure will also work with weighted training data
Returns a classification of the example using the given decision-tree. If the example has an attribute value not found in the decision tree, the default-value is returned.
For example:
(define loan-dtree-example '(income (high yes) (low (house (rent no) (own yes))))) > (classify '(rent late high good) loan-dtree-example loan-names 'No) ;Value: yes > (classify '(rent late medium good) loan-dtree-example loan-names 'No) ;Value: no
Takes a training data set, strips the goal attribute out to create a list of examples which it then classifies using the given decision tree. The classifications returned by the decision tree are compared against the given goal attribute for each example.
For example:
> (test-dtree loan-dtree-example loan-data-small loan-names) The example: (own late high good) was classified as: yes WRONG: the correct classification is no. The example: (own on-time medium good) was classified as: default-answer WRONG: the correct classification is yes. The example: (own on-time high good) was classified as: yes WRONG: the correct classification is no. The example: (own on-time high good) was classified as: yes WRONG: the correct classification is no. Out of 4 examples, 0 were correctly classified.
Takes the logarithm of x to base 2. Signals an error if you try to take the logarithm of a nonpositive number.
For example:
(log2 8) ;Value: 3.
Takes zero or more arguments and prints them to the screen.
The support code will allow you to weight training examples to accomplish this. Whereas a regular training example is of the form:
(Yes (Own Late Low Bad))A weighted training example has a number inserted as the first element of this list:
(1 Yes (Own Late Low Bad))The support code procedures split-tdata and tally-tdata will handle weighted training data.
(define weighted-loan-data-small '((0.5 No (Own Late High Good)) (0.3 Yes (Own On-time Medium Good)) (0.3 No (Own On-time High Good)) (1 No (Own On-time High Good)))) (split-tdata weighted-loan-data-small loan-names 'bills) ;Value: ((on-time ((1 no (own on-time high good)) ; (.3 no (own on-time high good)) ; (.3 yes (own on-time medium good)))) ; (late ((.5 no (own late high good))))) (tally-tdata weighted-loan-data-small) ;Value: ((yes .3) (no 1.8))Please note that all training data must be the same, i.e. all unweighted training examples or all weighted training examples.
Sorts the given list. The second argument must be a procedure of two arguments that returns true if the first argument should come before the second argument. This procedure should not return #t if the arguments are "equal". (See the MIT/GNU Scheme reference manual (Section 7.9) for more information.)
For example:
(sort '(43 8 2 88 9) <) ;Value: (2 8 9 43 88)
The second argument a-list must be an association list. This is a list where each element is an association. Each association is a list where the first element is the key. The assoc procedure takes the given key and finds the association in a-list that matches (as per equal?) the key.
Note that the splits returned by split-tdata have the form of an associaton list.
For example:
(assoc 'red '((blue 7) (red 6) (green 5))) ;Value: (red 6)
Deletes all copies of the given element el from the list lst. The equal? predicate is used to determine if the element matches el. Actually, this doesn't really delete elements from the list — it makes a copy of the list without any element equal? to el.
For example:
(delete 4 '(1 3 5 4 2.5 4.0 9 4)) ;Value: (1 3 5 2.5 4. 9)Note that the value 4.0 was not removed because it is not equal? to 4.
(create-binary-discretization c-attribute training-data attribute-names)and
(create-multi-discretization c-attribute training-data attribute-names)which will decide upon threshold value(s) for the continuous valued attribute c-attribute.
For example, suppose we have the following data set:
(define continuous-names '(age weight height)) (define continuous-tdata '((republican (23 135 177)) (democrat (42 225 190)) (democrat (67 170 150)) (republican (18 155 187)) (democrat (32 190 195))))I wrote a "fake" version of create-binary-discretization that doesn't do the proper calculation, but it does return what it should — a number. For example:
(create-binary-discretization 'age continuous-tdata continuous-names) For the attribute "age", max=67, min=18, so threshold is 42.5 ;Value: 42.5I have my procedure print out a little information about the threshold, as you can see.
The create-multi-discretization procedure is similar, but it must return two thresholds so that the continuous-valued attribute is turned into a discrete attribute with three values. These two thresholds must be returned in a list of two numbers with the lower number first.
For example, my "fake" version of this procedure does the following:
(create-multi-discretization 'height continuous-tdata continuous-names) For the attribute "height", max=195, min=150, so thresholds are 165. and 180. ;Value: (165. 180.)Again, I have my procedure print out a little information on the threshold.
Now, you usually don't want to just discretize one continuous-valued attribute — you want to discretize all of them! I have therefore provided the procedure discretize-tdata (available in the new version of a6code.com) which will apply the create-X-discretization procedure to all the continuous valued attributes. Of course, you have to specify which attributes are continuous-valued, and you must also specify what symbols you want to use for the discrete values.
For example:
(discretize-tdata create-binary-discretization '(weight age height) '((light heavy) ; symbols for weight (young old) ; symbols for age (short tall)) ; symbols for height continuous-tdata continuous-names) For the attribute "weight", max=225, min=135, so threshold is 180. For the attribute "age", max=67, min=18, so threshold is 42.5 For the attribute "height", max=195, min=150, so threshold is 172.5 ;Value: ((republican (young light tall)) ; (democrat (young heavy tall)) ; (democrat (old light short)) ; (republican (young light tall)) ; (democrat (young heavy tall)))And:
(discretize-tdata create-multi-discretization '(weight age height) '((light medium heavy) ; symbols for weight (young middle-aged old) ; symbols for age (short average tall)) ; symbols for height continuous-tdata continuous-names) For the attribute "weight", max=225, min=135, so thresholds are 165. and 195. For the attribute "age", max=67, min=18, so thresholds are 34.3 and 50.7. For the attribute "height", max=195, min=150, so thresholds are 165. and 180. ;Value: ((republican (young light average)) ; (democrat (middle-aged heavy tall)) ; (democrat (old medium short)) ; (republican (young light tall)) ; (democrat (young medium tall)))
Here are the "fake" discretization procedures that I used above:
(define (create-binary-discretization c-attribute training-data attribute-names) ; here's my fake example --- it finds the max and the min values ; and takes the average. (let* ((data-pos (- (length attribute-names) (length (member c-attribute attribute-names)))) (attr-values (map (lambda (x) (list-ref (second x) data-pos)) training-data)) (minval (apply min attr-values)) (maxval (apply max attr-values)) (t (/ (+ minval maxval) 2.0))) (print "For the attribute \"" c-attribute "\", max=" maxval ", min=" minval ", so threshold is " t "\n") t)) (define (create-multi-discretization c-attribute training-data attribute-names) ; here's my fake example --- it finds the max and the min values ; and picks thresholds 1/3 and 2/3 in between (let* ((data-pos (- (length attribute-names) (length (member c-attribute attribute-names)))) (attr-values (map (lambda (x) (list-ref (second x) data-pos)) training-data)) (minval (apply min attr-values)) (maxval (apply max attr-values)) (diff (- maxval minval)) (t1 (+ (/ diff 3.0) minval)) (t2 (+ (/ diff 3.0) t1))) (print "For the attribute \"" c-attribute "\", max=" maxval ", min=" minval ", so thresholds are " t1 " and " t2 "\n") (list t1 t2)))
This web tester runs a number of tests on your binary-discretization procedure, some with and some without missing continuous-valued attributes.
This web tester only runs tests for 10 (out of 20) points for this problem. The remaining points will be assigned after we read your writeup on what you did for this procedure. It makes sure that the thresholds you return actually discretize the attribute into three values (not just one or two), and it tests whether your discretization provides at least as much information gain as a binary discretization.
The Policy for Electronic Submission is the same as for previous assignments. The syntax-checker is the same as for the last assignment.