; although this purports to be an example of using your
; basic-rl-strategy procedure from problem 2, it also uses the
; temporal difference learning from problem 4.
; this is necessary because the basic-rl-strategy needs utility values
; to work.  I could have set them manually, of course (by using the
; set-utility-element procedure or by saving the tables, editing the
; file, and then loading it).
;
; what I do below is in three main steps:
;
;  1) "learn" transition probabilities and rewards by playing a bunch
;     of hands using a random player
;
;  2) learn utilities using temporal differencing
;
;  3) play blackjack using those learned utilities with the
;     basic-rl-strategy procedure
;

(load "a7example")
;Loading "a7example.scm"
;Loading "a7code.com" -- done
Assignment 7 support code    Version 1.2
CSCI 4150 Introduction to Artificial Intelligence, Fall 2005
Copyright (c) 1999--2005 by Wesley H. Huang.  All rights reserved.
 -- done
;Value: player-bob

(load "solutions7")
;Loading "solutions7.scm" -- done
;Value: print-top-differences

; first, I'm going to use the random player (player-bob) just to learn
; transition probabilities and rewards.
;
; turn off printing for this...
;
(define print-narration #f)
;Value: print-narration

(define print-learning #f)
;Value: print-learning

; only print progress messages every 100 hands
(define print-match-progress 100)
;Value: print-match-progress

; call the init-tables procedure in a7example.scm
(init-tables)
;Value: done

(play-match 1000 (player-bob))
**** HAND 100
**** HAND 200
**** HAND 300
**** HAND 400
**** HAND 500
**** HAND 600
**** HAND 700
**** HAND 800
**** HAND 900
**** HAND 1000
;Value 1: (-465. 1310.)

(print-tables)
TRANSITION PROBABILITIES
 - state: action   to-state: probability ...
  0: hit      0: 0.202,  1: 0.567,  6: 0.231
  0: stand    2: 1.000
  0: double   4: 1.000
  1: hit      0: 0.015,  1: 0.317,  6: 0.668
  1: stand    3: 1.000
  1: double   5: 1.000
REWARDS   printed as:  state-num: ave-reward (visits)
  0:  0.000 (534)   2: -0.430 (179)   4: -0.503 (147)   6: -1.000 (185)
  1:  0.000 (604)   3:  0.076 (236)   5: -1.129 (163)
UTILITIES   (brackets indicate average reward for terminal states)
  0:  0.000    2: [-0.430]   4: [-0.503]   6: [-1.000]
  1:  0.000    3: [ 0.076]   5: [-1.129]
;Value: #t

; I can save these transition probabilities and rewards (and
; utilities) for future use:
;
(save-tables "a7p2tables.scm")
;Value: done

; I'm going to turn off the table updates.  this will allow me to
; learn utilities without any changes to the transition probabilities
; and to the rewards.  (You do not have to do this, but I am doing so
; here just to illustrate the different ways you can do this.)
(define enable-table-updates #f)
;Value: enable-table-updates

; here's a little function to create a player that selects actions
; randomly but will use temporal differencing to learn the utilities
;
(define (rl-learner)
  (list "Alice" random-strategy (create-td-learning alpha-fn)))
;Value: rl-learner
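; aside (not part of the transcript): the temporal-difference update
; that such a learner applies after each observed transition from
; state s to state s' has the general form
;
;     U(s)  <-  U(s) + alpha * ( R + U(s') - U(s) )
;
; where R is the observed reward and alpha is a step size, typically
; computed as (alpha-fn N) from the number of visits N to state s.
; the sketch below only illustrates that update; it is not the
; problem 4 solution.  rl-utility and set-rl-utility! are
; hypothetical accessors standing in for whatever table interface the
; support code provides (e.g. the set-utility-element procedure
; mentioned above), and the actual alpha-fn passed to
; create-td-learning is presumably defined in the files loaded
; earlier.
;
(define (td-update! state reward next-state alpha)
  ;; move U(state) a fraction alpha of the way toward the one-step
  ;; sample estimate  reward + U(next-state)
  (let ((u      (rl-utility state))        ; hypothetical accessor
        (u-next (rl-utility next-state)))  ; hypothetical accessor
    (set-rl-utility! state                 ; hypothetical mutator
                     (+ u (* alpha (- (+ reward u-next) u))))))

; one plausible choice for a step-size schedule is something that
; decays with the visit count n, so that early samples move the
; estimate quickly while later estimates settle down:
(define (sample-alpha-fn n) (/ 60. (+ 59. n)))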
(play-match 1000 (rl-learner))
**** HAND 100
**** HAND 200
**** HAND 300
**** HAND 400
**** HAND 500
**** HAND 600
**** HAND 700
**** HAND 900
**** HAND 1000
;Value 2: (-488. 1315.)

; the random player didn't do so well - it lost 488 out of total bets
; of 1315!  that's ok, we just wanted to learn the utilities anyway...

; let's look at the utilities
(print-utilities)
UTILITIES   (brackets indicate average reward for terminal states)
  0: -0.503    2: [-0.430]   4: [-0.503]   6: [-1.000]
  1: -0.403    3: [ 0.076]   5: [-1.129]
;Value: #t

; now, I'll just use a player that makes decisions based on these
; utilities (and the rewards).  this will use the basic-rl-strategy
; procedure that you'll write for problem 2.
;
(define (utility-player)
  (list "Frank" basic-rl-strategy non-learning-procedure))
;Value: utility-player

(play-match 1000 (utility-player))
**** HAND 100
**** HAND 200
**** HAND 300
**** HAND 400
**** HAND 500
**** HAND 600
**** HAND 700
**** HAND 800
**** HAND 900
**** HAND 1000
;Value 3: (-156. 1000.)

; this does much better: it loses only 156 out of a total of 1000 bets.
; evidently, the best actions don't include "double down"

; let's see what the policy is (for the nonterminal states 0 and 1)
;
(map (lambda (rl-state)
       (basic-rl-strategy rl-state '(hit stand double-down)))
     '(0 1))
;Value 5: (stand stand)
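; for reference, here is a minimal sketch of the idea behind such a
; utility-based strategy: from the current reinforcement-learning
; state, pick the action whose expected successor utility
;
;     sum over s' of  P(s' | state, action) * U(s')
;
; is largest.  this is only an illustration of the computation, not
; the actual basic-rl-strategy for problem 2: rl-transition-probs
; (assumed to return a list of (to-state . probability) pairs) and
; rl-utility are hypothetical stand-ins for the support code's table
; interface.
;
(define (sketch-rl-strategy rl-state actions)
  ;; expected utility of taking ACTION from RL-STATE
  (define (expected-utility action)
    (apply +
           (map (lambda (entry)                      ; entry = (to-state . prob)
                  (* (cdr entry) (rl-utility (car entry))))
                (rl-transition-probs rl-state action))))
  ;; return the action with the largest expected utility
  (let loop ((best       (car actions))
             (best-value (expected-utility (car actions)))
             (rest       (cdr actions)))
    (if (null? rest)
        best
        (let ((value (expected-utility (car rest))))
          (if (> value best-value)
              (loop (car rest) value (cdr rest))
              (loop best best-value (cdr rest)))))))

; with the tables printed earlier, this computation favors "stand" in
; both nonterminal states (for example, from state 0 hitting is worth
; about 0.202*(-0.503) + 0.567*(-0.403) + 0.231*(-1.000) = -0.56,
; versus -0.430 for standing), which is consistent with the
; (stand stand) policy shown above.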