; this is an example of using your create-exploring-rl-strategy
; procedure with temporal differencing (the create-td-learning
; procedure) in order to simultaneously learn the model of the world
; (transition probabilities and rewards) and the utilities.
;
(load "a7example")
;Loading "a7example.scm"
;Loading "a7code.com" -- done
Assignment 7 support code  Version 1.2
CSCI 4150 Introduction to Artificial Intelligence, Fall 2005
Copyright (c) 1999--2005 by Wesley H. Huang.  All rights reserved.
 -- done
;Value: player-bob

(load "solutions7")
;Loading "solutions7.scm" -- done

; first call the init-tables procedure in the a7example.scm file
(init-tables)
;Value: done

; in my solutions7.scm file, I have the following procedure defined:
;
;   (define (td-player)
;     (list "TD-player"
;           (create-exploring-rl-strategy R+ Ne)
;           (create-td-learning alpha-fn)))
;
; where I have used some numeric values for Ne and R+ and have defined a
; specific alpha-fn function.
;
; also in my file, I have written the create-exploring-rl-strategy
; procedure (problem 3) and the create-td-learning procedure (problem 4).

; first I'll play 10 hands with all the output turned on...
;
(play-match 10 (td-player))

****************
**** HAND 1
****************
Dealer is dealt (2-D) J-D = 12
Player TD-player is dealt A-H A-S = 12
Checking with TD-player
TD-player doubles down and gets K-H for a new total of 12
Checking with Dealer
Dealer takes a hit and gets 3-C for a new total of 15
Dealer takes a hit and gets K-D for a new total of 25
Dealer BUSTS!
Dealer finishes with 25 --- busted!
Player TD-player wins.
[LEARNING: fs=0, action=double-down, ts=4 (terminal), reward=2.]

****************
**** HAND 2
****************
Dealer is dealt (4-H) K-S = 14
Player TD-player is dealt A-D 8-C = 19
Checking with TD-player
TD-player doubles down and gets 5-S for a new total of 14
Checking with Dealer
Dealer takes a hit and gets 2-D for a new total of 16
Dealer takes a hit and gets 9-C for a new total of 25
Dealer BUSTS!
Dealer finishes with 25 --- busted!
Player TD-player wins.
[LEARNING: fs=1, action=double-down, ts=5 (terminal), reward=2.]

****************
**** HAND 3
****************
Dealer is dealt (10-H) 6-H = 16
Player TD-player is dealt J-D 10-H = 20
Checking with TD-player
TD-player takes a hit and gets 10-S for a new total of 30
TD-player BUSTS!
[LEARNING: fs=1, action=hit, ts=6 (terminal), reward=-1.]
Checking with Dealer
Dealer takes a hit and gets 6-D for a new total of 22
Dealer BUSTS!
Dealer finishes with 22 --- busted!
Player TD-player busted first and therefore loses.

****************
**** HAND 4
****************
Dealer is dealt (3-H) J-S = 13
Player TD-player is dealt 6-H 4-S = 10
Checking with TD-player
TD-player doubles down and gets 7-C for a new total of 17
Checking with Dealer
Dealer takes a hit and gets 2-S for a new total of 15
Dealer takes a hit and gets 2-C for a new total of 17
Dealer stands.
Dealer finishes with 17
Player TD-player finishes with 17 --- a push (i.e. tie)
[LEARNING: fs=0, action=double-down, ts=4 (terminal), reward=0.]

****************
**** HAND 5
****************
Dealer is dealt (K-D) 10-H = 20
Player TD-player is dealt 8-H 2-S = 10
Checking with TD-player
TD-player doubles down and gets 8-S for a new total of 18
Checking with Dealer
Dealer stands.
Dealer finishes with 20
Player TD-player finishes with 18 --- you lose... try again next time.
[LEARNING: fs=0, action=double-down, ts=4 (terminal), reward=-2.]

****************
**** HAND 6
****************
Dealer is dealt (8-C) 4-S = 12
Player TD-player is dealt 2-C 3-D = 5
Checking with TD-player
TD-player doubles down and gets 4-S for a new total of 9
Checking with Dealer
Dealer takes a hit and gets 9-S for a new total of 21
Dealer stands.
Dealer finishes with 21
Player TD-player finishes with 9 --- you lose... try again next time.
[LEARNING: fs=0, action=double-down, ts=4 (terminal), reward=-2.]
****************
**** HAND 7
****************
Dealer is dealt (J-H) A-S = 21
Player TD-player is dealt Q-S 2-D = 12
Dealer has blackjack.  The dealer wins!

****************
**** HAND 8
****************
Dealer is dealt (Q-C) 4-C = 14
Player TD-player is dealt 7-D Q-D = 17
Checking with TD-player
TD-player stands.
Checking with Dealer
Dealer takes a hit and gets Q-D for a new total of 24
Dealer BUSTS!
Dealer finishes with 24 --- busted!
Player TD-player wins.
[LEARNING: fs=1, action=stand, ts=3 (terminal), reward=1.]

****************
**** HAND 9
****************
Dealer is dealt (J-D) 9-C = 19
Player TD-player is dealt Q-S K-C = 20
Checking with TD-player
TD-player takes a hit and gets 5-D for a new total of 25
TD-player BUSTS!
[LEARNING: fs=1, action=hit, ts=6 (terminal), reward=-1.]
Checking with Dealer
Dealer stands.
Dealer finishes with 19
Player TD-player finishes with 25 --- busted!

****************
**** HAND 10
****************
Dealer is dealt (7-C) 7-S = 14
Player TD-player is dealt J-D A-C = 21
Checking with TD-player
TD-player has blackjack!
Checking with Dealer
Dealer takes a hit and gets K-C for a new total of 24
Dealer BUSTS!
Dealer finishes with 24 --- busted!
Player TD-player has blackjack!

After 10 hands, Player TD-player has a score of -.5
;Value 2: (-.5 15.)

; we can now look at the tables.  the estimates of the transition
; probabilities and rewards learned from these 10 hands won't be very
; good...
;
(print-tables)

TRANSITION PROBABILITIES - state: action   to-state: probability ...
0: hit
0: stand
0: double  4: 1.000
1: hit     6: 1.000
1: stand   3: 1.000
1: double  5: 1.000

REWARDS printed as:  state-num: ave-reward (visits)
0:  0.000 (4)    2:  0.000 (0)    4: -0.500 (4)    6: -1.000 (2)
1:  0.000 (4)    3:  1.000 (1)    5:  2.000 (1)

UTILITIES (brackets indicate average reward for terminal states)
0: -0.475    2: [ 0.000]    4: [-0.500]    6: [-1.000]
1: -0.908    3: [ 1.000]    5: [ 2.000]
;Value: #t

; turn off printing before running more hands...
;
(define print-narration #f)
;Value: print-narration

(define print-learning #f)
;Value: print-learning

(play-match 100 (td-player))
**** HAND 10
**** HAND 20
**** HAND 30
**** HAND 40
**** HAND 50
**** HAND 60
**** HAND 70
**** HAND 80
**** HAND 90
**** HAND 100
;Value 3: (-30.5 115.)

; now the transition probabilities and rewards (and the utilities)
; should be better.
;
(print-tables)

TRANSITION PROBABILITIES - state: action   to-state: probability ...
0: hit     0: 0.220,  1: 0.580,  6: 0.200
0: stand   2: 1.000
0: double  4: 1.000
1: hit     1: 0.400,  6: 0.600
1: stand   3: 1.000
1: double  5: 1.000

REWARDS printed as:  state-num: ave-reward (visits)
0:  0.000 (70)   2: -0.600 (10)   4: -0.600 (10)   6: -1.000 (16)
1:  0.000 (74)   3:  0.037 (54)   5: -0.600 (10)

UTILITIES (brackets indicate average reward for terminal states)
0: -0.568    2: [-0.600]    4: [-0.600]    6: [-1.000]
1:  0.046    3: [ 0.037]    5: [-0.600]
;Value: #t
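; the solutions to problems 3 and 4 (create-exploring-rl-strategy and
; create-td-learning) are deliberately not shown in this transcript.
; a minimal, self-contained sketch of the two core computations they
; rely on might look like the following.  the names make-exploration-fn
; and td-update, and the sample alpha-fn, are illustrative only; they
; are NOT part of the a7 support-code API.

```scheme
;; Sketch only: illustrative names and values, not the a7 support code.

;; Exploration function f(u, n): remain optimistic (return the
;; optimistic reward estimate R+) about anything tried fewer than Ne
;; times; otherwise trust the current utility estimate u.
(define (make-exploration-fn R+ Ne)
  (lambda (u n)
    (if (< n Ne) R+ u)))

;; Undiscounted TD(0) update for an episodic game like blackjack:
;;   U(s) <- U(s) + alpha * (r + U(s') - U(s))
;; where alpha would come from alpha-fn applied to the visit count.
(define (td-update u-s alpha r u-s2)
  (+ u-s (* alpha (- (+ r u-s2) u-s))))

;; One possible alpha-fn: a learning rate that decays with the
;; visit count n so later updates perturb the estimate less.
(define (alpha-fn n)
  (/ 60 (+ 59 n)))
```

; for example, ((make-exploration-fn 2 5) -0.5 3) returns 2 because the
; state has been visited only 3 < 5 times, so the strategy keeps exploring.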