; this is an example of using your create-exploring-rl-strategy
; procedure with temporal differencing (the create-td-learning
; procedure) in order to simultaneously learn the model of the world
; (transition probabilities and rewards) and the utilities.
;
(load "a7example")
;Loading "a7example.scm"
;Loading "a7code.com" -- done
Assignment 7 support code  Version 1.2
CSCI 4150 Introduction to Artificial Intelligence, Fall 2005
Copyright (c) 1999--2005 by Wesley H. Huang.  All rights reserved.
 -- done
;Value: player-bob

(load "solutions7")
;Loading "solutions7.scm" -- done

; first call the init-tables procedure in the a7example.scm file
(init-tables)
;Value: done

; in my solutions7.scm file, I have the following procedure defined:
;
;   (define (td-player)
;     (list "TD-player"
;           (create-exploring-rl-strategy R+ Ne)
;           (create-td-learning alpha-fn)))
;
; where I have used some numeric values for Ne and R+ and have defined a
; specific alpha-fn function.
;
; also in my file, I have written the create-exploring-rl-strategy
; procedure (problem 3) and the create-td-learning procedure (problem 4).

; first I'll play 10 hands with all the output turned on...
;
(play-match 10 (td-player))

****************
**** HAND 1
****************
Dealer is dealt (2-D) J-D = 12
Player TD-player is dealt A-H A-S = 12
Checking with TD-player
TD-player doubles down and gets K-H for a new total of 12
Checking with Dealer
Dealer takes a hit and gets 3-C for a new total of 15
Dealer takes a hit and gets K-D for a new total of 25
Dealer BUSTS!
Dealer finishes with 25 --- busted!
Player TD-player wins.
[LEARNING: fs=0, action=double-down, ts=4 (terminal), reward=2.]

****************
**** HAND 2
****************
Dealer is dealt (4-H) K-S = 14
Player TD-player is dealt A-D 8-C = 19
Checking with TD-player
TD-player doubles down and gets 5-S for a new total of 14
Checking with Dealer
Dealer takes a hit and gets 2-D for a new total of 16
Dealer takes a hit and gets 9-C for a new total of 25
Dealer BUSTS!
Dealer finishes with 25 --- busted!
Player TD-player wins.
[LEARNING: fs=1, action=double-down, ts=5 (terminal), reward=2.]

****************
**** HAND 3
****************
Dealer is dealt (10-H) 6-H = 16
Player TD-player is dealt J-D 10-H = 20
Checking with TD-player
TD-player takes a hit and gets 10-S for a new total of 30
TD-player BUSTS!
[LEARNING: fs=1, action=hit, ts=6 (terminal), reward=-1.]
Checking with Dealer
Dealer takes a hit and gets 6-D for a new total of 22
Dealer BUSTS!
Dealer finishes with 22 --- busted!
Player TD-player busted first and therefore loses.

****************
**** HAND 4
****************
Dealer is dealt (3-H) J-S = 13
Player TD-player is dealt 6-H 4-S = 10
Checking with TD-player
TD-player doubles down and gets 7-C for a new total of 17
Checking with Dealer
Dealer takes a hit and gets 2-S for a new total of 15
Dealer takes a hit and gets 2-C for a new total of 17
Dealer stands.
Dealer finishes with 17
Player TD-player finishes with 17 --- a push (i.e. tie)
[LEARNING: fs=0, action=double-down, ts=4 (terminal), reward=0.]

****************
**** HAND 5
****************
Dealer is dealt (K-D) 10-H = 20
Player TD-player is dealt 8-H 2-S = 10
Checking with TD-player
TD-player doubles down and gets 8-S for a new total of 18
Checking with Dealer
Dealer stands.
Dealer finishes with 20
Player TD-player finishes with 18 --- you lose... try again next time.
[LEARNING: fs=0, action=double-down, ts=4 (terminal), reward=-2.]

****************
**** HAND 6
****************
Dealer is dealt (8-C) 4-S = 12
Player TD-player is dealt 2-C 3-D = 5
Checking with TD-player
TD-player doubles down and gets 4-S for a new total of 9
Checking with Dealer
Dealer takes a hit and gets 9-S for a new total of 21
Dealer stands.
Dealer finishes with 21
Player TD-player finishes with 9 --- you lose... try again next time.
[LEARNING: fs=0, action=double-down, ts=4 (terminal), reward=-2.]
****************
**** HAND 7
****************
Dealer is dealt (J-H) A-S = 21
Player TD-player is dealt Q-S 2-D = 12
Dealer has blackjack.  The dealer wins!

****************
**** HAND 8
****************
Dealer is dealt (Q-C) 4-C = 14
Player TD-player is dealt 7-D Q-D = 17
Checking with TD-player
TD-player stands.
Checking with Dealer
Dealer takes a hit and gets Q-D for a new total of 24
Dealer BUSTS!
Dealer finishes with 24 --- busted!
Player TD-player wins.
[LEARNING: fs=1, action=stand, ts=3 (terminal), reward=1.]

****************
**** HAND 9
****************
Dealer is dealt (J-D) 9-C = 19
Player TD-player is dealt Q-S K-C = 20
Checking with TD-player
TD-player takes a hit and gets 5-D for a new total of 25
TD-player BUSTS!
[LEARNING: fs=1, action=hit, ts=6 (terminal), reward=-1.]
Checking with Dealer
Dealer stands.
Dealer finishes with 19
Player TD-player finishes with 25 --- busted!

****************
**** HAND 10
****************
Dealer is dealt (7-C) 7-S = 14
Player TD-player is dealt J-D A-C = 21
Checking with TD-player
TD-player has blackjack!
Checking with Dealer
Dealer takes a hit and gets K-C for a new total of 24
Dealer BUSTS!
Dealer finishes with 24 --- busted!
Player TD-player has blackjack!

After 10 hands, Player TD-player has a score of -.5
;Value 2: (-.5 15.)

; we can now look at the tables.  the estimates of the transition
; probabilities and rewards learned from these 10 hands won't be very
; good...
;
(print-tables)

TRANSITION PROBABILITIES - state: action   to-state: probability ...
0: hit
0: stand
0: double  4: 1.000
1: hit     6: 1.000
1: stand   3: 1.000
1: double  5: 1.000

REWARDS printed as:  state-num: ave-reward (visits)
0:  0.000 (4)    2:  0.000 (0)    4: -0.500 (4)    6: -1.000 (2)
1:  0.000 (4)    3:  1.000 (1)    5:  2.000 (1)

UTILITIES (brackets indicate average reward for terminal states)
0: -0.475    2: [ 0.000]    4: [-0.500]    6: [-1.000]
1: -0.908    3: [ 1.000]    5: [ 2.000]
;Value: #t

; turn off printing before running more hands...
;
(define print-narration #f)
;Value: print-narration

(define print-learning #f)
;Value: print-learning

(play-match 100 (td-player))
**** HAND 10
**** HAND 20
**** HAND 30
**** HAND 40
**** HAND 50
**** HAND 60
**** HAND 70
**** HAND 80
**** HAND 90
**** HAND 100
;Value 3: (-30.5 115.)

; now the transition probabilities and rewards (and the utilities)
; should be better.
;
(print-tables)

TRANSITION PROBABILITIES - state: action   to-state: probability ...
0: hit     0: 0.220,  1: 0.580,  6: 0.200
0: stand   2: 1.000
0: double  4: 1.000
1: hit     1: 0.400,  6: 0.600
1: stand   3: 1.000
1: double  5: 1.000

REWARDS printed as:  state-num: ave-reward (visits)
0:  0.000 (70)   2: -0.600 (10)   4: -0.600 (10)   6: -1.000 (16)
1:  0.000 (74)   3:  0.037 (54)   5: -0.600 (10)

UTILITIES (brackets indicate average reward for terminal states)
0: -0.568    2: [-0.600]    4: [-0.600]    6: [-1.000]
1:  0.046    3: [ 0.037]    5: [-0.600]
;Value: #t
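; the solutions to problems 3 and 4 (create-exploring-rl-strategy and
; create-td-learning) are deliberately not shown in this transcript.
; a minimal, self-contained sketch of the two core computations they
; rely on might look like the following.  the names make-exploration-fn
; and td-update, and the sample alpha-fn, are illustrative only; they
; are NOT part of the a7 support-code API.

```scheme
;; Sketch only: illustrative names and values, not the a7 support code.

;; Exploration function f(u, n): remain optimistic (return the
;; optimistic reward estimate R+) about anything tried fewer than Ne
;; times; otherwise trust the current utility estimate u.
(define (make-exploration-fn R+ Ne)
  (lambda (u n)
    (if (< n Ne) R+ u)))

;; Undiscounted TD(0) update for an episodic game like blackjack:
;;   U(s) <- U(s) + alpha * (r + U(s') - U(s))
;; where alpha would come from alpha-fn applied to the visit count.
(define (td-update u-s alpha r u-s2)
  (+ u-s (* alpha (- (+ r u-s2) u-s))))

;; One possible alpha-fn: a learning rate that decays with the
;; visit count n so later updates perturb the estimate less.
(define (alpha-fn n)
  (/ 60 (+ 59 n)))
```

; for example, ((make-exploration-fn 2 5) -0.5 3) returns 2 because the
; state has been visited only 3 < 5 times, so the strategy keeps exploring.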