; although this purports to be an example of using your
; basic-rl-strategy procedure from problem 2, it also uses the
; temporal difference learning from problem 4.
; this is necessary because the basic-rl-strategy needs utility values
; to work.  I could have set them manually, of course (by using the
; set-utility-element procedure or by saving the tables, editing the
; file, and then loading it).
;
; what I do below is in three main steps:
;
;  1) "learn" transition probabilities and rewards by playing a bunch
;     of hands using a random player
;
;  2) learn utilities using temporal differencing
;
;  3) play blackjack using those learned utilities with the
;     basic-rl-strategy procedure
;

(load "a7example")
;Loading "a7example.scm"
;Loading "a7code.com" -- done
Assignment 7 support code    Version 1.2
CSCI 4150 Introduction to Artificial Intelligence, Fall 2005
Copyright (c) 1999--2005 by Wesley H. Huang.  All rights reserved.
 -- done
;Value: player-bob

(load "solutions7")
;Loading "solutions7.scm" -- done
;Value: print-top-differences

; first, I'm going to use the random player (player-bob) just to learn
; transition probabilities and rewards.
;
; turn off printing for this...
;
(define print-narration #f)
;Value: print-narration

(define print-learning #f)
;Value: print-learning

; only print progress messages every 100 hands
(define print-match-progress 100)
;Value: print-match-progress

; call the init-tables procedure in a7example.scm
(init-tables)
;Value: done

(play-match 1000 (player-bob))
**** HAND 100
**** HAND 200
**** HAND 300
**** HAND 400
**** HAND 500
**** HAND 600
**** HAND 700
**** HAND 800
**** HAND 900
**** HAND 1000
;Value 1: (-465. 1310.)

(print-tables)
TRANSITION PROBABILITIES
 - state: action   to-state: probability ...
  0: hit      0: 0.202,  1: 0.567,  6: 0.231
  0: stand    2: 1.000
  0: double   4: 1.000
  1: hit      0: 0.015,  1: 0.317,  6: 0.668
  1: stand    3: 1.000
  1: double   5: 1.000
REWARDS   printed as:  state-num: ave-reward (visits)
  0:  0.000 (534)   2: -0.430 (179)   4: -0.503 (147)   6: -1.000 (185)
  1:  0.000 (604)   3:  0.076 (236)   5: -1.129 (163)
UTILITIES   (brackets indicate average reward for terminal states)
  0:  0.000    2: [-0.430]   4: [-0.503]   6: [-1.000]
  1:  0.000    3: [ 0.076]   5: [-1.129]
;Value: #t

; I can save these transition probabilities and rewards (and
; utilities) for future use:
;
(save-tables "a7p2tables.scm")
;Value: done

; I'm going to turn off the table updates.  this will allow me to
; learn utilities without any changes to the transition probabilities
; and to the rewards.  (You do not have to do this, but I am doing so
; here just to illustrate the different ways you can do this.)
(define enable-table-updates #f)
;Value: enable-table-updates

; here's a little function to create a player that selects actions
; randomly but will use temporal differencing to learn the utilities
;
(define (rl-learner)
  (list "Alice" random-strategy (create-td-learning alpha-fn)))
;Value: rl-learner
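; aside (not part of the transcript): the temporal-difference update
; that such a learner applies after each observed transition from
; state s to state s' has the general form
;
;     U(s)  <-  U(s) + alpha * ( R + U(s') - U(s) )
;
; where R is the observed reward and alpha is a step size, typically
; computed as (alpha-fn N) from the number of visits N to state s.
; the sketch below only illustrates that update; it is not the
; problem 4 solution.  rl-utility and set-rl-utility! are
; hypothetical accessors standing in for whatever table interface the
; support code provides (e.g. the set-utility-element procedure
; mentioned above), and the actual alpha-fn passed to
; create-td-learning is presumably defined in the files loaded
; earlier.
;
(define (td-update! state reward next-state alpha)
  ;; move U(state) a fraction alpha of the way toward the one-step
  ;; sample estimate  reward + U(next-state)
  (let ((u      (rl-utility state))        ; hypothetical accessor
        (u-next (rl-utility next-state)))  ; hypothetical accessor
    (set-rl-utility! state                 ; hypothetical mutator
                     (+ u (* alpha (- (+ reward u-next) u))))))

; one plausible choice for a step-size schedule is something that
; decays with the visit count n, so that early samples move the
; estimate quickly while later estimates settle down:
(define (sample-alpha-fn n) (/ 60. (+ 59. n)))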
(play-match 1000 (rl-learner))
**** HAND 100
**** HAND 200
**** HAND 300
**** HAND 400
**** HAND 500
**** HAND 600
**** HAND 700
**** HAND 900
**** HAND 1000
;Value 2: (-488. 1315.)

; the random player didn't do so well - it lost 488 out of total bets
; of 1315!  that's ok, we just wanted to learn the utilities anyway...

; let's look at the utilities
(print-utilities)
UTILITIES   (brackets indicate average reward for terminal states)
  0: -0.503    2: [-0.430]   4: [-0.503]   6: [-1.000]
  1: -0.403    3: [ 0.076]   5: [-1.129]
;Value: #t

; now, I'll just use a player that makes decisions based on these
; utilities (and the rewards).  this will use the basic-rl-strategy
; procedure that you'll write for problem 2.
;
(define (utility-player)
  (list "Frank" basic-rl-strategy non-learning-procedure))
;Value: utility-player

(play-match 1000 (utility-player))
**** HAND 100
**** HAND 200
**** HAND 300
**** HAND 400
**** HAND 500
**** HAND 600
**** HAND 700
**** HAND 800
**** HAND 900
**** HAND 1000
;Value 3: (-156. 1000.)

; this does much better: it loses only 156 out of a total of 1000 bets.
; evidently, the best actions don't include "double down"

; let's see what the policy is (for the nonterminal states 0 and 1)
;
(map (lambda (rl-state)
       (basic-rl-strategy rl-state '(hit stand double-down)))
     '(0 1))
;Value 5: (stand stand)
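; for reference, here is a minimal sketch of the idea behind such a
; utility-based strategy: from the current reinforcement-learning
; state, pick the action whose expected successor utility
;
;     sum over s' of  P(s' | state, action) * U(s')
;
; is largest.  this is only an illustration of the computation, not
; the actual basic-rl-strategy for problem 2: rl-transition-probs
; (assumed to return a list of (to-state . probability) pairs) and
; rl-utility are hypothetical stand-ins for the support code's table
; interface.
;
(define (sketch-rl-strategy rl-state actions)
  ;; expected utility of taking ACTION from RL-STATE
  (define (expected-utility action)
    (apply +
           (map (lambda (entry)                      ; entry = (to-state . prob)
                  (* (cdr entry) (rl-utility (car entry))))
                (rl-transition-probs rl-state action))))
  ;; return the action with the largest expected utility
  (let loop ((best       (car actions))
             (best-value (expected-utility (car actions)))
             (rest       (cdr actions)))
    (if (null? rest)
        best
        (let ((value (expected-utility (car rest))))
          (if (> value best-value)
              (loop (car rest) value (cdr rest))
              (loop best best-value (cdr rest)))))))

; with the tables printed earlier, this computation favors "stand" in
; both nonterminal states (for example, from state 0 hitting is worth
; about 0.202*(-0.503) + 0.567*(-0.403) + 0.231*(-1.000) = -0.56,
; versus -0.430 for standing), which is consistent with the
; (stand stand) policy shown above.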