CS 188: Artificial Intelligence
Spring 2006
Lecture 25: Games II
4/20/2006
Dan Klein – UC Berkeley

Recap: Minimax Trees

Minimax Search

DFS Minimax

α-β Pruning Example [Code in book]

α-β Pruning
- General configuration
- α is the best value (to MAX) found so far off the current path
- If V is worse than α, MAX will avoid it, so prune V's branch
- Define β similarly for MIN

α-β Pruning Properties
- Pruning has no effect on the final result
- Good move ordering improves the effectiveness of pruning
- With "perfect ordering":
  - Time complexity drops to O(b^(m/2))
  - Doubles the solvable depth
  - Full search of, e.g., chess is still hopeless!
- A simple example of metareasoning: here, reasoning about which computations are relevant

Resource Limits
- Cannot search to the leaves
- Limited search
  - Instead, search a limited portion of the tree
  - Replace terminal utilities with an eval function for non-terminal positions
- The guarantee of optimal play is gone
- Example: suppose we have 100 seconds and can explore 10K nodes/sec
  - So we can check 1M nodes per move
  - α-β reaches about depth 8 – a decent chess program

Evaluation Functions
- Function which scores non-terminals
- Ideal function: returns the utility of the position
- In practice: typically a weighted linear sum of features:
  eval(s) = w1 f1(s) + w2 f2(s) + ... + wn fn(s)
  e.g. f1(s) = (num white queens – num black queens), etc.
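The following is a minimal sketch (not from the lecture) of depth-limited minimax with α-β pruning that backs up a weighted-linear evaluation function at the depth cutoff. The GameState interface (is_terminal, utility, get_legal_actions, generate_successor) and the weight/feature lists are assumptions made up for illustration.

```python
import math

def linear_eval(state, weights, features):
    """Weighted linear sum of features: eval(s) = sum_i w_i * f_i(s)."""
    return sum(w * f(state) for w, f in zip(weights, features))

def alpha_beta(state, depth, alpha, beta, maximizing, evaluate):
    """Depth-limited minimax with alpha-beta pruning.

    alpha: best value MAX can guarantee so far on this path.
    beta:  best value MIN can guarantee so far on this path.
    """
    if state.is_terminal():
        return state.utility()
    if depth == 0:
        return evaluate(state)          # replace true utilities with eval(s)

    if maximizing:
        value = -math.inf
        for action in state.get_legal_actions():
            child = state.generate_successor(action)
            value = max(value, alpha_beta(child, depth - 1, alpha, beta, False, evaluate))
            if value >= beta:           # MIN will avoid this branch: prune
                return value
            alpha = max(alpha, value)
        return value
    else:
        value = math.inf
        for action in state.get_legal_actions():
            child = state.generate_successor(action)
            value = min(value, alpha_beta(child, depth - 1, alpha, beta, True, evaluate))
            if value <= alpha:          # MAX will avoid this branch: prune
                return value
            beta = min(beta, value)
        return value
```

A root call might look like alpha_beta(start, 8, -math.inf, math.inf, True, lambda s: linear_eval(s, weights, features)); with good move ordering the pruning lets the same node budget reach roughly twice the depth of plain minimax.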
Function Approximation
- Problem: it is inefficient to learn each state's utility (or eval function) one by one
- Solution: what we learn about one state (or position) should generalize to similar states
- Very much like supervised learning
- If states are treated entirely independently, we can only learn on very small state spaces

Linear Value Functions
- Another option: values are linear functions of features of states (or state-action pairs)
- Good if you can describe states well using a few features (e.g. for game-playing board evaluations)
- Now we only have to learn a few weights rather than a value for each state
[Figure: grid world with estimated state values ranging from 0.60 to 0.95]

Recap: Model-Free Learning
- Recall MDP value updates for a given estimate of U
- If you know the model T, use the Bellman update
- Temporal difference learning (TD):
  - Make an (epsilon-greedy) action choice (or follow a provided policy)
  - Update using the results of the action:
    U(s) ← U(s) + α [ R(s) + γ U(s') – U(s) ]

Example: Tabular Value Updates
- Example: Blackjack
  - +1 for win, -1 for loss or bust, 0 for tie
- Our hand shows 14, current policy says "hit"
- Current U(s) is 0.5
- We hit, get an 8, bust (end up in s' = "lose")
- Update:
  - Old U(s) = 0.5
  - Observed R(s) = 0
  - Old U(s') = -1
  - New U(s) = U(s) + α [ R(s) + γ U(s') – U(s) ]
  - If α = 0.1, γ = 1.0:
    New U(s) = 0.5 + 0.1 [ 0 + (-1) – 0.5 ] = 0.5 + 0.1 [-1.5] = 0.35

TD Updates: Linear Values
- Assume a linear value function: Uθ(s) = θ1 f1(s) + ... + θn fn(s)
- Can almost do a TD update: U(s) ← U(s) + α [ R(s) + γ U(s') – U(s) ]
- Problem: we can't "increment" U(s) explicitly
- Solution: update the weights of the features at that state:
  θi ← θi + α [ R(s) + γ Uθ(s') – Uθ(s) ] fi(s)

Learning Eval Parameters with TD
- Ideally, we want eval(s) to be the utility of s
- Idea: use techniques from reinforcement learning
  - Samuel's 1959 checkers system
  - Tesauro's 1992 backgammon system (TD-Gammon)
- Basic approach: temporal difference updates
  - Begin in state s
  - Choose an action using limited minimax search
  - See what the opponent does
  - End up in state s'
  - Do a value update of U(s) using U(s')
- Not guaranteed to converge against an adversary, but can work in practice

Q-Learning
- With TD updates on values:
  - You don't need the model to update the utility estimates
  - You still do need it to figure out what action to take!
- Q-learning with TD updates:
  - No model needed to learn or to choose actions

TD Updates for Linear Qs
- Can use TD learning with linear Qs (sketched below)
- (Actually it's just like the perceptron!)
- Old Q-learning update:
  Q(s, a) ← Q(s, a) + α [ R(s) + γ max_a' Q(s', a') – Q(s, a) ]
- Simply update the weights of the features in Qθ(a, s):
  θi ← θi + α [ R(s) + γ max_a' Qθ(s', a') – Qθ(s, a) ] fi(s, a)
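Below is a minimal sketch, under assumed interfaces, of that perceptron-like weight update for Q-learning with a linear Q-function. The dictionary-of-features representation and the helper names (q_value, q_learning_update) are illustrative choices, not code from the course.

```python
def q_value(weights, features):
    """Qθ(s, a) = sum_i θ_i * f_i(s, a) for the features of one (s, a) pair."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def q_learning_update(weights, feats_sa, reward, next_feats_by_action,
                      alpha=0.1, gamma=1.0):
    """One TD update of the weights after observing (s, a, r, s').

    feats_sa:              features f_i(s, a) of the state-action pair just taken
    next_feats_by_action:  {a': f(s', a')} for each legal action in s' (empty if terminal)
    """
    # Value of the best action from s' (0 if s' is terminal)
    next_value = max((q_value(weights, f) for f in next_feats_by_action.values()),
                     default=0.0)
    # TD error: [R(s) + γ max_a' Qθ(s', a')] - Qθ(s, a)
    difference = reward + gamma * next_value - q_value(weights, feats_sa)
    # Shift each weight in the direction of its feature, scaled by the error
    for name, value in feats_sa.items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return weights
```

Because every state-action pair sharing a feature shares the corresponding weight, one update also moves the Q-values of unseen but similar positions, which is exactly the generalization the Function Approximation slide asks for.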
Coming Up
- Real-world applications
- Large-scale machine / reinforcement learning
- NLP: language understanding and translation
- Vision: object and face recognition