CS 188: Artificial Intelligence
Spring 2006
Lecture 25: Games II
4/20/2006
Dan Klein – UC Berkeley

Recap: Minimax Trees

Minimax Search

DFS Minimax

α-β Pruning Example
[Code in book; see the sketch after the Evaluation Functions slide below]

α-β Pruning
General configuration: α is the best value (to MAX) found so far off the current path
If V is worse than α, MAX will avoid it, so prune V's branch
Define β similarly for MIN

α-β Pruning Properties
Pruning has no effect on the final result
Good move ordering improves the effectiveness of pruning
With "perfect ordering":
  Time complexity drops to O(b^(m/2))
  Doubles the solvable depth
Full search of, e.g., chess is still hopeless!
A simple example of metareasoning: here, reasoning about which computations are relevant

Resource Limits
Cannot search to the leaves
Limited search: instead, search a limited portion of the tree
Replace terminal utilities with an eval function for non-terminal positions
The guarantee of optimal play is gone
Example: suppose we have 100 seconds and can explore 10K nodes/sec
  So we can check 1M nodes per move
  α-β reaches about depth 8 – a decent chess program

Evaluation Functions
Function which scores non-terminals
Ideal function: returns the utility of the position
In practice: typically a weighted linear sum of features:
  Eval(s) = w_1 f_1(s) + w_2 f_2(s) + ... + w_n f_n(s)
e.g. f_1(s) = (num white queens – num black queens), etc.
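The pruning and cutoff ideas above combine naturally into a single procedure. Below is a minimal Python sketch of depth-limited α-β search with an evaluation function at the cutoff. It is not the book's code; the Game interface (successors, is_terminal, utility) and eval_fn are hypothetical stand-ins for whatever representation is used in practice.

```python
# Minimal sketch of depth-limited alpha-beta search (not the book's code).
# Assumes a hypothetical Game interface: successors(state) yields
# (action, next_state) pairs; is_terminal(state); utility(state).
# eval_fn scores non-terminal positions at the depth cutoff.

def alpha_beta(game, state, depth, eval_fn):
    def max_value(state, alpha, beta, depth):
        if game.is_terminal(state):
            return game.utility(state)
        if depth == 0:
            return eval_fn(state)          # cutoff: use the eval function
        v = float('-inf')
        for _, succ in game.successors(state):
            v = max(v, min_value(succ, alpha, beta, depth - 1))
            if v >= beta:                  # MIN will never let play reach here,
                return v                   # so prune the remaining successors
            alpha = max(alpha, v)
        return v

    def min_value(state, alpha, beta, depth):
        if game.is_terminal(state):
            return game.utility(state)
        if depth == 0:
            return eval_fn(state)
        v = float('inf')
        for _, succ in game.successors(state):
            v = min(v, max_value(succ, alpha, beta, depth - 1))
            if v <= alpha:                 # MAX will never let play reach here
                return v
            beta = min(beta, v)
        return v

    # Return the action whose subtree has the best backed-up value for MAX.
    return max(game.successors(state),
               key=lambda p: min_value(p[1], float('-inf'), float('inf'),
                                       depth - 1))[0]
```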
Function Approximation
Problem: it is inefficient to learn each state's utility (or eval function) one by one
Solution: what we learn about one state (or position) should generalize to similar states
Very much like supervised learning
If states are treated entirely independently, we can only learn on very small state spaces

Linear Value Functions
Another option: values are linear functions of features of states (or of state-action pairs)
Good if you can describe states well using a few features (e.g. for game-playing board evaluations)
Now we only have to learn a few weights rather than a value for each state
[Figure: grid world showing learned state values between 0.60 and 0.95]

Recap: Model-Free Learning
Recall MDP value updates for a given estimate of U
If you know the model T, use the Bellman update
Temporal difference learning (TD):
  Make an (epsilon-greedy) action choice (or follow a provided policy)
  Update using the results of the action:
  U(s) ← U(s) + α [ R(s) + γ U(s') − U(s) ]

Example: Tabular Value Updates
Example: blackjack
  +1 for a win, −1 for a loss or bust, 0 for a tie
Our hand shows 14; the current policy says "hit"
Current U(s) is 0.5
We hit, get an 8, and bust (end up in s' = "lose")
Update:
  Old U(s) = 0.5
  Observed R(s) = 0
  Old U(s') = −1
  New U(s) = U(s) + α [ R(s) + γ U(s') − U(s) ]
  If α = 0.1 and γ = 1.0:
  New U(s) = 0.5 + 0.1 [ 0 + (−1) − 0.5 ] = 0.5 + 0.1 (−1.5) = 0.35

TD Updates: Linear Values
Assume a linear value function: U(s) = Σ_i w_i f_i(s)
Can almost do a TD update
Problem: we can't "increment" U(s) explicitly
Solution: update the weights of the features at that state:
  w_i ← w_i + α [ R(s) + γ U(s') − U(s) ] f_i(s)

Learning Eval Parameters with TD
Ideally, we want eval(s) to be the utility of s
Idea: use techniques from reinforcement learning
  Samuel's 1959 checkers system
  Tesauro's 1992 backgammon system (TD-Gammon)
Basic approach: temporal difference updates
  Begin in state s
  Choose an action using limited minimax search
  See what the opponent does
  End up in state s'
  Do a value update of U(s) using U(s')
Not guaranteed to converge against an adversary, but can work in practice

Q-Learning
With TD updates on values:
  You don't need the model to update the utility estimates
  You still do need it to figure out what action to take!
Q-learning with TD updates:
  Q(s,a) ← Q(s,a) + α [ R(s) + γ max_a' Q(s',a') − Q(s,a) ]
  No model needed to learn or to choose actions

TD Updates for Linear Qs
Can use TD learning with linear Qs
(Actually, it's just like the perceptron!)
Old Q-learning update:
  Q(s,a) ← Q(s,a) + α [ R(s) + γ max_a' Q(s',a') − Q(s,a) ]
Simply update the weights of the features in Q(s,a):
  w_i ← w_i + α [ R(s) + γ max_a' Q(s',a') − Q(s,a) ] f_i(s,a)
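A minimal sketch of the linear-feature Q-update above, in Python. Only the weight-update rule comes from the slide; the function names (q_value, td_update), the featurize(s, a) feature function, and the actions list are hypothetical stand-ins.

```python
# Minimal sketch of Q-learning with a linear feature representation.
# featurize(s, a) -> list of feature values is a hypothetical placeholder.
# The update implements the rule from the slide:
#   w_i <- w_i + alpha * [ R(s) + gamma * max_a' Q(s',a') - Q(s,a) ] * f_i(s,a)

def q_value(weights, features):
    # Q(s,a) = sum_i w_i * f_i(s,a)
    return sum(w * f for w, f in zip(weights, features))

def td_update(weights, s, a, reward, s_prime, actions, featurize,
              alpha=0.1, gamma=1.0):
    feats = featurize(s, a)
    # max over a' of Q(s', a'); zero if s' is terminal (no available actions)
    future = max((q_value(weights, featurize(s_prime, ap)) for ap in actions),
                 default=0.0)
    error = reward + gamma * future - q_value(weights, feats)
    # every feature active at (s,a) is nudged in proportion to its value,
    # just like the perceptron update mentioned on the slide
    return [w + alpha * error * f for w, f in zip(weights, feats)]
```

Tabular Q-learning falls out as the special case where each (s,a) pair has its own indicator feature, so this one rule covers both the tabular and the linear slides.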
Coming Up
Real-world applications
Large-scale machine / reinforcement learning
NLP: language understanding and translation
Vision: object and face recognition