CS 188: Artificial Intelligence
Spring 2006
Lecture 25: Games II
4/20/2006
Dan Klein – UC Berkeley

Recap: Minimax Trees

Minimax Search

DFS Minimax

α-β Pruning Example [Code in book]

α-β Pruning
- General configuration
- α is the best value (to MAX) found so far off the current path
- If V is worse than α, MAX will avoid it, so prune V's branch
- Define β similarly for MIN

α-β Pruning Properties
- Pruning has no effect on the final result
- Good move ordering improves the effectiveness of pruning
- With "perfect ordering":
  - Time complexity drops to O(b^(m/2))
  - Doubles the solvable depth
  - Full search of, e.g., chess is still hopeless!
- A simple example of metareasoning: here, reasoning about which computations are relevant

Resource Limits
- Cannot search to the leaves
- Limited search
  - Instead, search a limited portion of the tree
  - Replace terminal utilities with an eval function for non-terminal positions
- The guarantee of optimal play is gone
- Example: suppose we have 100 seconds and can explore 10K nodes/sec
  - So we can check 1M nodes per move
  - α-β reaches about depth 8 – a decent chess program

Evaluation Functions
- Function which scores non-terminals
- Ideal function: returns the utility of the position
- In practice: typically a weighted linear sum of features:
  eval(s) = w1 f1(s) + w2 f2(s) + ... + wn fn(s)
  e.g. f1(s) = (num white queens – num black queens), etc.
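The following is a minimal sketch (not from the lecture) of depth-limited minimax with α-β pruning that backs up a weighted-linear evaluation function at the depth cutoff. The GameState interface (is_terminal, utility, get_legal_actions, generate_successor) and the weight/feature lists are assumptions made up for illustration.

```python
import math

def linear_eval(state, weights, features):
    """Weighted linear sum of features: eval(s) = sum_i w_i * f_i(s)."""
    return sum(w * f(state) for w, f in zip(weights, features))

def alpha_beta(state, depth, alpha, beta, maximizing, evaluate):
    """Depth-limited minimax with alpha-beta pruning.

    alpha: best value MAX can guarantee so far on this path.
    beta:  best value MIN can guarantee so far on this path.
    """
    if state.is_terminal():
        return state.utility()
    if depth == 0:
        return evaluate(state)          # replace true utilities with eval(s)

    if maximizing:
        value = -math.inf
        for action in state.get_legal_actions():
            child = state.generate_successor(action)
            value = max(value, alpha_beta(child, depth - 1, alpha, beta, False, evaluate))
            if value >= beta:           # MIN will avoid this branch: prune
                return value
            alpha = max(alpha, value)
        return value
    else:
        value = math.inf
        for action in state.get_legal_actions():
            child = state.generate_successor(action)
            value = min(value, alpha_beta(child, depth - 1, alpha, beta, True, evaluate))
            if value <= alpha:          # MAX will avoid this branch: prune
                return value
            beta = min(beta, value)
        return value
```

A root call might look like alpha_beta(start, 8, -math.inf, math.inf, True, lambda s: linear_eval(s, weights, features)); with good move ordering the pruning lets the same node budget reach roughly twice the depth of plain minimax.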
Function Approximation
- Problem: it is inefficient to learn each state's utility (or eval function) one by one
- Solution: what we learn about one state (or position) should generalize to similar states
- Very much like supervised learning
- If states are treated entirely independently, we can only learn on very small state spaces

Linear Value Functions
- Another option: values are linear functions of features of states (or state-action pairs)
- Good if you can describe states well using a few features (e.g. for game-playing board evaluations)
- Now we only have to learn a few weights rather than a value for each state
[Figure: grid world with estimated state values ranging from 0.60 to 0.95]

Recap: Model-Free Learning
- Recall MDP value updates for a given estimate of U
- If you know the model T, use the Bellman update
- Temporal difference learning (TD):
  - Make an (epsilon-greedy) action choice (or follow a provided policy)
  - Update using the results of the action:
    U(s) ← U(s) + α [ R(s) + γ U(s') – U(s) ]

Example: Tabular Value Updates
- Example: Blackjack
  - +1 for win, -1 for loss or bust, 0 for tie
- Our hand shows 14, current policy says "hit"
- Current U(s) is 0.5
- We hit, get an 8, bust (end up in s' = "lose")
- Update:
  - Old U(s) = 0.5
  - Observed R(s) = 0
  - Old U(s') = -1
  - New U(s) = U(s) + α [ R(s) + γ U(s') – U(s) ]
  - If α = 0.1, γ = 1.0:
    New U(s) = 0.5 + 0.1 [ 0 + (-1) – 0.5 ] = 0.5 + 0.1 [-1.5] = 0.35

TD Updates: Linear Values
- Assume a linear value function: Uθ(s) = θ1 f1(s) + ... + θn fn(s)
- Can almost do a TD update: U(s) ← U(s) + α [ R(s) + γ U(s') – U(s) ]
- Problem: we can't "increment" U(s) explicitly
- Solution: update the weights of the features at that state:
  θi ← θi + α [ R(s) + γ Uθ(s') – Uθ(s) ] fi(s)

Learning Eval Parameters with TD
- Ideally, we want eval(s) to be the utility of s
- Idea: use techniques from reinforcement learning
  - Samuel's 1959 checkers system
  - Tesauro's 1992 backgammon system (TD-Gammon)
- Basic approach: temporal difference updates
  - Begin in state s
  - Choose an action using limited minimax search
  - See what the opponent does
  - End up in state s'
  - Do a value update of U(s) using U(s')
- Not guaranteed to converge against an adversary, but can work in practice

Q-Learning
- With TD updates on values:
  - You don't need the model to update the utility estimates
  - You still do need it to figure out what action to take!
- Q-learning with TD updates:
  - No model needed to learn or to choose actions

TD Updates for Linear Qs
- Can use TD learning with linear Qs (sketched below)
- (Actually it's just like the perceptron!)
- Old Q-learning update:
  Q(s, a) ← Q(s, a) + α [ R(s) + γ max_a' Q(s', a') – Q(s, a) ]
- Simply update the weights of the features in Qθ(a, s):
  θi ← θi + α [ R(s) + γ max_a' Qθ(s', a') – Qθ(s, a) ] fi(s, a)
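Below is a minimal sketch, under assumed interfaces, of that perceptron-like weight update for Q-learning with a linear Q-function. The dictionary-of-features representation and the helper names (q_value, q_learning_update) are illustrative choices, not code from the course.

```python
def q_value(weights, features):
    """Qθ(s, a) = sum_i θ_i * f_i(s, a) for the features of one (s, a) pair."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def q_learning_update(weights, feats_sa, reward, next_feats_by_action,
                      alpha=0.1, gamma=1.0):
    """One TD update of the weights after observing (s, a, r, s').

    feats_sa:              features f_i(s, a) of the state-action pair just taken
    next_feats_by_action:  {a': f(s', a')} for each legal action in s' (empty if terminal)
    """
    # Value of the best action from s' (0 if s' is terminal)
    next_value = max((q_value(weights, f) for f in next_feats_by_action.values()),
                     default=0.0)
    # TD error: [R(s) + γ max_a' Qθ(s', a')] - Qθ(s, a)
    difference = reward + gamma * next_value - q_value(weights, feats_sa)
    # Shift each weight in the direction of its feature, scaled by the error
    for name, value in feats_sa.items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return weights
```

Because every state-action pair sharing a feature shares the corresponding weight, one update also moves the Q-values of unseen but similar positions, which is exactly the generalization the Function Approximation slide asks for.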
Coming Up
- Real-world applications
- Large-scale machine / reinforcement learning
- NLP: language understanding and translation
- Vision: object and face recognition