Published: From Animals to Animats, Fourth International Conference on Simulation of Adaptive Behavior (SAB'96), Cape Cod, Massachusetts, September 1996
Written by: Andrew Kachites McCallum
Presented by: James M. Vaccaro, CSE 254, June 4, 2002

Learning to Use Selective Attention and Short-Term Memory in Sequential Tasks

Scenario
• Application: a sequential task, passing trucks in heavy traffic
• Use selective attention and action to observe and react to the current state
[Figure: a car with a limited field of view among trucks on a four-lane roadway]
• Environment: four-lane roadway with trucks
• Rewards: -10 for scraping a truck, -1 for causing a truck to beep, and +0.1 otherwise
• Actions: gaze-forward-left, gaze-forward-center, gaze-forward-right, gaze-backward, shift-to-gaze-lane
• Sensors: hear horn, gaze object, gaze side, gaze direction, gaze speed, gaze distance, gaze refined distance, gaze color

Outline
• Issue: Dealing with Too Much and Too Little Information
• New Algorithm: the U-Tree Algorithm
• U-Tree Algorithm Components
  1. Utile Distinction Memory
  2. Instance-Based Learning with Nearest Sequence Memory
  3. Utile Suffix Memory and the Learning Tree
• U-Tree Algorithm Revisited
  1. Example of Combining Approaches
  2. Transition Instance Chain
  3. Learning the U-Tree
• Application & Demo
• Results, Inconsistencies & Summary
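The reward structure of the driving scenario is simple enough to state directly in code. This is a minimal sketch; the function name and the boolean event flags are assumptions for illustration, not the paper's simulator interface:

```python
def reward(scraped_truck: bool, truck_beeped: bool) -> float:
    """Reward signal for the truck-passing task: -10 for scraping a
    truck, -1 for causing a truck to beep, and +0.1 otherwise.
    (Hypothetical interface; the scraping penalty dominates.)"""
    if scraped_truck:
        return -10.0
    if truck_beeped:
        return -1.0
    return 0.1
```

Note that the default reward is a small positive constant, so uneventful driving is itself rewarded and the two penalties stand out sharply against that baseline.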
Dealing with Too Much and Too Little Information
• Too much: too many possible sensory states, actions, or observations
• Too little: not all pertinent states can be observed at any one time
[Figure: an 8-dimensional feature space over time, with |A| = 5, |R| = 3, and feature dimensions |D1| = 2, |D2| = 3, |D3| = 3, |D4| = 2, |D5| = 2, |D6| = 3, |D7| = 2, |D8| = 6; the sensors (hear horn, gaze object, gaze side, gaze direction, gaze speed, gaze distance, gaze refined distance, gaze color) are reduced to a compact, context-dependent set]

U-Tree Algorithm
• Key feature: it simultaneously handles "too much sensory data" and "too little sensory data"
• Two primary capabilities:
  - Selective attention, to prune the tree of instances
  - Short-term memory, to augment the perceptual state space
• Combines advantages of several previous algorithms: Parti-Game, Nearest Sequence Memory, Utile Distinction Memory, the G-Algorithm, and Utile Suffix Memory
• Two key structures:
  - A time-ordered chain of instances that represents raw experience
  - A tree in which these instances are organized

What Makes Up the U-Tree Algorithm?
Three techniques combine to form the U-Tree algorithm:
(1) Utile Distinction Memory [McCallum 1993]: uses a statistical technique, the utile distinction test, to separate noise from task structure and keep only the short-term memories that help predict reward
(2) Parti-Game [Moore 1993] & Nearest Sequence Memory [McCallum 1995]: instance-based learning that finds conjunctions among features
(3) G-Algorithm [Chapman 1989]: the agent can select which individual features to attend to
(1) & (2) combined give Utile Suffix Memory (USM) [McCallum 1995]
New feature: the ability to divide percepts into components in order to perform selective attention
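The "time-ordered chain of instances" can be sketched as a linked list of transition instances. The class and field names below are assumptions for illustration; each instance records the action just taken, the resulting percept, and the reward:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class TransitionInstance:
    """One raw-experience record (hypothetical layout): the action
    just taken, the percept that followed, the reward received, and
    a link to the predecessor in time."""
    action: int
    percept: Tuple[int, ...]
    reward: float
    prev: Optional["TransitionInstance"] = None

def record(chain: list, action: int, percept: Tuple[int, ...],
           reward: float) -> TransitionInstance:
    """Append one instance to the time-ordered chain."""
    inst = TransitionInstance(action, percept, reward,
                              prev=chain[-1] if chain else None)
    chain.append(inst)
    return inst
```

The tree, the algorithm's second key structure, then organizes these same instance records into its leaves, so raw experience is stored once and indexed two ways.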
1st Technique: Utile Distinction Memory (UDM)
The utile distinction test distinguishes states that have different policy actions or different utilities, and merges states that have the same policy action and the same utility.

Notation:
• Observations: O = {o1, o2, ..., om}, where m is the number of observed features
• States: S = {s1, s2, ..., sn}, where n is the number of states
• Actions: A = {a1, a2, ..., ak}, where k is the number of possible actions
• Observation probability: O(o_j | s_i)
• Reward: r
• Transition probability: Pr(s_k | s_i, a_j)
• State-action value: q(s_i, a_j)
• State probability (belief): p(t)
• Utility (use) of a state: U(p(t))

Theorem: The state distinctions necessary for representing the optimal policy are not necessarily sufficient for learning the optimal policy (proof by counterexample).

State make-up: s(t) = <a(t-1), o(t), r(t)>

Utile Distinction Memory Algorithm
1. Agent's belief in s_j:
   p_j(t+1) ∝ O(o(t+1) | s_j) Σ_i Pr(s_j | s_i, a(t)) p_i(t)
2. Current policy:
   Q(p(t), a_j) = Σ_i q(s_i, a_j) p_i(t)
3. Update Q-values:
   q(s_i, a_t) ← q(s_i, a_t) + β p_i(t) [r + γ U(p(t+1)) - q(s_i, a_t)]
   (β is the learning rate; γ is the temporal discount factor)
4. Update expected utility:
   U(p(t+1)) = max_a Q(p(t+1), a)
5. Choose best action:
   a(t) = argmax_a Q(p(t), a)

Evaluating Potential Node-Splitting Possibilities
[Figure: a state in consideration for splitting, with its four outgoing actions, confidence intervals of future discounted reward for each of the four possible actions, and the possible previous states (part of a state-space model)]

Calculate return values over time:
  return[m] = r[m]
  for t = m-1 down to 0:
      return[t] = r[t] + γ * return[t+1]

Calculate upper/lower-bound statistics (α is the learning rate):
  for t = 1 to m-1:
      for all transitions trans that use action A[t-1]:
          w = α * w_trans[t-1]
          trans.count[A[t]]      += w
          trans.sum[A[t]]        += w * return[t]
          trans.sumsquares[A[t]] += w * (return[t])^2

Splitting Nodes Based on Confidence Intervals
[Figure: confidence intervals on future discounted reward (r1 through r5) for states 1-6; before the split, state 6 is a single node; after the split, it is divided into 6a and 6b]
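The return calculation and the per-action statistics it feeds can be made concrete. This sketch uses the same backward recurrence and accumulates the count/sum/sum-of-squares triple from which a mean, a variance, and hence a confidence interval on each action's expected return follow; the learning-rate weighting w from the slides is dropped here for clarity, i.e. every sample gets weight 1:

```python
def discounted_returns(r, gamma):
    """Backward pass: return[t] = r[t] + gamma * return[t+1]."""
    ret = [0.0] * len(r)
    ret[-1] = r[-1]
    for t in range(len(r) - 2, -1, -1):
        ret[t] = r[t] + gamma * ret[t + 1]
    return ret

def action_statistics(actions, returns):
    """Per-action sufficient statistics of future discounted reward:
    (count, sum, sum of squares). Mean and variance, and so a
    confidence interval, can be derived from these three numbers."""
    stats = {}
    for a, g in zip(actions, returns):
        c, s, ss = stats.get(a, (0, 0.0, 0.0))
        stats[a] = (c + 1, s + g, ss + g * g)
    return stats
```

Keeping only these three running numbers per action is what makes the node-splitting test cheap: no raw reward history has to be retained to compare candidate splits.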
Node-Splitting Test: Kolmogorov-Smirnov (KS)
Statistical tests in general, with the KS details:
1. Null hypothesis
   KS: calculate or use the reference cumulative frequency distribution F0(X). Assume F0(X) is a Student's t distribution with n-1 degrees of freedom; calculate a sample mean and standard deviation from the observed future discounted rewards for each action.
2. Alternative hypothesis
   KS: the observed cumulative frequency distribution Sn(X). Store the future discounted reward distribution for each action and set it equal to Sn(X).
3. Decision statistic
   KS: D = max |F0(X) - Sn(X)|, the largest absolute difference between F0(X) and Sn(X).
4. Rejection region
   KS: for a specified significance level α and sample size n, reject if D exceeds the critical value. Promote the split of nodes from the previous state if the null hypothesis is rejected.

Utile Distinction Memory (UDM) Example
[Figure: a grid-world over Trials 1-4 in which states 5, 7, and 10 are progressively split into 5a/5b, 7a/7b, and 10a/10b]
Theorem (restated): The state distinctions necessary for representing the optimal policy are not necessarily sufficient for learning the optimal policy. (Proof by counterexample.)

2nd Technique: Nearest Sequence Memory (NSM)
• Learning in a geometric space: k-nearest neighbor, k = 3, in 2 dimensions
• Learning in a sequence space: k-nearest neighbor, k = 3
[Figure: a chain of (action, percept, reward) triples, e.g. 1 8 5 5 2 8 5 5 2 5 5 8 5 5 9 1 8 5 5, with the match length of each stored position against the current sequence]

Nearest Sequence Memory Algorithm (Steps 1 & 2)
• Match length between instances s_i and s_j:
  n(s_i, s_j) = 1 + n(s_{i-1}, s_{j-1}), if (a_{i-1} = a_{j-1}) ∧ (o_{i-1} = o_{j-1}) ∧ (r_{i-1} = r_{j-1})
  n(s_i, s_j) = 0, otherwise
• Vote:
  v(s_i) = 1, if n(s_t, s_i) is among the k largest match lengths
  v(s_i) = 0, otherwise
• Action value:
  Q_t(a) = (1/k) Σ_{s_j : a_j = a} v(s_j) q(s_j)
Step 1: find the k nearest neighbors of the chained events.
Step 2: vote for the best neighbors and resolve ties with
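Steps 1 and 2 of NSM can be sketched as follows. The instance layout (a list of (action, percept, reward) triples) and the tie-breaking by list order are assumptions for illustration:

```python
def match_length(chain, i, j):
    """n(s_i, s_j): how many consecutive (a, o, r) triples match,
    working backward from positions i and j in the chain."""
    n = 0
    while i - 1 >= 0 and j - 1 >= 0 and chain[i - 1] == chain[j - 1]:
        n += 1
        i, j = i - 1, j - 1
    return n

def nsm_vote(chain, t, k, q):
    """Step 1: rank stored instances by match length against the
    current time t.  Step 2: the k nearest neighbors each vote
    q(s_j)/k for the action they took (ties broken by list order,
    an assumption; the slide's tie-breaking rule is cut off)."""
    ranked = sorted(range(len(chain) - 1),
                    key=lambda j: match_length(chain, t, j),
                    reverse=True)
    votes = {}
    for j in ranked[:k]:
        a = chain[j][0]  # the action component of the triple
        votes[a] = votes.get(a, 0.0) + q[j] / k
    return votes
```

Because the match length looks backward through identical (action, percept, reward) histories, NSM disambiguates perceptually aliased states by context length rather than by the current percept alone.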