Chapter 6: Adaptation

Adaptation
- Previous chapters were about calculated rationality: some sort of choice in a static world.
- Here we assume that the person acts, the world reacts, and then the person must choose how to adapt to the world's response. Thus: "adaptively rational."

Adaptation and Communication
- Rats in a maze: the cheese is a message that means "turn toward here."
- A kid gets in trouble to get attention: the trouble is a message that says "attend to me."
- Someone (or the world) communicates to a person, who then learns and adapts.
- Teaching, persuading, influencing, attracting: does all communication ask for adaptation by the receiver?

Basic Model
- Reinforcement learning is the main topic of the chapter.
  o Alternative learning models: S-R learning, modeling behavior, etc.
- Example of a rat (Alfred) in a maze:
  o Initially random behavior
  o Finds some reinforcement
  o Adapts behavior to the prospect of reward

Alfred
- Trial-and-error learning: behavior becomes less and less random.
- At first:
  o Pr(0) = Pl(0) = .50
  o Pr is the probability of a right turn; the 0 refers to time zero, before Alfred's first trial.
- Since the food is always on the right, over time the probability of turning right increases:
  o Pr(t+1) = Pr(t) + some increment
- So we can understand our problem as needing to model the increment.

Modeling the Increment
- Model 1: a constant-increment model.
  o Suppose the increment is .2: Pr(0) = .50, Pr(1) = .70, Pr(2) = .90, Pr(3) = 1.10.
  o Well, it looked reasonable for a while (a probability cannot exceed 1).
- Model 2: a constant-proportion model.
  o If the cheese is always on the right, then 1 - Pr(t) represents how much Alfred has yet to learn.
  o Suppose we model him this way: on each trial, Alfred learns a constant proportion of what he has left to learn.
- Model 2 requires some constant a that represents the proportion Alfred learns on each trial:
  o Pr(1) = Pr(0) + a[1 - Pr(0)]
  o This means the probability of going right on some trial = that probability from the previous trial + the learning rate a x what was yet to be learned on the prior trial.
  o If you find cheese, your probability of going right on the next trial equals your probability of going right before, plus an increment: how fast you learn from finding cheese x how much you have left to learn.
- For Pr(0) in the rat example, the 0 refers to a point in time: it is the rat's initial probability of turning right, before receiving any training.
- Model 2 never gives a result where the probability reaches 1: since a is a fraction, at each step you add only a fraction of what was left to learn.
- This produces an asymptotic curve in which Alfred's behavior approaches Pr = 1 (p. 225).
- Model 2 fits observed results reasonably well. It predicts:
  o Learning occurs.
  o Probabilities rise faster at the start of learning.
  o "Stupid" choices remain possible, though decreasingly likely.
  o We can measure the learning rate a.
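A quick way to see Model 2's asymptotic curve is to simulate it. The sketch below is not from the text: it assumes the cheese is always on the right (so every trial applies the constant-proportion increment), and the function name model2_curve and the illustrative rate a = .3 are my own choices.

    # Sketch of Model 2: each trial adds a fraction a of what is left to learn.
    def model2_curve(pr0=0.5, a=0.3, trials=10):
        """Return Pr(0), Pr(1), ..., Pr(trials), assuming a reward on every trial."""
        probs = [pr0]
        for _ in range(trials):
            pr = probs[-1]
            probs.append(pr + a * (1 - pr))  # Pr(t+1) = Pr(t) + a[1 - Pr(t)]
        return probs

    # Pr rises fastest at the start and approaches 1 without ever reaching it:
    # roughly .50, .65, .76, .83, .88, .92, .94, ...
    print([round(p, 2) for p in model2_curve()])

This reproduces the predictions listed above: rapid early learning, a slowing rate of change, and probabilities that never quite reach 1.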
11/14/12

Assumptions
- Alternative behaviors are available (left, right).
- The state of the individual is described by probabilities that add up to 1.
- The world responds differently to the various behaviors (cheese, no cheese).
- The state of the world is described by the probabilities of the various responses to each behavior (e.g., right -> 70% cheese).
- The set of possible events is the set of possible behaviors crossed with the possible world responses (left & no reward, left & reward, ...).
- Specify adaptation equations for each possible event (e.g., event 1 = left & no reward).

Adaptation Equations
- Learning = adaptation.
- The model generates a set of equations, and these describe the learning.
  o Accomplished by explaining how the probabilities change.
- The text's model will be restricted to:
  o Two choices (left, right)
  o Two outcomes (cheese, no cheese)
  o One person or party doing the learning (Alfred)
- Two initial choice probabilities:
  o Pr(t) = probability of doing behavior R at time t
  o Pl(t) = 1 - Pr(t) = probability of doing behavior L at time t
- Four events:
  o E1: L and reward
  o E2: L and no reward
  o E3: R and reward
  o E4: R and no reward

Alfred's Equations
- E1 (L and reward): increase Pl, decrease Pr.
  o Pl(t+1) = Pl(t) + a[1 - Pl(t)]
  o We've seen this before: a is the rate of learning what is yet to be learned.
- E2 (L and no reward): decrease Pl, increase Pr.
  o Pl(t+1) = Pl(t) - b*Pl(t)
  o The b is new. It is the learning rate associated with non-reward, just as a is the rate for reward. It also takes values between 0 and 1.
  o NB the negative sign before the b: it is a decrement (no cheese).
  o a and b are both fractions between 0 and 1.
- Why is b multiplied by Pl(t) instead of by 1 - Pl(t), the way a is? Because the non-reward reduces the likelihood of choosing L again, so we multiply Pl by some fraction (say .2, i.e., b) and subtract that fraction from the original Pl after the non-reward.

Clicker Questions
- The higher a is, the faster the party learns from being rewarded.
- The higher b is, the faster the party learns from not being rewarded.

Alfred's Equations (continued)
- E3 (R and reward): decrease Pl, increase Pr.
  o Pr(t+1) = Pr(t) + a[1 - Pr(t)]
- E4 (R and no reward): increase Pl, decrease Pr.
  o Pr(t+1) = Pr(t) - b*Pr(t)
- Notice that a and b have no particular relationship to one another (such as summing to 1 or being equal).

Preliminary Conclusions
- Rewarded behavior becomes more probable; non-rewarded behavior becomes less probable.
- Rewards cause behavior to change at rate a; non-rewards cause behavior to change at rate b.
- The amount of adaptation on any trial is always a constant fraction (a or b) of the amount left to learn.

A Worked Alfred Problem (p. 262 f.)
- Assume Pl(0) = .50, a = .3, b = .2. Make sure you can do the math.
- NB: Pl = 1 - Pr, or Pl + Pr = 1.

Worked problem from lecture
  o a = .5, b = .1, Pl(0) = .3; the reward is on the left.
  o Trial 1: Alfred goes left with probability .3 or right with probability .7.
  o If he goes left (rewarded): Pl(1) = Pl(0) + a[1 - Pl(0)] = .3 + .5(1 - .3) = .65, so .65 go left / .35 go right.
  o If he goes right (no reward): Pl(1) = 1 - Pr(1) = 1 - [Pr(0) - b*Pr(0)] = 1 - [.7 - .1(.7)] = 1 - .63 = .37, so .37 go left / .63 go right.
  o .65 + .37 does not equal 1, and it does not need to. The only probabilities that must add up to 1 are those that come out of the same node (.3 and .7); Pl(1) = .65 and Pl(1) = .37 do not come out of the same node.
  o From the .65 branch: with probability .35 he goes right on trial 2 (no reward), giving Pl(2) = .685; with probability .65 he goes left on trial 2 (rewarded), giving Pl(2) = .825.
  o Went left twice and was rewarded twice, so we need the increment equation (rewarded twice, the probability should be higher): Pl(2) = Pl(1) + a[1 - Pl(1)] = .65 + .5(.35) = .825.
  o Pl(2) = 1 - Pr(2) = 1 - [Pr(1) …
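The four adaptation equations can be collapsed into a single update rule for Pl. The sketch below is my own rendering of the equations in these notes (the function name update and its boolean arguments are assumptions, not from the text):

    # Sketch of Alfred's adaptation equations E1-E4, tracking Pl (with Pr = 1 - Pl).
    def update(pl, went_left, rewarded, a, b):
        """Return Pl(t+1) given what happened on trial t."""
        pr = 1 - pl
        if went_left and rewarded:        # E1: Pl(t+1) = Pl(t) + a[1 - Pl(t)]
            return pl + a * (1 - pl)
        if went_left and not rewarded:    # E2: Pl(t+1) = Pl(t) - b*Pl(t)
            return pl - b * pl
        if rewarded:                      # E3: Pr(t+1) = Pr(t) + a[1 - Pr(t)]
            return 1 - (pr + a * (1 - pr))
        return 1 - (pr - b * pr)          # E4: Pr(t+1) = Pr(t) - b*Pr(t)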
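As a quick check, running the update sketch above on the lecture's worked problem (a = .5, b = .1, Pl(0) = .3, reward always on the left) reproduces the numbers in the notes; the variable names here are my own.

    a, b, pl0 = 0.5, 0.1, 0.3
    pl1_left  = update(pl0, went_left=True,  rewarded=True,  a=a, b=b)       # .65
    pl1_right = update(pl0, went_left=False, rewarded=False, a=a, b=b)       # .37
    pl2_LL    = update(pl1_left, went_left=True,  rewarded=True,  a=a, b=b)  # .825
    pl2_LR    = update(pl1_left, went_left=False, rewarded=False, a=a, b=b)  # .685
    print(pl1_left, pl1_right, pl2_LL, pl2_LR)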