Slide 1: Affective Behavior with Theory of Mind
David V. Pynadath

Slide 2: Reason vs. Emotion
Distinction in philosophy
– Reason should be the slave of emotion (Hume)
– Our emotions impede our reason (Stoics)
Distinction in computational modeling
– Emotions are irrational
– Emotion is needed for more accurate models
But what is reason?

Slide 3: Rational Decision Making
What is "rational"
– e.g., for the row player in the Iterated Prisoner's Dilemma (IPD)?

              Cooperate  Defect
  Cooperate     3, 3      0, 5
  Defect        5, 0      1, 1

Slide 4: Is defecting rational?
It is a Nash equilibrium strategy
– It is a best response to either action
– But what if the opponent is playing tit-for-tat?
(Payoff matrix repeated from Slide 3)

Slide 5: Is cooperating rational?
It produces the social optimum
– Best response to tit-for-tat
– But what if the other player always defects?
(Payoff matrix repeated from Slide 3)

Slide 6: Ideal Rational Agent (Russell & Norvig)
"For each possible percept sequence, an ideal rational agent should do whatever action is expected to maximize its performance measure, on the basis of the evidence provided by the percept sequence and whatever built-in knowledge the agent has."

Slide 7: Performance Measure
My total score
– Plus some percentage of my partner's?
– Sooner better than later?
(Payoff matrix repeated from Slide 3)

Slide 8: Knowledge
I believe:
– People are cooperative
– People are selfish
(Payoff matrix repeated from Slide 3)

Slide 9: Evidence
If I defect, then the other defects:
– Tit-for-tat?
– Some other strategy?
(Payoff matrix repeated from Slide 3)

Slide 10: Rational if maximizing performance
"...whatever action is expected to maximize its performance measure, on the basis of the evidence..."
Decision Theory
– Utility represents the performance measure
– A probability distribution captures the evidence
– Choose the action that maximizes expected utility
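The decision-theoretic recipe on Slide 10 can be made concrete for the payoff matrix above. The following is a minimal sketch, not from the slides (function and variable names are my own): represent the evidence as a probability that the other player cooperates, and pick the row action with the highest expected utility.

```python
# Row player's payoffs in the Prisoner's Dilemma from Slide 3:
# (my action, other's action) -> my payoff
PAYOFF = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def expected_utility(my_action, p_other_cooperates):
    """Expected payoff of my_action, given my belief about the other."""
    return (p_other_cooperates * PAYOFF[(my_action, "C")]
            + (1 - p_other_cooperates) * PAYOFF[(my_action, "D")])

def best_response(p_other_cooperates):
    """Choose the action that maximizes expected utility."""
    return max(("C", "D"), key=lambda a: expected_utility(a, p_other_cooperates))

# In the one-shot game, defecting maximizes expected utility
# no matter what I believe about the other player:
assert best_response(0.0) == "D"
assert best_response(1.0) == "D"
```

This is exactly why the one-shot analysis is unsatisfying: Defect dominates for every belief, and the later slides need repeated play and mental models before cooperation can come out as rational.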
Slide 11: Rational if using knowledge of the other
"...the evidence provided by the percept sequence and whatever built-in knowledge the agent has."
Theory of Mind
– Other people are also reasoning
– Maximizing their own performance measures
– ...and also using Theory of Mind about me

Slide 12: PsychSim
Decision Theory + Theory of Mind
– Maximize expected utility
– Uncertainty about the model of the other
Open question:
– Is this enough to express social phenomena?

Slide 13: Markov Decision Problems
[Diagram: Action, State, Reward, State]

Slide 14: Markov Decision Problems
State, s
– Money won by me
– Money won by the other
Action, a
– Cooperate
– Defect
Reward, R(s, a)
– My money + α · other's money

Slide 15: Markov Decision Problems
Transition probability, P(s0, a, s1)
– For every possible initial state and action
➔ A probability distribution over the resulting state
Compute expected reward
– V_t(s0, a) = R(s0, a) + Σ_{s1} P(s0, a, s1) · V_{t−1}(s1), where V(s) = max_a V(s, a)
Many off-the-shelf algorithms

Slide 16: What's missing?
R(s, a) and P(s0, a, s1)
– Both depend on the action of the other player
(Payoff matrix repeated from Slide 3)

Slide 17: Mental Models
What is in the other player's head?
– I can't read minds
– But I have prior knowledge about people
– And as we iterate the game, I get evidence
State includes possible models of others
– e.g., beliefs, reward, strategy, etc.
Including mental models of me
– e.g., my beliefs, my reward, my strategy, etc.
» Including my mental models of the other
» ...

Slide 18: Mental Models in IPD
Mental models are hidden state
– Reward: my money + α · other's money
– α = 1 (altruistic), 0 (selfish), 0.5 (mixed)
– Models of me: α = 1 (altruistic), 0 (selfish)
The other's action is driven by its model
– The model affects the transition of my action
– The model affects the reward of my action

Slide 19: Markov Decision Problems (MDPs)
[Diagram: Action, State+Model, Reward, State+Model]

Slide 20: Partially Observable MDPs
[Diagram: the Slide 19 loop, plus Beliefs]

Slide 21: Partially Observable MDPs
[Diagram: as above, plus an Observation]

Slide 22: Partially Observable MDPs
[Diagram: as above, plus updated Beliefs]

Slide 23: Hypothetical Reasoning (mixed case)
If the other is altruistic (regardless of what it thinks of me):
– I expect the other to cooperate (6 > 5 and 5 > 2)
– So I will defect (5 > 4.5)
If the other is selfish (regardless of what it thinks of me):
– I expect the other to defect (5 > 3 and 1 > 0)
– So I will cooperate (2.5 > 1.5)

Slide 24: Hypothetical Reasoning (mixed case)
If the other is mixed:
– And thinks I am altruistic:
» I expect the other to defect (5 > 4.5)
» So I will cooperate (2.5 > 1.5)
– And thinks I am selfish:
» I expect the other to cooperate (2.5 > 1.5)
» So I will defect (5 > 4.5)

Slide 25: Hypothetical Reasoning (mixed case)
Unfortunately, I am uncertain about the other:
– ER(Cooperate) = P(other is altruistic) · 4.5
  + P(other is selfish) · 2.5
  + P(other is mixed, thinks I am altruistic) · 2.5
  + P(other is mixed, thinks I am selfish) · 4.5
– ER(Defect) = P(other is altruistic) · 5.0
  + P(other is selfish) · 1.5
  + P(other is mixed, thinks I am altruistic) · 1.5
  + P(other is mixed, thinks I am selfish) · 5.0

Slide 26: Hypothetical Reasoning (mixed case)
But this is the Iterated Prisoner's Dilemma
– Immediate reward is only one part
– My action affects the other's beliefs about me
– Which in turn affects the other's future behavior
– Which in turn affects my future rewards
I update my beliefs as well
– Observe the other's action
– Modify my distribution over α = 0, 0.5, 1

Slide 27: Consistency
If I observe the other cooperate:
– I should increase my belief that the other is altruistic
– I should decrease my belief that the other is selfish
Computationally:
– Agents are more likely to pursue higher reward
– So models that give the observed action higher reward are more likely

Slide 28: Hypothetical Reasoning Revisited
If the other is altruistic (regardless of what it thinks of me):
– I expect the other to cooperate (6 > 5 and 5 > 2)
– So I will defect (5 > 4.5)
If the other is selfish (regardless of what it thinks of me):
– I expect the other to defect (5 > 3 and 1 > 0)
– So I will cooperate (2.5 > 1.5)

Slide 29: Belief Update
Prefer the model that better explains the behavior
– Reuse the expectations already generated
– Favor models under which the observed behavior has higher reward
Use the same mechanism to model the other
– If I cooperate, the other thinks I'm more altruistic
– If I defect, the other thinks I'm more selfish
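The consistency principle of Slide 27 — models under which the observed action earns higher reward should gain probability — can be sketched as a Bayesian update with a soft-max likelihood. This is one common way to implement that idea, not necessarily PsychSim's; the α values come from Slide 18, while the soft-max temperature and all function names are my own illustrative choices. For simplicity the sketch treats the other as a one-step maximizer and ignores its nested model of me.

```python
import math

# (my action, other's action) -> (my payoff, other's payoff), from Slide 3
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def model_reward(alpha, other_action, my_action):
    """Other's reward under mental model alpha: own money + alpha * my money."""
    mine, others = PAYOFF[(my_action, other_action)]
    return others + alpha * mine

def update_beliefs(beliefs, observed_action, my_action, temperature=1.0):
    """Bayes' rule with a soft-max likelihood: models that give the
    observed action higher reward become more likely."""
    posterior = {}
    for alpha, prior in beliefs.items():
        rewards = {a: model_reward(alpha, a, my_action) for a in ("C", "D")}
        z = sum(math.exp(r / temperature) for r in rewards.values())
        posterior[alpha] = prior * math.exp(rewards[observed_action] / temperature) / z
    total = sum(posterior.values())
    return {alpha: p / total for alpha, p in posterior.items()}

# Uniform prior over selfish (0), mixed (0.5), altruistic (1):
beliefs = {0.0: 1/3, 0.5: 1/3, 1.0: 1/3}
# Observing cooperation (after I cooperated) shifts belief toward
# the altruistic model and away from the selfish one:
beliefs = update_beliefs(beliefs, observed_action="C", my_action="C")
assert beliefs[1.0] > beliefs[0.0]
```

The temperature controls how strongly reward differences translate into likelihood differences: as it shrinks, the update approaches "only perfectly rational models survive"; as it grows, observations carry less and less evidence.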
Slide 30: Decision Cycle and Belief Update
[Diagram: the POMDP loop of Slides 20–22 — Action, State+Model, Reward, Beliefs, Observation (deck truncated here; slides 31–52 are not shown)]
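The decision cycle shown in the diagram — act to maximize expected reward, observe the other's action, update beliefs over mental models, repeat — can be condensed into a short sketch. This is my own summary of Slides 23–29, not PsychSim's actual API; in particular, the fixed 2:1 likelihood ratio in the belief update and all names are illustrative assumptions.

```python
# (my action, other's action) -> (my payoff, other's payoff), from Slide 3
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def predicted_action(alpha, my_action):
    """Action the other prefers under model alpha (own money + alpha * mine)."""
    def reward(other_action):
        mine, others = PAYOFF[(my_action, other_action)]
        return others + alpha * mine
    return max(("C", "D"), key=reward)

def decision_cycle(beliefs, observe, my_alpha=0.5, steps=10):
    """One agent's loop: act to maximize expected reward, observe the
    other's action, then shift belief toward models that predicted it."""
    for _ in range(steps):
        def expected_reward(my_action):
            total = 0.0
            for alpha, p in beliefs.items():
                other = predicted_action(alpha, my_action)
                mine, others = PAYOFF[(my_action, other)]
                total += p * (mine + my_alpha * others)
            return total
        my_action = max(("C", "D"), key=expected_reward)
        observed = observe(my_action)  # the other player responds
        # Models that predicted the observation gain weight (2:1 is arbitrary)
        beliefs = {a: p * (2.0 if predicted_action(a, my_action) == observed else 1.0)
                   for a, p in beliefs.items()}
        total = sum(beliefs.values())
        beliefs = {a: p / total for a, p in beliefs.items()}
    return beliefs

# Against an always-cooperating partner, belief converges on altruism:
final = decision_cycle({0.0: 0.5, 1.0: 0.5}, observe=lambda a: "C")
assert final[1.0] > 0.9
```

Note what this sketch leaves out relative to the slides: it maximizes only immediate expected reward, whereas Slide 26 argues that my action also shapes the other's beliefs about me and hence my future rewards — capturing that requires planning over the belief dynamics, not just the one-step payoff.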