CS 2710 Foundations of AI
Lecture 24
Learning probability distributions
Milos Hauskrecht
[email protected]
Sennott Square

Unsupervised learning
• Data: D = {D_1, D_2, ..., D_n}, where each D_i = x_i is a vector of attribute values
  – e.g. the description of a patient
  – no specific target attribute we want to predict (no output y)
• Objective:
  – learn (describe) relations between attributes and examples
Types of problems:
• Clustering
  – group together "similar" examples
• Density estimation
  – model probabilistically the population of examples

Density estimation
Data: D = {D_1, D_2, ..., D_n}, where each D_i = x_i is a vector of attribute values
Attributes: modeled by random variables X = {X_1, X_2, ..., X_d} with:
  – continuous values
  – discrete values
E.g. blood pressure with numerical values, or chest pain with discrete values [no-pain, mild, moderate, strong]
Underlying true probability distribution: p(X)

Density estimation
Objective: try to estimate p̂(X), the underlying true probability distribution p(X) over variables X, using the n samples in D = {D_1, D_2, ..., D_n}
Standard (iid) assumptions: samples
• are independent of each other
• come from the same (identical) distribution (fixed p(X))

Learning via parameter estimation
In this lecture we consider parametric density estimation.
Basic settings:
• A set of random variables X = {X_1, X_2, ..., X_d}
• A model of the distribution over variables in X with parameters Θ
• Data D = {D_1, D_2, ..., D_n}
Objective: find parameters Θ̂ that fit the data the best, or in other words reduce the misfit between the data and the model
• What is the best set of parameters?
  – There are various criteria one can apply here.

Parameter estimation.
Basic criteria.
• Maximum likelihood (ML) criterion:
  Θ_ML = argmax_Θ p(D | Θ, ξ)
  – p(D | Θ, ξ) is the likelihood of the data
• Maximum a posteriori probability (MAP) criterion:
  Θ_MAP = argmax_Θ p(Θ | D, ξ)
  – p(Θ | D, ξ) is the posterior probability, given by
    p(Θ | D, ξ) = p(D | Θ, ξ) p(Θ | ξ) / p(D | ξ)
  – MAP selects the mode of the posterior
• ξ represents prior (background) knowledge

Parameter estimation. Coin example.
Coin example: we have a coin that can be biased.
Outcomes: two possible values – head or tail
Data: D, a sequence of outcomes x_i such that
• head: x_i = 1
• tail: x_i = 0
Model: θ – probability of a head; (1 − θ) – probability of a tail
Objective: we would like to estimate the probability of a head, θ̂, from the data

Parameter estimation. Example.
• Assume the unknown and possibly biased coin
• Probability of a head is θ
• Data: H H T T H H T H T H T T T H T H H H H T H H H H T
  – Heads: 15
  – Tails: 10
What would be your estimate of the probability of a head, θ̃?
Solution: use frequencies of occurrences to do the estimate:
θ̃ = 15/25 = 0.6
This is the maximum likelihood estimate of the parameter θ.

Probability of an outcome
Data: D, a sequence of outcomes x_i such that head: x_i = 1, tail: x_i = 0
Model: θ – probability of a head; (1 − θ) – probability of a tail
Assume: we know the probability θ
Probability of an outcome x_i of a coin flip (Bernoulli distribution):
P(x_i | θ) = θ^{x_i} (1 − θ)^{1 − x_i}
– Combines the probability of a head and a tail, so that x_i picks its correct probability
– Gives θ for x_i = 1
– Gives (1 − θ) for x_i = 0

Probability of a sequence of outcomes.
Assume: a sequence of independent coin flips D = H H T H T H (encoded as D = 110101)
What is the probability of observing the data
sequence D?
P(D | θ) = θ · θ · (1 − θ) · θ · (1 − θ) · θ
– this is the likelihood of the data
This can be rewritten using the Bernoulli distribution:
P(D | θ) = ∏_{i=1}^{6} θ^{x_i} (1 − θ)^{1 − x_i}

The goodness of fit to the data.
Learning: we do not know the value of the parameter θ
Our learning goal:
• Find the parameter θ that fits the data D the best?
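The question above can be checked numerically. The sketch below is not from the lecture (the function name and grid resolution are illustrative choices): it computes the Bernoulli likelihood P(D | θ) = ∏ θ^{x_i} (1 − θ)^{1 − x_i} for the encoded sequence D = 110101 and scans a grid of θ values for the best fit.

```python
def seq_likelihood(xs, theta):
    """P(D | theta): product over flips of theta^x * (1 - theta)^(1 - x)."""
    p = 1.0
    for x in xs:
        p *= theta**x * (1 - theta)**(1 - x)
    return p

# D = H H T H T H, encoded as 110101
D = [1, 1, 0, 1, 0, 1]

# Crude grid search over theta in (0, 1) for the maximizer of the likelihood
grid = [i / 100 for i in range(1, 100)]
theta_best = max(grid, key=lambda t: seq_likelihood(D, t))
```

On this grid the maximizer lands next to 4/6 ≈ 0.667, the fraction of heads in D — matching the frequency-based estimate the lecture derives analytically via the log-likelihood.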
One solution to "the best": maximize the likelihood
Intuition:
• the more likely the data are given the model, the better the fit
Note: instead of an error function that measures how badly the data fit the model, we have a measure that tells us how well the data fit the model:
P(D | θ) = ∏_{i=1}^{n} θ^{x_i} (1 − θ)^{1 − x_i}
Error(D, θ) = − P(D | θ)

Maximum likelihood (ML) estimate.
Maximum likelihood estimate:
θ_ML = argmax_θ P(D | θ, ξ)
Likelihood of data:
P(D | θ, ξ) = ∏_{i=1}^{n} θ^{x_i} (1 − θ)^{1 − x_i}
Optimize the log-likelihood (the same as maximizing the likelihood):
l(D, θ) = log P(D | θ, ξ) = log ∏_{i=1}^{n} θ^{x_i} (1 − θ)^{1 − x_i}
        = ∑_{i=1}^{n} x_i log θ + ∑_{i=1}^{n} (1 − x_i) log(1 − θ)
        = N_1 log θ + N_2 log(1 − θ)
where N_1 is the number of heads seen and N_2 is the number of tails seen.

Maximum likelihood (ML) estimate.
Optimize the log-likelihood:
l(D, θ) = N_1 log θ + N_2 log(1 − θ)
Set the derivative to zero:
∂l(D, θ)/∂θ = N_1/θ − N_2/(1 − θ) = 0
Solving:
θ = N_1 / (N_1 + N_2)
ML solution: θ_ML = N_1 / N = N_1 / (N_1 + N_2)

Maximum likelihood estimate. Example
• Assume the