
CMSC 828G Principles of Data Mining -- Lecture #16
• Readings: handouts
• Today's lecture: Learning Bayesian networks (slides courtesy of Nir Friedman)
• Upcoming dates: Project Progress Report due Tuesday 4/9


Learning Bayesian Networks


Learning Bayesian networks
• The inducer takes data plus prior information as input and outputs a Bayesian network: a structure together with its conditional probability tables (CPTs).
[Figure: inducer mapping data + prior information to a network over the variables E, R, B, A, C, with a CPT P(A | E, B) whose rows hold the entries .9/.1, .7/.3, .99/.01, .8/.2]

Known Structure -- Complete Data
• Network structure is specified
  – The inducer needs to estimate the parameters
• Data does not contain missing values
[Figure: complete records over E, B, A, such as <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>, fed to the inducer, which fills in the unknown CPT entries of the fixed structure]

Unknown Structure -- Complete Data
• Network structure is not specified
  – The inducer needs to select the arcs and estimate the parameters
• Data does not contain missing values

Known Structure -- Incomplete Data
• Network structure is specified
• Data contains missing values, e.g. <Y,?,Y> or <N,Y,?>
  – We consider assignments to the missing values

Known Structure / Complete Data
• Given a network structure G, and a choice of parametric family for P(X_i | Pa_i),
• learn the parameters of the network.
Goal: construct a network that is "closest" to the probability distribution that generated the data.

Learning Parameters for a Bayesian Network
• Example network: E and B are parents of A, and A is the parent of C.
• Training data has the form
    D = [ E[1] B[1] A[1] C[1]
                  ...
          E[M] B[M] A[M] C[M] ]
• Since we assume i.i.d. samples, the likelihood function is
    L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta)
• By the definition of the network, this factors as
    L(\Theta : D) = \prod_m P(E[m] : \Theta) P(B[m] : \Theta) P(A[m] | B[m], E[m] : \Theta) P(C[m] | A[m] : \Theta)

General Bayesian Networks
Generalizing to any Bayesian network:
    L(\Theta : D) = \prod_m P(x_1[m], ..., x_n[m] : \Theta)                (i.i.d. samples)
                  = \prod_m \prod_i P(x_i[m] | Pa_i[m] : \Theta_i)         (network factorization)
                  = \prod_i \prod_m P(x_i[m] | Pa_i[m] : \Theta_i)
                  = \prod_i L_i(\Theta_i : D)
• The likelihood decomposes according to the structure of the network.

General Bayesian Networks (Cont.)
Decomposition ⇒ independent estimation problems:
if the parameters \Theta_i for each family are not related, then they can be estimated independently of each other.

Likelihood for Multinomial Networks
• When we assume that P(X_i | Pa_i) is multinomial, we get a further decomposition:
    L_i(\Theta_i : D) = \prod_{pa_i} \prod_{x_i} \theta_{x_i | pa_i}^{N(x_i, pa_i)}
• For each value pa_i of the parents of X_i we get an independent multinomial problem.
• The MLE is
    \hat{\theta}_{x_i | pa_i} = N(x_i, pa_i) / N(pa_i)
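To make the MLE concrete, here is a minimal sketch of multinomial parameter estimation from complete data. The data representation (a list of dicts) and the function name are illustrative assumptions, not from the slides.

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """MLE of P(child | parents) from complete data:
    theta_hat(x | pa) = N(x, pa) / N(pa)."""
    joint = Counter()  # N(x, pa): counts of (parent values, child value)
    marg = Counter()   # N(pa): counts of the parent values alone
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(pa, row[child])] += 1
        marg[pa] += 1
    return {(pa, x): n / marg[pa] for (pa, x), n in joint.items()}

# Example: estimate P(A | E, B) from four complete records over E, B, A.
data = [
    {"E": "Y", "B": "N", "A": "N"},
    {"E": "Y", "B": "Y", "A": "Y"},
    {"E": "N", "B": "N", "A": "Y"},
    {"E": "N", "B": "Y", "A": "Y"},
]
print(mle_cpt(data, child="A", parents=["E", "B"]))
```

Because the likelihood decomposes by family, the same function can be applied to each node of the network independently, exactly as the decomposition argument above says.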
Bayesian Inference (cont.)
Prediction is inference in the network in which \Theta is a parent of X[1], X[2], ..., X[M], X[M+1]:
    P(x[M+1] | x[1], ..., x[M]) = \int P(x[M+1] | \Theta, x[1], ..., x[M]) P(\Theta | x[1], ..., x[M]) d\Theta
                                = \int P(x[M+1] | \Theta) P(\Theta | x[1], ..., x[M]) d\Theta
where
    P(\Theta | x[1], ..., x[M]) = P(x[1], ..., x[M] | \Theta) P(\Theta) / P(x[1], ..., x[M])
    (posterior = likelihood × prior / probability of the data)

Dirichlet Priors
• Recall that the likelihood function is
    L(\Theta : D) = \prod_{k=1}^K \theta_k^{N_k}
• A Dirichlet prior with hyperparameters \alpha_1, ..., \alpha_K is defined as
    P(\Theta) \propto \prod_{k=1}^K \theta_k^{\alpha_k - 1}    for legal \theta_1, ..., \theta_K
• Then the posterior has the same form, with hyperparameters \alpha_1 + N_1, ..., \alpha_K + N_K:
    P(\Theta | D) \propto P(\Theta) P(D | \Theta) \propto \prod_k \theta_k^{\alpha_k - 1} \prod_k \theta_k^{N_k} = \prod_k \theta_k^{\alpha_k + N_k - 1}

Dirichlet Priors (cont.)
• We can compute the prediction on a new event in closed form.
• If P(\Theta) is Dirichlet with hyperparameters \alpha_1, ..., \alpha_K, then
    P(X[1] = k) = \int \theta_k P(\Theta) d\Theta = \alpha_k / \sum_l \alpha_l
• Since the posterior is also Dirichlet, we get
    P(X[M+1] = k | D) = \int \theta_k P(\Theta | D) d\Theta = (\alpha_k + N_k) / \sum_l (\alpha_l + N_l)

Dirichlet Priors -- Example
[Figure: densities of Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(2, 2), and Dirichlet(5, 5) over [0, 1]]

Effect of Priors (cont.)
• In real data, Bayesian estimates are less sensitive to noise in the data.
[Figure: P(X = 1 | D) as a function of the sample size N for the MLE and for Dirichlet(.5, .5), (1, 1), (5, 5), and (10, 10) priors, estimated from a noisy sequence of 0/1 toss results]

Bayesian Prediction (cont.)
• We can compute the posterior for each multinomial \theta_{X_i | pa_i} independently.
  – The posterior is Dirichlet with parameters \alpha(X_i = 1 | pa_i) + N(X_i = 1 | pa_i), ..., \alpha(X_i = k | pa_i) + N(X_i = k | pa_i)
• The predictive distribution is then represented by the parameters
    \tilde{\theta}_{x_i | pa_i} = (\alpha(x_i, pa_i) + N(x_i, pa_i)) / (\alpha(pa_i) + N(pa_i))

Learning Parameters: Case Study
Experiment:
• Sample a stream of instances from the alarm network
• Learn parameters using
  – the MLE estimator
  – Bayesian estimators with uniform priors of different strengths
[Figure: KL divergence between the learned and true networks as a function of the number of samples M, for the MLE and for Bayes with uniform priors of strength M' = 5, 10, 20, 50]

Learning Parameters: Summary
• Estimation relies on sufficient statistics; for multinomials these are the counts N(x_i, pa_i)
• Parameter estimates:
    MLE:                  \hat{\theta}_{x_i | pa_i} = N(x_i, pa_i) / N(pa_i)
    Bayesian (Dirichlet): \tilde{\theta}_{x_i | pa_i} = (\alpha(x_i, pa_i) + N(x_i, pa_i)) / (\alpha(pa_i) + N(pa_i))
• Bayesian methods also require a choice of priors
• MLE and Bayesian estimates are asymptotically equivalent and consistent
• Both can be implemented in an on-line manner by accumulating sufficient statistics
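The Bayesian predictive formula above is count smoothing. Here is a minimal sketch under the assumption of a symmetric Dirichlet prior, i.e. the same pseudo-count alpha for every child value in every parent context, so that \alpha(pa_i) = K * alpha; the function name and data layout are illustrative, not from the slides.

```python
from collections import Counter

def dirichlet_cpt(data, child, parents, child_values, alpha=1.0):
    """Bayesian predictive estimate of P(child | parents) under a
    symmetric Dirichlet(alpha, ..., alpha) prior in each parent context:
    theta_tilde(x | pa) = (alpha + N(x, pa)) / (K * alpha + N(pa))."""
    joint, marg = Counter(), Counter()
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(pa, row[child])] += 1
        marg[pa] += 1
    K = len(child_values)
    return {(pa, x): (alpha + joint[(pa, x)]) / (K * alpha + marg[pa])
            for pa in marg for x in child_values}
```

With alpha = 1 this is Laplace smoothing; unlike the MLE it never assigns probability zero to an unseen value, and as N(pa) grows it converges to the MLE, which is the asymptotic equivalence noted in the summary.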
Learning Structure from Complete Data


Why Struggle for Accurate Structure?
[Figure: the true network (Earthquake, Burglary -> Alarm Set -> Sound) alongside one variant with an added arc and one with a missing arc]
• Adding an arc
  – Increases the number of parameters to be fitted
  – Makes wrong assumptions about causality and domain structure
• Missing an arc
  – Cannot be compensated for by accurate fitting of the parameters
  – Also misses the causality and domain structure

Approaches to Learning Structure
• Constraint based
  – Perform tests of conditional independence
  – Search for a network that is consistent with the observed dependencies and independencies
• Pros & Cons
  + Intuitive; follows closely the construction of BNs
  + Separates structure learning from the form of the independence tests
  − Sensitive to errors in individual tests
  − Computationally hard

Approaches to Learning Structure
• Score based
  – Define a score that evaluates how well the (in)dependencies in a structure match the observations
  – Search for a structure that maximizes the score
• Pros & Cons
  + Statistically motivated
  + Can make compromises
  + Takes the structure of the conditional probabilities into account
  − Computationally hard
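To make the score-based approach concrete, here is a minimal sketch of one simple decomposable score: the log-likelihood of the data under a candidate structure with MLE parameters. The data and structure representations are my own assumptions, not from the slides; note that the raw likelihood score always favors denser graphs, so practical scores add a complexity penalty (e.g., BIC/MDL) before searching.

```python
import math
from collections import Counter

def loglik_score(data, structure):
    """Log-likelihood of complete data under a candidate structure with
    MLE parameters: sum over families of N(x, pa) * log(N(x, pa) / N(pa)).

    structure: dict mapping each variable to a tuple of its parents."""
    score = 0.0
    for child, parents in structure.items():
        joint, marg = Counter(), Counter()
        for row in data:
            pa = tuple(row[p] for p in parents)
            joint[(pa, row[child])] += 1
            marg[pa] += 1
        score += sum(n * math.log(n / marg[pa])
                     for (pa, _), n in joint.items())
    return score

# Compare two candidate structures on the same data, e.g.:
#   loglik_score(data, {"E": (), "B": (), "A": ("E", "B")})
#   loglik_score(data, {"E": (), "B": (), "A": ()})
```

Because the score decomposes over families, a local search can re-score only the families touched by each candidate arc change, which is what makes score-based search feasible despite the hardness of the overall problem.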

