CMU CS 10601 - lecture - D1963963

Home> Schools> Carnegie Mellon University> Computer Science (CS) > CS 10601> lecture

DOC PREVIEW

CMU CS 10601 - lecture

School name Carnegie Mellon University

Course Cs 10601- Introduction to Machine Learning

Pages 52

This preview shows page 1-2-3-25-26-27-28-50-51-52 out of 52 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 52 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Graphical models and Bayesian networks10-601 Machine LearningIndependence• In our density estimation class (and in the Naïve Bayes classifier class) we discussed at length the usefulness of the independence assumption• However, we also mentioned its drawbacksIndependence• Independence allows for easier models, learning and inference• For example, with 3 binary variables we only need 3 parameters rather than 7. • The saving is even greater if we have many more variables … • In many cases it would be useful to assume independence, even if its not the case• Is there any middle ground?Bayesian networks• Bayesian networks are directed graphs with nodes representingrandom variables and edges representing dependency assumptions• Lets use our movie example: We would like to determine the joint probability for length, liked and slept in a movieLoLi SLong?Slept?Liked?Bayesian networks: NotationsLeLi SP(Lo) = 0.5P(Li | Lo) = 0.4P(Li | Lo) = 0.7P(S | Lo) = 0.6P(S | Lo) = 0.2Conditional probability tables (CPTs)Conditional dependencyRandom variablesBayesian networks are directed acyclic graphs.Bayesian networks: NotationsLeLi SP(Lo) = 0.5P(Li | Lo) = 0.4P(Li | Lo) = 0.7P(S | Lo) = 0.6P(S | Lo) = 0.2The Bayesian network below represents the following joint probability distribution:p(Le,Li,S)  P(Le)P(Li | Le)P (S | Le)More generally Bayesian network represent the following joint probability distribution:p(x1xn)  p(xi| Pa(xi))iThe set of parents of xiin the graphConstructing a Bayesian network• How do we go about constructing a network for a specific problem?• Step 1: Identify the random variables• Step 2: Determine the conditional dependencies• Step 3: Populate the CPTsCan be learned from observation data!A example problem• An alarm systemB – Did a burglary occur?E – Did an earthquake occur?A – Did the alarm sound off?M – Mary callsJ – John calls• How do we reconstruct the network for this problem?Factoring joint distributions• Using the chain rule we can always factor a joint distribution as follows:P(A,B,E,J,M) = P(A | B,E,J,M) P(B,E,J,M) =P(A | B,E,J,M) P(B | E,J,M) P(E,J,M) = P(A | B,E,J,M) P(B | E, J,M) P(E | J,M) P(J,M)P(A | B,E,J,M) P(B | E, J,M) P(E | J,M)P(J | M)P(M)• This type of conditional dependencies can also be represented graphically.A Bayesian networkEJ MA BNumber of parameters:A: 2^4B: 2^3E: 4 J: 2M: 1A total of 31 parametersP(A | B,E,J,M) P(B | E, J,M) P(E | J,M)P(J | M)P(M)A better approach• An alarm systemB – Did a burglary occur?E – Did an earthquake occur?A – Did the alarm sound off?M – Mary callsJ – John calls• Lets use our knowledge of the domain!Reconstructing a networkAJ MB EB – Did a burglary occur?E – Did an earthquake occur?A – Did the alarm sound off?M – Mary callsJ – John callsReconstructing a networkAJ MB ENumber of parameters:A: 4B: 1E: 1J: 2M: 2A total of 10 parametersBy relying on domain knowledge we saved 21 parameters!Constructing a Bayesian network: Revisited• Step 1: Identify the random variables• Step 2: Determine the conditional dependencies- Select on ordering of the variables- Add them one at a time- For each new variable X added select the minimal subset of nodes as parents such that X is independent from all other nodes in the current network given its parents.• Step 3: Populate the CPTs- From examples using density estimationReconstructing a networkAJ MB ESuppose we wanted to add a new variable to the network:R – Did the radio announce that there was an earthquake? How should we insert it?RExample: Bayesian networks for cancer detectionExample: Gene expression networkQuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Conditional independenceAJ MB E• Two variables x,y are said to be conditionally independent given a third variable z if p(x,y|z) = p(x|z)p(y|z) • In a Bayesian network a variable is conditionally independent of all other variables given it Markov blanketMarkov blanket: All parent, children's and co-parents of childrenMarkov blankets: ExamplesAJ MB EMarkov blanket for B: E, AMarkov blanket for A: B, E, J, MBayesian network: Inference• Once the network is constructed, we can use algorithms for inferring the values of unobserved variables.• For example, in our previous network the only observed variables are the phone call and the radio announcement. However, what we are really interested in is whether there was a burglary or not.• How can we determine that?Inference• Lets start with a simpler question- How can we compute a joint distribution from the network?- For example, P(B,E,A,J, M)?• Answer:- That’s easy, lets use the networkComputing: P(B,E,A,J, M)AJ MB EP(B)=.05P(E)=.1P(A|B,E) =.95P(A|B,E) = .85P(A|  B,E) =.5P(A|  B,  E) = .05P(J|A) )=.7P(J|A) = .05P(M|A) =.8P(M|A) = .15P(B,E,A,J, M) = P(B)P(E)P(A | B, E) P(J | A)P(M | A)= 0.05*0.9*.85*.7*.2= 0.005355Computing: P(B,E,A,J, M)AJ MB EP(B)=.05P(E)=.1P(A|B,E) )=.95P(A|B,E) = .85P(A|  B,E) )=.5P(A|  B,  E) = .05P(J|A) )=.7P(J|A) = .05P(M|A) )=.8P(M|A) = .15P(B,E,A,J, M) = P(B)P(E)P(A | B, E) P(J | A)P(M | A)= 0.05*0.9*.85*.7*.2= 0.005355We can easily compute a complete joint distribution. What about partial distributions? Conditional distributions?Inference• We are interested in queries of the form:P(B | J,M)• This can also be written as a joint:• How do we compute the new joint?),,(),,(),,(),|(MJBPMJBPMJBPMJBPAJ MB EInference in Bayesian networks• We will discuss three methods:1. Enumeration 2. Variable elimination3. Stochastic inferenceComputing partial joints),,(),,(),,(),|(MJBPMJBPMJBPMJBPSum all instances with these settings (the sum is over the possible assignments to the other two variables, E and A)Computing: P(B,J, M)AJ MB EP(B)=.05P(E)=.1P(A|B,E) )=.95P(A|B,E) = .85P(A|  B,E) )=.5P(A|  B,  E) = .05P(J|A) )=.7P(J|A) = .05P(M|A) )=.8P(M|A) = .15P(B,J, M) = P(B,J, M,A,E)+ P(B,J, M,  A,E) + P(B,J, M,A,  E) + P(B,J, M,  A,  E) =0.0007+0.00001+0.005+0.0003 = 0.00601Computing partial joints),,(),,(),,(),|(MJBPMJBPMJBPMJBPSum all instances with these settings (the sum is over the possible assignments to the other two variables, E and A)• This method can be improved by re-using calculations (similar to dynamic programming)•

View Full Document