WSU CSE 6363 - Bayesian Learning

Bayesian Learning

• Bayes Theorem
• MAP, ML hypotheses
• MAP learners
• Minimum description length principle
• Bayes optimal classifier
• Naive Bayes learner
• Example: Learning over text data
• Bayesian belief networks
• Expectation Maximization algorithm


Two Roles for Bayesian Methods

Provides practical learning algorithms:
• Naive Bayes learning
• Bayesian belief network learning
• Combine prior knowledge (prior probabilities) with observed data
• Requires prior probabilities

Provides useful conceptual framework:
• Provides "gold standard" for evaluating other learning algorithms
• Additional insight into Occam's razor


Bayes Theorem

    P(h|D) = P(D|h) P(h) / P(D)

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D
• P(D|h) = probability of D given h


Choosing Hypotheses

    P(h|D) = P(D|h) P(h) / P(D)

Generally want the most probable hypothesis given the training data.

Maximum a posteriori hypothesis h_MAP:

    h_MAP = argmax_{h ∈ H} P(h|D)
          = argmax_{h ∈ H} P(D|h) P(h) / P(D)
          = argmax_{h ∈ H} P(D|h) P(h)

If assume P(h_i) = P(h_j), then can further simplify and choose the maximum likelihood (ML) hypothesis:

    h_ML = argmax_{h_i ∈ H} P(D|h_i)


Bayes Theorem: Example

Does the patient have cancer or not?

    A patient takes a lab test and the result comes back positive. The test
    returns a correct positive result in only 98% of the cases in which the
    disease is actually present, and a correct negative result in only 97% of
    the cases in which the disease is not present. Furthermore, 0.008 of the
    entire population have this cancer.

    P(cancer) =            P(¬cancer) =
    P(+|cancer) =          P(−|cancer) =
    P(+|¬cancer) =         P(−|¬cancer) =


Basic Formulas for Probabilities

• Product rule: probability P(A ∧ B) of a conjunction of two events A and B:
      P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
• Sum rule: probability of a disjunction of two events A and B:
      P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
• Theorem of total probability: if events A_1, ..., A_n are mutually exclusive with Σ_{i=1}^{n} P(A_i) = 1, then
      P(B) = Σ_{i=1}^{n} P(B|A_i) P(A_i)


Brute Force MAP Hypothesis Learner

1. For each hypothesis h in H, calculate the posterior probability
       P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis h_MAP with the highest posterior probability
       h_MAP = argmax_{h ∈ H} P(h|D)
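As a concrete illustration of the brute force MAP learner, the minimal sketch below applies Bayes theorem to the cancer example from the earlier slide, treating {cancer, ¬cancer} as the hypothesis space and a positive test result as the data D. The probabilities are the ones stated in that example; the variable names and dictionary layout are just illustrative choices.

```python
# Minimal sketch: brute-force MAP learning on the cancer / lab-test example.
# Numbers stated on the slide: P(cancer) = 0.008, P(+|cancer) = 0.98,
# P(+|not cancer) = 1 - 0.97 = 0.03.

priors = {"cancer": 0.008, "not_cancer": 0.992}          # P(h)
likelihood_pos = {"cancer": 0.98, "not_cancer": 0.03}    # P(+ | h)

# Unnormalized posteriors P(+|h) P(h) for the observed data D = "+"
unnormalized = {h: likelihood_pos[h] * priors[h] for h in priors}

# P(D) via the theorem of total probability
p_data = sum(unnormalized.values())

posteriors = {h: v / p_data for h, v in unnormalized.items()}
h_map = max(posteriors, key=posteriors.get)

print(posteriors)   # roughly {'cancer': 0.21, 'not_cancer': 0.79}
print(h_map)        # 'not_cancer' -- the MAP hypothesis despite the positive test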
Relation to Concept Learning

Consider our usual concept learning task:
• instance space X, hypothesis space H, training examples D
• consider the FindS learning algorithm (outputs the most specific hypothesis from the version space VS_{H,D})

What would Bayes rule produce as the MAP hypothesis?
Does FindS output a MAP hypothesis?


Relation to Concept Learning

Assume a fixed set of instances ⟨x_1, ..., x_m⟩.
Assume D is the set of classifications D = ⟨c(x_1), ..., c(x_m)⟩.

Choose P(D|h):
• P(D|h) = 1 if h is consistent with D
• P(D|h) = 0 otherwise

Choose P(h) to be the uniform distribution:
• P(h) = 1/|H| for all h in H

Then:

    P(h|D) = 1/|VS_{H,D}|   if h is consistent with D
           = 0              otherwise


Evolution of Posterior Probabilities

[Figure: three panels (a), (b), (c) plotting P(h), P(h|D1), and P(h|D1,D2) over the hypothesis space, with the posterior concentrating on fewer hypotheses as training data accumulate.]


Characterizing Learning Algorithms by Equivalent MAP Learners

[Figure: an inductive system (the Candidate Elimination Algorithm, taking training examples D and hypothesis space H and producing output hypotheses) shown as equivalent to a Bayesian inference system (a brute force MAP learner with the prior assumptions made explicit: P(h) uniform, P(D|h) = 1 if consistent, = 0 if inconsistent).]


Learning A Real Valued Function

[Figure: noisy training examples (x, y) of a real-valued target function f, with errors e and the maximum likelihood hypothesis h_ML fit to them.]

Consider any real-valued target function f.
Training examples ⟨x_i, d_i⟩, where d_i is a noisy training value:
• d_i = f(x_i) + e_i
• e_i is a random variable (noise) drawn independently for each x_i according to some Gaussian distribution with mean = 0

Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:

    h_ML = argmin_{h ∈ H} Σ_{i=1}^{m} (d_i − h(x_i))²

Derivation:

    h_ML = argmax_{h ∈ H} p(D|h)
         = argmax_{h ∈ H} Π_{i=1}^{m} p(d_i|h)
         = argmax_{h ∈ H} Π_{i=1}^{m} (1/√(2πσ²)) e^{−½((d_i − h(x_i))/σ)²}

Maximize the natural log of this instead:

    h_ML = argmax_{h ∈ H} Σ_{i=1}^{m} [ ln(1/√(2πσ²)) − ½((d_i − h(x_i))/σ)² ]
         = argmax_{h ∈ H} Σ_{i=1}^{m} −½((d_i − h(x_i))/σ)²
         = argmax_{h ∈ H} Σ_{i=1}^{m} −(d_i − h(x_i))²
         = argmin_{h ∈ H} Σ_{i=1}^{m} (d_i − h(x_i))²


Learning to Predict Probabilities

Consider predicting survival probability from patient data.
Training examples ⟨x_i, d_i⟩, where d_i is 1 or 0.
Want to train a neural network to output a probability given x_i (not a 0 or 1).

In this case can show

    h_ML = argmax_{h ∈ H} Σ_{i=1}^{m} [ d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i)) ]

Weight update rule for a sigmoid unit:

    w_jk ← w_jk + Δw_jk

where

    Δw_jk = η Σ_{i=1}^{m} (d_i − h(x_i)) x_ijk
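The real-valued-function result above (Gaussian noise ⇒ maximum likelihood = least squared error) can be sanity-checked numerically. The sketch below assumes a linear hypothesis class and synthetic data; the target function, noise level, and sample size are made-up illustrative choices.

```python
import numpy as np

# Minimal sketch: with d_i = f(x_i) + Gaussian noise, the maximum likelihood
# hypothesis is the least-squares fit. Here H = linear functions h(x) = w1*x + w0.

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
d = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)   # d_i = f(x_i) + e_i

# Least-squares solution = h_ML under zero-mean Gaussian noise
A = np.column_stack([x, np.ones_like(x)])
w, *_ = np.linalg.lstsq(A, d, rcond=None)

print("h_ML weights:", w)   # close to the true slope 2.0 and intercept 1.0
```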
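The weight update rule for probability prediction is batch gradient ascent on the log-likelihood above. Here is a minimal sketch for a single sigmoid unit (so the unit index k is dropped and x_ij is the j-th input of example i); the data, learning rate, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

# Minimal sketch of the update Δw_j = η Σ_i (d_i − h(x_i)) x_ij for a single
# sigmoid unit h(x) = σ(w · x), trained on made-up 0/1 data.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(size=100), np.ones(100)])  # one feature + bias
d = (X[:, 0] > 0).astype(float)                            # 0/1 targets

w = np.zeros(X.shape[1])
eta = 0.01
for _ in range(200):                   # gradient ascent on the log-likelihood
    h = sigmoid(X @ w)
    w += eta * X.T @ (d - h)           # Δw_j = η Σ_i (d_i − h(x_i)) x_ij

print(sigmoid(X[:5] @ w))              # outputs are probabilities, not 0/1 labels
```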
Minimum Description Length Principle

Occam's razor: prefer the shortest hypothesis.

MDL: prefer the hypothesis h that minimizes

    h_MDL = argmin_{h ∈ H} L_C1(h) + L_C2(D|h)

where L_C(x) is the description length of x under encoding C.

Example: H = decision trees, D = training data labels
• L_C1(h) is the number of bits to describe tree h
• L_C2(D|h) is the number of bits to describe D given h
  – Note L_C2(D|h) = 0 if the examples are classified perfectly by h; need only describe the exceptions
• Hence h_MDL trades off tree size for training errors

Relation to the MAP hypothesis:

    h_MAP = argmax_{h ∈ H} P(D|h) P(h)
          = argmax_{h ∈ H} [ log2 P(D|h) + log2 P(h) ]
          = argmin_{h ∈ H} [ −log2 P(D|h) − log2 P(h) ]      (1)

Interesting fact from information theory: the optimal (shortest expected coding length) code for an event with probability p is −log2 p bits.

So interpret (1):
• −log2 P(h) is the length of h under the optimal code
• −log2 P(D|h) is the length of D given h under the optimal code

→ prefer the hypothesis that minimizes

    length(h) + length(misclassifications)


Most Probable Classification of New Instances

So far we've sought the most probable hypothesis given the data D (i.e., h_MAP).

Given a new instance x, what is its most probable classification?
• h_MAP(x) is not the most probable classification!

Consider:
• Three possible hypotheses: P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3
• Given new instance x: h1(x) = +, h2(x) = −, h3(x) = −
• What's the most probable classification of x?


Bayes Optimal Classifier

Bayes optimal classification:

    argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j|h_i) P(h_i|D)

Example:

    P(h1|D) = .4,  P(−|h1) = 0,  P(+|h1) = 1
    P(h2|D) = .3,  P(−|h2) = 1,  P(+|h2) = 0
    P(h3|D) = .3,  P(−|h3) = 1,  P(+|h3) = 0

therefore

    Σ_{h_i ∈ H} P(+|h_i) P(h_i|D) = .4
    Σ_{h_i ∈ H} P(−|h_i) P(h_i|D) = .6

and

    argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j|h_i) P(h_i|D) = −


Gibbs
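Returning to the Bayes optimal classifier example above, the minimal sketch below spells out the weighted vote Σ_i P(v_j|h_i) P(h_i|D). The dictionaries simply encode the numbers given on the slide; the function name and layout are illustrative.

```python
# Minimal sketch of Bayes optimal classification for the three-hypothesis example:
# each hypothesis votes for a label with weight equal to its posterior P(h|D).

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h_i | D)
p_label_given_h = {                              # P(v_j | h_i)
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(labels=("+", "-")):
    # argmax over labels v_j of sum_i P(v_j | h_i) P(h_i | D)
    scores = {v: sum(p_label_given_h[h][v] * posteriors[h] for h in posteriors)
              for v in labels}
    return max(scores, key=scores.get), scores

print(bayes_optimal())   # '-' wins with scores {'+': 0.4, '-': 0.6}, unlike h_MAP(x) = '+'
```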

