
Statistical learning
Chapter 20, Sections 1–3

Outline
♦ Bayesian learning
♦ Maximum a posteriori and maximum likelihood learning
♦ Bayes net learning
  – ML parameter learning with complete data
  – linear regression

Full Bayesian learning

View learning as Bayesian updating of a probability distribution over the hypothesis space.
H is the hypothesis variable, values h1, h2, . . ., prior P(H)
jth observation dj gives the outcome of random variable Dj
training data d = d1, . . . , dN

Given the data so far, each hypothesis has a posterior probability:
    P(hi|d) = α P(d|hi) P(hi)
where P(d|hi) is called the likelihood.

Predictions use a likelihood-weighted average over the hypotheses:
    P(X|d) = Σi P(X|d, hi) P(hi|d) = Σi P(X|hi) P(hi|d)

No need to pick one best-guess hypothesis!

Example

Suppose there are five kinds of bags of candies:
  10% are h1: 100% cherry candies
  20% are h2: 75% cherry candies + 25% lime candies
  40% are h3: 50% cherry candies + 50% lime candies
  20% are h4: 25% cherry candies + 75% lime candies
  10% are h5: 100% lime candies

Then we observe candies drawn from some bag:
What kind of bag is it? What flavour will the next candy be?

Posterior probability of hypotheses

[Figure: posterior probabilities P(h1|d), . . . , P(h5|d) plotted against the number of samples in d.]

Prediction probability

[Figure: P(next candy is lime | d) plotted against the number of samples in d.]

MAP approximation

Summing over the hypothesis space is often intractable
(e.g., 18,446,744,073,709,551,616 Boolean functions of 6 attributes).

Maximum a posteriori (MAP) learning: choose hMAP maximizing P(hi|d),
i.e., maximize P(d|hi) P(hi), or equivalently log P(d|hi) + log P(hi).

The log terms can be viewed as (the negative of)
    bits to encode data given hypothesis + bits to encode hypothesis.
This is the basic idea of minimum description length (MDL) learning.

For deterministic hypotheses, P(d|hi) is 1 if consistent, 0 otherwise
⇒ MAP = simplest consistent hypothesis (cf. science).

ML approximation

For large data sets, the prior becomes irrelevant.

Maximum likelihood (ML) learning: choose hML maximizing P(d|hi),
i.e., simply get the best fit to the data; identical to MAP for a uniform prior
(which is reasonable if all hypotheses are of the same complexity).

ML is the "standard" (non-Bayesian) statistical learning method.
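The following Python sketch (not part of the original slides) works the candy example end to end: it performs the Bayesian update for each hypothesis, computes the likelihood-weighted prediction for the next candy, and picks the MAP hypothesis. The priors and per-hypothesis lime probabilities are taken from the example above; the function names and the all-lime observation sequence are just illustrative choices.

```python
# Full Bayesian learning on the candy-bag example (a minimal sketch).
# Hypotheses h1..h5 with priors and per-candy lime probabilities from the slide.

priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h_i)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | h_i)

def posterior(observations):
    """P(h_i | d) for a sequence of 'lime'/'cherry' observations (i.i.d. given h_i)."""
    unnorm = []
    for prior, pl in zip(priors, p_lime):
        likelihood = 1.0
        for obs in observations:
            likelihood *= pl if obs == 'lime' else (1.0 - pl)
        unnorm.append(prior * likelihood)   # P(d | h_i) P(h_i)
    alpha = 1.0 / sum(unnorm)               # normalization constant
    return [alpha * u for u in unnorm]

def predict_lime(observations):
    """P(next candy is lime | d) = sum_i P(lime | h_i) P(h_i | d)."""
    post = posterior(observations)
    return sum(pl * p for pl, p in zip(p_lime, post))

# Reproduce the curves on the posterior and prediction slides:
# observe 0..10 lime candies in a row.
for n in range(11):
    d = ['lime'] * n
    print(n, [round(p, 3) for p in posterior(d)], round(predict_lime(d), 3))

# MAP approximation: commit to the single most probable hypothesis.
d = ['lime'] * 3
h_map = max(range(5), key=lambda i: posterior(d)[i])
print('h_MAP after 3 limes: h%d' % (h_map + 1))   # h5 overtakes the others
```

Running it shows the behaviour plotted on the slides: after only a few lime candies the posterior mass shifts toward h5, and the predicted probability of the next candy being lime climbs toward 1.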
ML parameter learning in Bayes nets

Bag from a new manufacturer; fraction θ of cherry candies?

[Bayes net: a single node Flavor, with P(F = cherry) = θ.]

Any θ is possible: continuum of hypotheses hθ.
θ is a parameter for this simple (binomial) family of models.

Suppose we unwrap N candies, c cherries and ℓ = N − c limes.
These are i.i.d. (independent, identically distributed) observations, so
    P(d|hθ) = ∏_{j=1}^{N} P(dj|hθ) = θ^c (1 − θ)^ℓ

Maximize this w.r.t. θ, which is easier for the log-likelihood:
    L(d|hθ) = log P(d|hθ) = Σ_{j=1}^{N} log P(dj|hθ) = c log θ + ℓ log(1 − θ)
    dL(d|hθ)/dθ = c/θ − ℓ/(1 − θ) = 0  ⇒  θ = c/(c + ℓ) = c/N

Seems sensible, but causes problems with 0 counts!

Multiple parameters

Red/green wrapper depends probabilistically on flavor:

[Bayes net: Flavor → Wrapper, with P(F = cherry) = θ and P(W = red | F): θ1 for cherry, θ2 for lime.]

Likelihood for, e.g., a cherry candy in a green wrapper:
    P(F = cherry, W = green | hθ,θ1,θ2)
      = P(F = cherry | hθ,θ1,θ2) P(W = green | F = cherry, hθ,θ1,θ2)
      = θ · (1 − θ1)

N candies, rc red-wrapped cherry candies, etc.:
    P(d | hθ,θ1,θ2) = θ^c (1 − θ)^ℓ · θ1^rc (1 − θ1)^gc · θ2^rℓ (1 − θ2)^gℓ
    L = [c log θ + ℓ log(1 − θ)]
        + [rc log θ1 + gc log(1 − θ1)]
        + [rℓ log θ2 + gℓ log(1 − θ2)]

Multiple parameters contd.

Derivatives of L contain only the relevant parameter:
    ∂L/∂θ  = c/θ − ℓ/(1 − θ)     = 0  ⇒  θ  = c/(c + ℓ)
    ∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0  ⇒  θ1 = rc/(rc + gc)
    ∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0  ⇒  θ2 = rℓ/(rℓ + gℓ)

With complete data, parameters can be learned separately.

Example: linear Gaussian model

[Figures: the conditional density P(y|x) for a linear Gaussian model, and data points with the maximum-likelihood straight-line fit.]

Maximizing P(y|x) = (1/(√(2π) σ)) e^{−(y − (θ1 x + θ2))² / (2σ²)} w.r.t. θ1, θ2
    = minimizing E = Σ_{j=1}^{N} (yj − (θ1 xj + θ2))²

That is, minimizing the sum of squared errors gives the ML solution
for a linear fit, assuming Gaussian noise of fixed variance.

Summary

Full Bayesian learning gives best possible predictions but is intractable.
MAP learning balances complexity with accuracy on training data.
Maximum likelihood assumes uniform prior, OK for large data sets.

1. Choose a parameterized family of models to describe the data
   (requires substantial insight and sometimes new models)
2. Write down the likelihood of the data as a function of the parameters
   (may require summing over hidden variables, i.e., inference)
3. Write down the derivative of the log likelihood w.r.t. each parameter
4. Find the parameter values such that the derivatives are zero
   (may be hard/impossible; modern optimization techniques help)
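As a companion to the parameter-learning slides, here is a short Python sketch (not from the original deck). It computes the count-ratio ML estimates θ = c/(c + ℓ), θ1 = rc/(rc + gc), θ2 = rℓ/(rℓ + gℓ) for the candy/wrapper model, and then checks the linear Gaussian claim by recovering a line from noisy data using the closed-form least-squares solution, i.e., the values of slope and intercept that zero the derivatives of E. The counts and the synthetic data are hypothetical, chosen only for illustration.

```python
import random

# --- Candy/wrapper model: with complete data, the ML estimates are count ratios. ---
# Hypothetical counts for N = 100 unwrapped candies:
r_c, g_c = 30, 10      # red-wrapped / green-wrapped cherry candies (rc, gc)
r_l, g_l = 15, 45      # red-wrapped / green-wrapped lime candies (rl, gl)
c, l = r_c + g_c, r_l + g_l

theta  = c / (c + l)          # P(F = cherry)       = c / N  -> 0.4
theta1 = r_c / (r_c + g_c)    # P(W = red | cherry)          -> 0.75
theta2 = r_l / (r_l + g_l)    # P(W = red | lime)            -> 0.25
print(theta, theta1, theta2)

# --- Linear Gaussian model: the ML fit is the least-squares line. ---
# Generate y = 2x + 1 plus Gaussian noise of fixed variance, then recover the
# slope and intercept from the conditions dE/dtheta1 = dE/dtheta2 = 0.
random.seed(0)
xs = [random.random() for _ in range(200)]
ys = [2 * x + 1 + random.gauss(0, 0.1) for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
print(round(slope, 2), round(intercept, 2))   # should land close to 2 and 1
```

With 200 points and small noise, the recovered slope and intercept should land very close to the generating values, illustrating that maximizing the Gaussian likelihood and minimizing the sum of squared errors pick out the same line.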

