CS 388: Natural Language Processing
Discriminative Training and Conditional Random Fields (CRFs) for Sequence Labeling
Raymond J. Mooney, University of Texas at Austin

Slide outline: Joint Distribution; Probabilistic Classification; Bayesian Categorization; Bayesian Categorization (cont.); Naïve Bayes Generative Model; Naïve Bayes Inference Problem; Naïve Bayesian Categorization; Generative vs. Discriminative Models; Logistic Regression; Logistic Regression as a Log-Linear Model; Logistic Regression Training; Slide 13; Preventing Overfitting in Logistic Regression; Multinomial Logistic Regression (MaxEnt); Graphical Models; Bayesian Networks; Conditional Probability Tables; Joint Distributions for Bayes Nets; Naïve Bayes as a Bayes Net; Markov Networks; Distribution for a Markov Network; Sample Markov Network; Logistic Regression as a Markov Net; Generative vs. Discriminative Sequence Labeling Models; Classification; Sequence Labeling; Simple Linear Chain CRF Features; Conditional Distribution for Linear Chain CRF; Adding Token Features to a CRF; Features in POS Tagging; Enhanced Linear Chain CRF (standard approach); Supervised Learning (Parameter Estimation); Sequence Tagging (Inference); Skip-Chain CRFs; CRF Results; CRF Summary

Joint Distribution
• The joint probability distribution for a set of random variables X1, …, Xn gives the probability of every combination of values, P(X1, …, Xn). If all variables are discrete with v values each, this is an n-dimensional array with v^n entries, and all v^n values must sum to 1.
• The marginal probability of any conjunction (an assignment of values to some subset of the variables) can be calculated by summing the appropriate subset of values from the joint distribution.
• Therefore, all conditional probabilities can also be calculated.

  Example joint distribution over Color, Shape, and Category:

  Category = positive:     circle   square
                red         0.20     0.02
                blue        0.02     0.01

  Category = negative:     circle   square
                red         0.05     0.30
                blue        0.20     0.20

  P(red ∧ circle) = 0.20 + 0.05 = 0.25
  P(positive | red ∧ circle) = P(positive ∧ red ∧ circle) / P(red ∧ circle) = 0.20 / 0.25 = 0.80
  P(red) = 0.20 + 0.02 + 0.05 + 0.30 = 0.57

Probabilistic Classification
• Let Y be the random variable for the class, which takes values {y1, y2, …, ym}.
• Let X be the random variable describing an instance, consisting of a vector of values for n features <X1, X2, …, Xn>; let xk be a possible vector value for X and xij a possible value for Xi.
• For classification, we need to compute P(Y=yi | X=xk) for i = 1…m.
• This could be done using the joint distribution, but that requires estimating an exponential number of parameters.

Bayesian Categorization
• Determine the category of xk by computing, for each yi:
  P(Y=yi | X=xk) = P(Y=yi) P(X=xk | Y=yi) / P(X=xk)
• P(X=xk) can be determined because the categories are complete and disjoint:
  Σ_{i=1..m} P(Y=yi | X=xk) = Σ_{i=1..m} P(Y=yi) P(X=xk | Y=yi) / P(X=xk) = 1
  P(X=xk) = Σ_{i=1..m} P(Y=yi) P(X=xk | Y=yi)

Bayesian Categorization (cont.)
• Need to know:
  – Priors: P(Y=yi)
  – Conditionals: P(X=xk | Y=yi)
• P(Y=yi) is easily estimated from data: if ni of the examples in D are in class yi, then P(Y=yi) = ni / |D|.
• There are too many possible instances (e.g. 2^n for binary features) to estimate all P(X=xk | Y=yi) directly.
• We still need to make some sort of independence assumption about the features to make learning tractable.

Naïve Bayes Generative Model
[Figure: a Category node selects Positive or Negative, and each class has its own bags of Size (sm/med/lg), Color (red/blue/grn), and Shape (circ/sqr/tri) values from which an example's features are drawn.]

Naïve Bayes Inference Problem
[Figure: the same generative model run in reverse; the observed instance lg, red, circ is shown with its category unknown (??), to be inferred.]

Naïve Bayesian Categorization
• If we assume that the features of an instance are independent given the category (conditionally independent):
  P(X | Y) = P(X1, X2, …, Xn | Y) = Π_{i=1..n} P(Xi | Y)
• Then we only need to know P(Xi | Y) for each possible pairing of a feature value and a category.
• If Y and all Xi are binary, this requires specifying only 2n parameters:
  – P(Xi=true | Y=true) and P(Xi=true | Y=false) for each Xi
  – P(Xi=false | Y) = 1 − P(Xi=true | Y)
• Compare this to specifying 2^n parameters without any independence assumptions.
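To make the Naïve Bayesian Categorization slide concrete, here is a minimal Python sketch of the computation P(Y | X) ∝ P(Y) · Π_i P(Xi | Y), applied to the inference-problem instance <lg, red, circ>. The toy training set and the add-one smoothing are assumptions made for this sketch; the slides' own example data does not survive the text extraction.

from collections import Counter, defaultdict

# Hypothetical training data in the spirit of the slides' size/color/shape example
# (the actual slide data is not recoverable from the text preview).
train = [
    (("sm", "red", "circ"), "pos"),
    (("med", "red", "circ"), "pos"),
    (("lg",  "red", "circ"), "pos"),
    (("med", "blue", "sqr"), "neg"),
    (("lg",  "grn", "tri"), "neg"),
    (("sm",  "blue", "tri"), "neg"),
]

feature_values = [("sm", "med", "lg"), ("red", "blue", "grn"), ("circ", "sqr", "tri")]

# Estimate priors P(Y) and conditionals P(Xi | Y); add-one smoothing is an
# assumption added here, not something the slide specifies.
prior = Counter(y for _, y in train)
cond = defaultdict(Counter)                       # cond[(i, y)][value] = count
for x, y in train:
    for i, v in enumerate(x):
        cond[(i, y)][v] += 1

def posterior(x):
    """Return P(Y=y | X=x) for each class, via Bayes rule + naive independence."""
    scores = {}
    for y, ny in prior.items():
        p = ny / len(train)                        # P(Y=y)
        for i, v in enumerate(x):
            k = len(feature_values[i])
            p *= (cond[(i, y)][v] + 1) / (ny + k)  # smoothed P(Xi=v | Y=y)
        scores[y] = p                              # proportional to P(y | x)
    z = sum(scores.values())                       # normalizer, plays the role of P(X=x)
    return {y: s / z for y, s in scores.items()}

print(posterior(("lg", "red", "circ")))            # the query instance from the inference slide

On this made-up data the instance comes out strongly positive, since the positive examples are dominated by red circles.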
Generative vs. Discriminative Models
• Generative models are not directly designed to maximize the performance of classification; they model the complete joint distribution P(X, Y).
• Classification is then done using Bayesian inference given the generative model of the joint distribution.
• But a generative model can also be used to perform any other inference task, e.g. P(X1 | X2, …, Xn, Y). "Jack of all trades, master of none."
• Discriminative models are specifically designed and trained to maximize the performance of classification. They model only the conditional distribution P(Y | X).
• By focusing on modeling the conditional distribution, they generally perform better on classification than generative models when given a reasonable amount of training data.

Logistic Regression
• Assumes a parametric form for directly estimating P(Y | X). For binary concepts, this is:
  P(Y=1 | X) = 1 / (1 + exp(w0 + Σ_{i=1..n} wi Xi))
  P(Y=0 | X) = 1 − P(Y=1 | X) = exp(w0 + Σ_{i=1..n} wi Xi) / (1 + exp(w0 + Σ_{i=1..n} wi Xi))
• Equivalent to a one-layer backpropagation neural net.
  – Logistic regression is the source of the sigmoid function used in backpropagation.
  – The objective function for training is somewhat different.

Logistic Regression as a Log-Linear Model
• Logistic regression is basically a linear model, which can be seen by taking logs:
  Assign label Y=0 iff 1 < P(Y=0 | X) / P(Y=1 | X)
                   iff 1 < exp(w0 + Σ_{i=1..n} wi Xi)
  or, equivalently, iff 0 < w0 + Σ_{i=1..n} wi Xi
• Also called a maximum entropy model (MaxEnt), because it can be shown that standard training for logistic regression gives the distribution with maximum entropy that is consistent with the training data.

Logistic Regression Training
• Weights are set during training to maximize the conditional data likelihood:
  W = argmax_W Π_{d∈D} P(Y^d | X^d, W)
  where D is the set of training examples and Y^d and X^d denote, respectively, the values of Y and X for example d.
• Equivalently viewed as maximizing the conditional log likelihood (CLL):
  W = argmax_W Σ_{d∈D} ln P(Y^d | X^d, W)

Logistic Regression Training (cont.)
• Like … [the text preview cuts off here]
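The training slides define the objective, but the preview cuts off before any optimization details, so the following is only a sketch of one standard way to maximize the CLL: batch gradient ascent with NumPy. The gradient expression is derived from the slide's parameterization P(Y=1 | X) = 1 / (1 + exp(w0 + Σ wi Xi)); the learning rate, iteration count, and toy data are arbitrary choices for this sketch, not taken from the slides.

import numpy as np

def p_y1(w, X):
    """P(Y=1 | X) under the slide's parameterization: 1 / (1 + exp(w0 + sum_i wi Xi))."""
    s = X @ w                                  # X already includes a leading 1 column for w0
    return 1.0 / (1.0 + np.exp(s))

def train_logistic_cll(X, y, lr=0.1, iters=1000):
    """Batch gradient ascent on the conditional log likelihood (CLL).

    Under the slide's convention the CLL is
        sum_d [ y_d ln P(Y=1|x_d) + (1 - y_d) ln P(Y=0|x_d) ]
    and its gradient works out to  sum_d x_d (P(Y=1|x_d) - y_d).
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (p_y1(w, X) - y)          # gradient of the CLL
        w += lr * grad                         # ascent step
    return w

# Tiny made-up example: a bias column followed by two feature values per instance.
X = np.array([[1, 0.0, 1.0],
              [1, 1.0, 1.0],
              [1, 2.0, 0.0],
              [1, 3.0, 0.5]])
y = np.array([1, 1, 0, 0])                     # labels Y in {0, 1}
w = train_logistic_cll(X, y)
print(w, p_y1(w, X))                           # learned weights and P(Y=1 | x) per row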

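The outline also lists a Multinomial Logistic Regression (MaxEnt) slide that the preview never reaches. For reference, the usual multiclass generalization (the standard softmax form, not quoted from the slides) keeps one weight vector per class and normalizes exponentiated linear scores: P(Y=yj | X) = exp(wj · X) / Σ_k exp(wk · X). A minimal sketch with made-up weights:

import numpy as np

def softmax_probs(W, x):
    """P(Y=yj | x) = exp(wj . x) / sum_k exp(wk . x) for per-class weight vectors W[j].

    Subtracting the max score before exponentiating is a standard numerical-stability
    trick, not something taken from the slides.
    """
    scores = W @ x                      # one linear score per class
    scores -= scores.max()              # stabilize exp
    e = np.exp(scores)
    return e / e.sum()

# Made-up example: 3 classes, a bias term plus 2 features.
W = np.array([[ 0.5,  1.0, -0.3],      # weights for class 0
              [ 0.1, -2.0,  0.8],      # weights for class 1
              [-0.4,  0.5,  0.2]])     # weights for class 2
x = np.array([1.0, 0.7, -1.2])         # bias = 1, then two feature values
print(softmax_probs(W, x))             # probabilities sum to 1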

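The CRF slides themselves (Simple Linear Chain CRF Features, Conditional Distribution for Linear Chain CRF, Sequence Tagging (Inference), and the rest of the outline) lie beyond the end of the preview, so the following is a companion sketch rather than the slides' own presentation. A linear-chain CRF defines P(y | x) ∝ exp(Σ_t Σ_k λk fk(y_{t-1}, y_t, x, t)), and the most likely tag sequence is recovered with Viterbi dynamic programming over per-position and transition scores; the score matrices below are hypothetical stand-ins for learned feature weights.

import numpy as np

def viterbi(emit, trans):
    """Most probable tag sequence for a linear-chain model.

    emit[t, j]  : score of tag j at position t (sum of weighted token features)
    trans[i, j] : score of the transition tag i -> tag j (weighted edge features)
    In a CRF these scores would be lambda . f(y_{t-1}, y_t, x, t); here they are
    just hypothetical numbers.  Returns the argmax tag sequence.
    """
    T, K = emit.shape
    delta = np.full((T, K), -np.inf)          # best score of any path ending in tag j at t
    back = np.zeros((T, K), dtype=int)        # backpointers
    delta[0] = emit[0]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + trans + emit[t][None, :]   # K x K candidate scores
        back[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0)
    tags = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        tags.append(int(back[t, tags[-1]]))
    return tags[::-1]

# Toy 3-word sentence with 2 tags (say 0 = "O", 1 = "NAME"); all numbers are made up.
emit = np.array([[2.0, 0.5],
                 [0.3, 1.8],
                 [1.1, 1.0]])
trans = np.array([[ 0.4, -0.2],
                  [-0.6,  0.9]])
print(viterbi(emit, trans))               # -> [0, 1, 1] on these scores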