Learning Deep Generative Models
9.520 Class 19
Ruslan Salakhutdinov
BCS and CSAIL, MIT

Talk Outline
1. Introduction.
2. Autoencoders, Boltzmann Machines.
3. Deep Belief Networks (DBN's).
4. Learning Feature Hierarchies with DBN's.
5. Deep Boltzmann Machines (DBM's).
6. Extensions.

Long-term Goal
[Figure: a hierarchy running from raw pixel values, through slightly higher-level representations, up to a high-level representation such as "Tiger rests on the grass."]
• Learn progressively more complex high-level representations.
• Use bottom-up + top-down cues.
• Build high-level representations from large unlabeled datasets.
• Labeled data is used only to slightly adjust the model for a specific task.

Challenges
• Deep models are composed of several layers of nonlinear modules.
• The associated loss functions are almost always non-convex.
• Many bad local optima make deep models very difficult to optimize.
• Idea: learn one layer at a time.

Key Requirements
• Online learning: learning should scale to large datasets containing millions or billions of examples.
• Inferring a high-level representation should be fast: a fraction of a second.
• Demo.

Autoencoders
[Figure: an encoder maps the visible units v to a code layer h with weights W; a decoder maps h back to a reconstruction of v with weights W.]
Consider having D binary visible units v and K binary hidden units h.
Idea: transform the data into a (low-dimensional) code and then reconstruct the data from the code.

Autoencoders
Encoder: $h_j = \frac{1}{1 + \exp(-\sum_i v_i W_{ij})}$, for $j = 1, \dots, K$.
Decoder: $\hat{v}_i = \frac{1}{1 + \exp(-\sum_j h_j W_{ij})}$, for $i = 1, \dots, D$.

Autoencoders
Minimize the reconstruction error:
$\min_W \ \mathrm{Loss}(v, \hat{v}, W) + \mathrm{Penalty}(h, W)$
Loss functions: cross-entropy or squared loss.
Typically, one imposes $\ell_1$ regularization on the hidden units h and $\ell_2$ regularization on the parameters W (related to sparse coding).

Building Block: RBM's
Probabilistic analog: Restricted Boltzmann Machines.
[Figure: an RBM, with visible units v connected to hidden units h by weights W.]
Visible stochastic binary units v are connected to hidden stochastic binary feature detectors h:
$P(v, h) = \frac{1}{Z(W)} \exp\big(\sum_{ij} v_i h_j W_{ij}\big)$
Markov Random Fields, log-linear models, Boltzmann machines.

Building Block: RBM's
$P(v, h) = \frac{1}{Z(W)} \exp\big(\sum_{ij} v_i h_j W_{ij}\big)$,
where $Z(W)$ is known as the partition function:
$Z(W) = \sum_{h, v} \exp\big(\sum_{ij} v_i h_j W_{ij}\big)$.

Inference with RBM's
$P(v, h) = \frac{1}{Z} \exp\big(\sum_{ij} v_i h_j W_{ij}\big)$
The conditional distributions over hidden and visible units are given by logistic functions:
$p(h_j = 1 \mid v) = \frac{1}{1 + \exp(-\sum_i v_i W_{ij})}$
$p(v_i = 1 \mid h) = \frac{1}{1 + \exp(-\sum_j h_j W_{ji})}$
Key observation: given v, we can easily infer the distribution over the hidden units.

Learning with RBM's
$P_{\text{model}}(v) = \sum_h P(v, h) = \frac{1}{Z} \sum_h \exp\big(\sum_{ij} v_i h_j W_{ij}\big)$
Maximum likelihood learning:
$\frac{\partial \log P(v)}{\partial W_{ij}} = E_{P_{\text{data}}}[v_i h_j] - E_{P_{\text{model}}}[v_i h_j]$,
where $P_{\text{data}}(h, v) = P(h \mid v) P_{\text{data}}(v)$, with $P_{\text{data}}(v)$ representing the empirical distribution.
However, computing $E_{P_{\text{model}}}$ is difficult due to the presence of the partition function $Z$.

Contrastive Divergence
[Figure: a Gibbs chain initialized at the data, showing $\langle v_i h_j \rangle_{\text{data}}$, $\langle v_i h_j \rangle_1$, and $\langle v_i h_j \rangle_\infty$; the corresponding states are labeled data, reconstruction, and fantasy.]
Maximum likelihood learning:
$\Delta W_{ij} = E_{P_{\text{data}}}[v_i h_j] - E_{P_{\text{model}}}[v_i h_j]$
Contrastive Divergence learning:
$\Delta W_{ij} = E_{P_{\text{data}}}[v_i h_j] - E_{P_T}[v_i h_j]$
$P_T$ represents the distribution defined by running a Gibbs chain, initialized at the data, for T full steps.

Learning with RBM's
[Figure: learned weights W for MNIST digits and for NORB 3D objects.]

Learning with RBM's
[Figure: input images, the extracted features, and their logistic reconstructions.]

Modeling Documents
Restricted Boltzmann Machines: 2-layer modules.
• Visible units are multinomials over word counts.
• Hidden units are topic detectors.

Extracted Latent Topics
[Figure: 20 Newsgroup 2-D topic space, with clusters for comp.graphics, rec.sport.hockey, sci.cryptography, soc.religion.christian, talk.politics.guns, and talk.politics.mideast.]
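To make the RBM conditionals and the contrastive-divergence update from the preceding slides concrete, here is a minimal numpy sketch of one CD-1 step. It is an illustration, not the lecture's code: it uses binary units without bias terms, as in the slides, and the layer sizes, learning rate, and toy data are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, lr=0.1):
    """One CD-1 step for a binary RBM with no bias terms.

    v_data: (n, D) batch of binary visible vectors.
    W:      (D, K) weight matrix.
    """
    # Positive phase: p(h_j = 1 | v) = sigmoid(sum_i v_i W_ij)
    h_prob = sigmoid(v_data @ W)
    h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)

    # One full Gibbs step: p(v_i = 1 | h) = sigmoid(sum_j h_j W_ij)
    v_prob = sigmoid(h_samp @ W.T)
    v_recon = (rng.random(v_prob.shape) < v_prob).astype(float)
    h_recon = sigmoid(v_recon @ W)

    # Delta W_ij = <v_i h_j>_data - <v_i h_j>_1, averaged over the batch
    n = v_data.shape[0]
    return W + lr * (v_data.T @ h_prob - v_recon.T @ h_recon) / n

# Hypothetical toy run: D = 6 visible units, K = 3 hidden units.
W = 0.01 * rng.standard_normal((6, 3))
v = (rng.random((100, 6)) < 0.5).astype(float)
for _ in range(50):
    W = cd1_update(v, W)
```

Running the Gibbs chain for T > 1 full steps before taking the statistics would approximate $E_{P_T}$ for larger T; the sketch above is the T = 1 case from the slides.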
Collaborative Filtering
A form of matrix factorization.
• Visible units are multinomials over rating values.
• Hidden units are user-preference detectors.
Used in the Netflix competition.

Deep Belief Networks (DBN's)
• There are limitations on the types of structure that can be represented efficiently by a single layer of hidden variables.
• We would like to efficiently learn multi-layer models using a large supply of high-dimensional, highly-structured, unlabeled sensory input.

Learning DBN's
[Figure: a stack of RBMs, trained one layer at a time.]
Greedy, layer-by-layer learning:
• Learn and freeze $W^1$.
• Sample $h^1 \sim P(h^1 \mid v; W^1)$. Treat $h^1$ as if it were data.
• Learn and freeze $W^2$.
• …
Learn high-level representations.

Learning DBN's
Under certain conditions, adding an extra layer always improves a lower bound on the log probability of the data.
Each layer of features captures high-order correlations between the activities of units in the layer below.

Learning DBN's
[Figure: 1st-layer features and 2nd-layer features.]

Density Estimation
[Figure: samples from a DBN and from a mixture of Bernoulli's.]
MoB, test log-prob: -137.64 per digit.
DBN, test log-prob: -85.97 per digit.
A difference of over 50 nats is quite large.

Learning Deep Autoencoders
[Figure: a 2000-1000-500-30 stack of RBMs with weights $W^1, \dots, W^4$.]
Pretraining consists of learning a stack of RBMs.
Each RBM has only one layer of feature detectors.
The learned feature activations of one RBM are used as the data for training the next RBM in the stack.

Learning Deep Autoencoders
[Figure: the pretrained stack unrolled into an encoder, a 30-D code layer, and a decoder with tied weights $W^T$.]
After pretraining multiple layers, the model is unrolled to create a deep autoencoder.
Initially the encoder and decoder networks use the same weights.
The global fine-tuning uses backpropagation through the whole autoencoder to fine-tune the weights for optimal reconstruction.

Learning Deep Autoencoders
[Figure: the three stages of learning a deep autoencoder: pretraining a stack of RBMs, unrolling them into an encoder-decoder network, and fine-tuning the weights $W + \epsilon$ with backpropagation.]

Learning Deep Autoencoders
We used a 25×25-2000-1000-500-30 autoencoder to extract 30-D real-valued codes for Olivetti face patches (a network with 7 hidden layers is usually hard to train).
[Figure: Top, random samples from the test dataset; Middle, reconstructions by the 30-dimensional deep autoencoder; Bottom, reconstructions by 30-dimensional PCA.]

Dimensionality Reduction
The Reuters Corpus: 804,414 newswire stories.
Simple "bag-of-words" representation.
2000-500-250-125-2 autoencoder.
[Figure: 2-D codes for the Reuters corpus, with clusters for Legal/Judicial, Leading Economic Indicators, European Community Monetary/Economic, Accounts/Earnings, Interbank Markets, Government Borrowings, Disasters and Accidents, and Energy Markets.]

Document Retrieval
[Figure: precision-recall curves when a 10-D query document is used to retrieve other documents from the test set.]
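As a rough illustration of the greedy stacking and unrolling procedure from the DBN and deep-autoencoder slides, here is a self-contained numpy sketch. The layer sizes, epoch counts, and toy data are placeholders rather than the lecture's settings, and the global fine-tuning stage is only noted in a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=20, lr=0.1):
    """Train one binary RBM (no biases, as in the slides) with CD-1."""
    n, d = data.shape
    W = 0.01 * rng.standard_normal((d, n_hidden))
    for _ in range(epochs):
        h_prob = sigmoid(data @ W)                      # positive phase
        h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)
        v_recon = sigmoid(h_samp @ W.T)                 # one full Gibbs step
        h_recon = sigmoid(v_recon @ W)
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n
    return W

def pretrain_stack(data, layer_sizes):
    """Greedy, layer-by-layer learning: learn and freeze W1, treat the
    hidden activations h1 ~ P(h1 | v; W1) as data, learn W2, and so on."""
    weights, x = [], data
    for k in layer_sizes:
        W = train_rbm(x, k)
        weights.append(W)
        x = sigmoid(x @ W)      # feature activations become the next "data"
    return weights

def autoencode(v, weights):
    """Unrolled deep autoencoder: the decoder initially reuses the
    transposed encoder weights."""
    h = v
    for W in weights:           # encoder
        h = sigmoid(h @ W)
    code = h
    for W in reversed(weights):  # decoder with tied weights W^T
        h = sigmoid(h @ W.T)
    return code, h

# Hypothetical toy stack (the slides use e.g. 2000-1000-500-30 on real data).
v = (rng.random((100, 64)) < 0.5).astype(float)
weights = pretrain_stack(v, [32, 16, 8])
code, v_recon = autoencode(v, weights)
# Global fine-tuning would then backpropagate the reconstruction error
# through the whole unrolled network (omitted here).
```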