Review
• Gibbs sampling
 ‣ MH with proposal
 ‣ Q(X | X′) = P(X_B(i) | X_¬B(i)) · I(X_¬B(i) = X′_¬B(i)) / #B
 ‣ failure mode: "lock-down"
• Relational learning (properties of sets of entities)
 ‣ document clustering, recommender systems, eigenfaces

Review
• Latent-variable models
• PCA, pPCA, Bayesian PCA
 ‣ everything Gaussian
 ‣ E(X | U, V) = U Vᵀ
 ‣ MLE: use SVD
• Mean subtraction, example weights

PageRank
• SVD is pretty useful: it turns out to be the main computational step in other models too
• A famous one: PageRank
 ‣ Given: web graph (V, E)
 ‣ Predict: which pages are important

PageRank: adjacency matrix

Random surfer model
 ‣ W.p. α: follow one of the current page's out-links, chosen uniformly at random
 ‣ W.p. (1 − α): jump to a page chosen uniformly at random
 ‣ Intuition: a page is important if a random surfer is likely to land there

Stationary distribution
[Figure: stationary distribution over pages A, B, C, D]

Thought experiment
• What if A is symmetric?
 ‣ note: we're going to stop distinguishing A, A′
• So, the stationary dist'n for symmetric A is just proportional to node degree
• What do people do instead?

Spectral embedding
• Another famous model: spectral embedding (and its cousin, spectral clustering)
• Embedding: assign low-D coordinates to vertices (e.g., web pages) so that similar nodes in the graph ⇒ nearby coordinates
 ‣ A, B similar = a random surfer tends to reach the same places when starting from A or B

Where does the random surfer reach?
• Given graph:
• Start from distribution π
 ‣ after 1 step: P(k | π, 1-step) =
 ‣ after 2 steps: P(k | π, 2-step) =
 ‣ after t steps:

Similarity
• A, B similar = a random surfer tends to reach the same places when starting from A or B
• P(k | π, t-step) =
 ‣ If π has all mass on i:
 ‣ Compare i & j:
 ‣ Role of Σᵗ:

Role of Σᵗ (real data)
[Figure: Σᵗ on real data, curves for t = 1, 3, 5, 10]

Example: dolphins
• 62-dolphin social network near Doubtful Sound, New Zealand (Lusseau et al., 2003)
 ‣ A_ij = 1 if dolphin i is friends with dolphin j
[Figure: sparsity pattern of the 62 × 62 adjacency matrix, nz = 318]

Dolphin network
[Figure: spectral embedding of the dolphin network]

Comparisons
[Figure: random embedding of dolphin data; spectral embedding of random data]

Spectral clustering
• Use your favorite clustering algorithm on the coordinates from the spectral embedding
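The random-surfer and spectral-embedding ideas above come down to a few lines of linear algebra. Below is a minimal NumPy/scikit-learn sketch (not the course code): power iteration for the stationary distribution, then a spectral embedding built from the leading eigenvectors of a symmetrically normalized adjacency matrix (one common choice, weighted by eigenvalue^t), clustered with k-means. The toy random graph stands in for data like the dolphin network, and the constants (α = 0.85, t = 3, k = 2, 3 clusters) are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 20
# Toy symmetric 0/1 adjacency matrix (stands in for a graph like the dolphin network).
A = np.triu((rng.random((n, n)) < 0.15).astype(float), 1)
A = A + A.T

# --- Random surfer / PageRank by power iteration ---
alpha = 0.85                          # assumed: w.p. alpha follow a random out-link, else teleport
deg = A.sum(axis=1)
P = A / np.where(deg > 0, deg, 1.0)[:, None]   # row-stochastic transition matrix
pi = np.full(n, 1.0 / n)
for _ in range(100):
    pi = alpha * (pi @ P) + (1.0 - alpha) / n
    pi /= pi.sum()                    # renormalize (handles dangling nodes)
print("stationary distribution:", np.round(pi, 3))

# --- Spectral embedding: leading eigenvectors, weighted by eigenvalue^t ---
d_inv_sqrt = 1.0 / np.sqrt(np.where(deg > 0, deg, 1.0))
S = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # symmetrically normalized adjacency
vals, vecs = np.linalg.eigh(S)        # S is symmetric, so eigendecomposition = SVD up to signs
order = np.argsort(-np.abs(vals))
k, t = 2, 3                           # embedding dimension and walk length (illustrative)
coords = vecs[:, order[:k]] * (vals[order[:k]] ** t)

# --- Spectral clustering: any off-the-shelf clusterer on the embedding coordinates ---
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)
print("cluster labels:", labels)
```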
PCA: the good, the bad, and the ugly
• The good: simple, successful
• The bad: linear, Gaussian
 ‣ E(X) = U Vᵀ
 ‣ X, U, V ~ Gaussian
• The ugly: failure to generalize to new entities

Consistency
• Linear & logistic regression are consistent
• What would consistency mean for PCA?
 ‣ forget about row/col means for now
• Consistency:
 ‣ #users, #movies, #ratings (= nnz(W))
 ‣ numel(U), numel(V)
 ‣ consistency =

Failure to generalize
• What does this mean for generalization?
 ‣ a new user's rating of movie j: only info is
 ‣ a new movie rated by user i: only info is
 ‣ all our carefully-learned factors give us:
• Generalization is:

Hierarchical model
[Figure: hierarchical model alongside the old, non-hierarchical model]

Benefit of hierarchy
• Now: only k μ_U latents and k μ_V latents (and the corresponding σs)
 ‣ can get consistency for these if we observe more and more X_ij
• For a new user or movie:

Mean subtraction
• Can now see that mean subtraction is a special case of our hierarchical model
 ‣ Fix V_j1 = 1 for all j; then U_i1 plays the role of user i's offset (row mean)
 ‣ Fix U_i2 = 1 for all i; then V_j2 plays the role of movie j's offset (column mean)
 ‣ global mean:

What about the second rating for a new user?
• Estimating U_i from one rating:
 ‣ knowing μ_U:
 ‣ result:
• How should we fix this?
• Note: often we have only a few ratings per user

MCMC for PCA
• Can do Bayesian inference by Gibbs sampling — for simplicity, assume the σs are known

Recognizing a Gaussian
• Suppose X ~ N(X | μ, σ²)
• L = −log P(X = x | μ, σ²) = (x − μ)² / (2σ²) + const
 ‣ dL/dx = (x − μ) / σ²
 ‣ d²L/dx² = 1 / σ²
• So: if we see d²L/dx² = a and dL/dx = a(x − b)
 ‣ μ = b, σ² = 1/a

Gibbs step for an element of μ_U

Gibbs step for an element of U

In reality
• We'd do blocked Gibbs instead
• Blocks contain entire rows of U or V
 ‣ take the gradient and Hessian to get the mean and covariance
 ‣ the formulas look a lot like linear regression (normal equations)
• And we'd fit σ_U, σ_V too
 ‣ sample 1/σ² from a Gamma (or Σ⁻¹ from a Wishart) distribution

Nonlinearity: conjunctive features
[Figure: P(rent) as a function of Comedy and Foreign]

Disjunctive features
[Figure: P(rent) as a function of Comedy and Foreign]

"Other"
[Figure: P(rent) as a function of Comedy and Foreign]

Non-Gaussian
• X, U, and V could each be non-Gaussian
 ‣ e.g., binary!
 ‣ rents(U, M), comedy(M), female(U)
• For X: predicting −0.1 instead of 0 is only as bad as predicting +0.1 instead of 0
• For U, V: might infer −17% comedy or 32% female

Logistic PCA
• Regular PCA: X_ij ~ N(U_i ⋅ V_j, σ²)
• Logistic PCA: X_ij ~ Bernoulli(σ(U_i ⋅ V_j))

More generally…
• Can have
 ‣ X_ij ~ Poisson(μ_ij), μ_ij = exp(U_i ⋅ V_j)
 ‣ X_ij ~ Bernoulli(μ_ij), μ_ij = σ(U_i ⋅ V_j)
 ‣ …
• Called exponential family PCA
• Might expect optimization to be difficult

Application: fMRI (augmented brain imaging)
• Stimulus: a (word + picture) pair, e.g., "dog", "cat", "hammer"
• X (stimulus × words): word co-occurrence from a text corpus, e.g., co-occurs(dog, cat) = 1, co-occurs(dog, walk) = 1, co-occurs(dog, physics) = 0, co-occurs(dog, cupcake) = 0
• Y (stimulus × voxels): fMRI brain activity recorded for each stimulus
• Mitchell et al. (2008), Predicting Human Brain Activity Associated with the Meaning of Nouns, Science
(credit: Ajit Singh)

Results (logistic PCA)
[Figure: predictive accuracy — mean squared error on held-out and folded-in Y (fMRI data), lower is better, comparing HB-CMF (hierarchical Bayesian model), H-CMF (hierarchical maximum a posteriori), and CMF (maximum a posteriori with fixed hyperparameters), using fMRI data alone vs. fMRI data augmented with word co-occurrence; credit: Ajit Singh]
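As a concrete illustration of the logistic PCA model above, here is a minimal NumPy sketch (not the CMF code behind the fMRI results) that fits X_ij ~ Bernoulli(σ(U_i ⋅ V_j)) by MAP gradient ascent on a synthetic binary matrix; the matrix sizes, prior strength lam, step size lr, and iteration count are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_users, n_items, k = 50, 40, 3   # illustrative sizes

# Synthetic binary matrix, e.g. rents(user, movie), generated from a planted low-rank model.
U_true = rng.normal(size=(n_users, k))
V_true = rng.normal(size=(n_items, k))
X = (rng.random((n_users, n_items)) < sigmoid(U_true @ V_true.T)).astype(float)

# MAP fit of logistic PCA: X_ij ~ Bernoulli(sigmoid(U_i . V_j)) with Gaussian priors on U, V,
# maximized by plain full-batch gradient ascent.
U = 0.1 * rng.normal(size=(n_users, k))
V = 0.1 * rng.normal(size=(n_items, k))
lam, lr = 0.1, 0.05               # prior strength and step size (illustrative)
for step in range(501):
    mu = sigmoid(U @ V.T)
    resid = X - mu                # gradient of the Bernoulli log-likelihood w.r.t. U V^T
    U_new = U + lr * (resid @ V - lam * U)
    V_new = V + lr * (resid.T @ U - lam * V)
    U, V = U_new, V_new
    if step % 100 == 0:
        ll = (np.sum(X * np.log(mu + 1e-12) + (1 - X) * np.log(1 - mu + 1e-12))
              - 0.5 * lam * (np.sum(U**2) + np.sum(V**2)))
        print(f"step {step:4d}  penalized log-likelihood {ll:8.1f}")
```

The same loop handles the other exponential-family cases by swapping the mean function and log-likelihood (e.g., exp and a Poisson likelihood instead of sigmoid and a Bernoulli one); only the resid = X − mu form of the gradient stays the same.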