More PCA Latent Semantic Analysis and Multidimensional Scaling 36 350 Data Mining 21 September 2009 Reading Principles of Data Mining section 14 3 3 on latent semantic indexing Contents 1 Latent Semantic Analysis Yet More PCA and Yet More Information Retrieval 1 1 1 Principal Components of the Times Stories 3 2 PCA for Visualization 1 5 Latent Semantic Analysis Yet More PCA and Yet More Information Retrieval Back when I was talking about abstraction I mentioned that dimension reduction is something that can be layered in between our original representations like bags of words and techniques that work in feature space like similarity searching or nearest neighbors That is rather than looking at the original features we can apply the same procedures to the reduced synthetic features we get from doing dimension reduction This can have advantages in terms of speed the vectors are smaller memory ditto and even accuracy since there are fewer parameters explicit or implicit to learn One particularly nice application of this idea is to combine information retrieval with the use of principal components in dimension reduction This is called latent semantic analysis or latent semantic indexing Remember from last time that the principal components we get from a collection of vectors depend on the covariance across the features Features which are strongly correlated with each other will have projections on to the principal components which are very close to each other while features which are weakly correlated or not at all will have nearly or exactly orthogonal projections they ll project on to different principal components 1 Now suppose that the features are words in a big matrix of bag of word vectors Two words being correlated means that they tend to appear together in the documents or not at all But this tendency needn t be absolute it can be partial because the words mean slightly different things or because of stylistic differences etc But the projections of those features on to the principal components will generally be more similar than the original features are To see how this can be useful imagine we have a collection of documents a corpus which we want to search for documents about agriculture It s entirely possible that many documents on this topic don t actually contain the word agriculture just closely related words like farming A simple feature vector search on agriculture will miss them But it s very likely that the occurrence of these related words is well correlated with the occurrence of agriculture This means that all these words will have similar projections on to the principal components so if we do similarity searching on the images of the query and the corpus a search for agriculture will turn up documents that use farming a lot To see why this is latent semantic indexing think about what goes into coming up with an index for a book by hand Someone draws up a list of topics and then goes through the book noting all the passages which refer to the topic and maybe a little bit of what they say there For example here s the start of the entry for Agriculture in the index to Adam Smith s The Wealth of Nations Agriculture the labour of does not admit of such subdivisions as manufactures 6 this impossibility of separation prevents agriculture from improving equally with manufactures 6 natural state of in a new colony 92 requires more knowledge and experience than most mechanical professions and yet is carried on without any restrictions 127 the terms of rent how adjusted between landlord and tenant 144 is extended by good roads and navigable canals 147 under what circumstances pasture land is more valuable than arable 149 gardening not a very gainful employment 152 3 vines the most profitable article of culture 154 estimates of profit from projects very fallacious ib cattle and tillage mutually improve each other 220 and so on Agriculture is an important topic in The Wealth of Nations It s asking a lot to hope for a computer to be able to do something like this but we could at least hope for a list of pages like 6 92 126 144 147 152 3 154 220 One could imagine doing this by treating each page as its own document forming its bag of words vector and then returning the list of pages with a non zero entry for the feature agriculture This will fail only two of those nine pages actually contains that word and this is pretty typical On the other hand they are full of words strongly correlated with agriculture so asking for the pages which are most similar in their principal components 2 projection to that word will work great 1 At first glance and maybe even second this seems like a wonderful trick for extracting meaning semantics from pure correlations Of course there are also all sorts of ways it can fail not least from spurious correlations If our training corpus happens to contain lots of documents which mention farming and Kansas as well as farming and agriculture latent semantic indexing will not make a big distinction between the relationship between agriculture and farming which is genuinely semantic and that between Kansas and farming which is accidental and probably wouldn t show up in say a corpus collected from Europe Despite this susceptibility to spurious correlations latent semantic indexing is an extremely useful technique in practice and the foundational papers Deerwester et al 1990 Landauer and Dumais 1997 are worth reading you can find them on Blackboard or the course website 1 1 Principal Components of the Times Stories To get a more concrete sense of how latent semantic analysis works and how it reveals semantic information let s apply it to the Times stories The object nyt frame contains the stories as usual after inverse document frequency weighting and Euclidean length normalization with the first column containing class labels nyt pca prcomp nyt frame 1 nyt latent sem nyt pca rotation We need to omit the first column in the first command because it contains categorical variables and PCA doesn t apply to them The second command just picks out the matrix of projections of the features on to the components this is called rotation because it can be thought of as rotating the coordinate axes in feature vector space Now that we ve done this let s look at what the leading components are signif sort nyt latent sem 1 decreasing TRUE 1 30 2 music trio theater orchestra composers 0 110 0 084 0 083 0 067 0 059 theaters m festival east program 0 055 0 054 0 051 0 049
View Full Document
Unlocking...