Natural Language Processing (CS 224N)
Prof. Manning

Chris Cox
Michael Siliski
Charlie Stockman

NLP Final Programming Project: Clustering Web Search Results

Introduction

For our project we built a system that separates the results of a Google search (or any list of URLs) into distinct groups. Such a system would clearly be useful as an advance in search technology. For example, when searching for the word "Guinness", it would probably aid browsing if the results were separated into, say, three groups: those related to the beer, those related to the world records, and those related to the actor (Alec Guinness).

We decided to use clustering techniques to tackle the problem. This seemed a reasonable choice because clustering is unsupervised (i.e., it requires no training data), and we of course would not have any training data to go along with our lists of websites. In fact, some search engines, such as Vivisimo, already use some form of clustering. While many other web search engines, such as Yahoo! and Google, classify search results into categories in a pre-defined hierarchy, true clustering would obviate the need for such hand-built hierarchies and would be much more flexible (and useful) in the way it groups search results.

Background

The purpose of clustering is to separate data elements (in this case, web pages) into distinct groups such that similarities within groups are high and similarities across groups are low. There are two basic structures of clustering: hierarchical clustering and non-hierarchical, or flat, clustering. In hierarchical clustering, objects are placed in their most specific clusters, and these clusters then act as sub-clusters of larger clusters, and so on, forming a tree in which all objects belong to the highest, most general cluster. In non-hierarchical clustering, objects are placed in exactly one cluster, and the relations between clusters are unimportant. Our system performs non-hierarchical clustering, assigning documents to separate clusters without trying to define relationships between the clusters.

Clustering structures also differ in whether they make hard or soft assignments. In hard assignment, each object is put into one and only one cluster. In soft assignment, objects may be partly assigned to a cluster or assigned to several clusters at once. Our system makes hard assignments, although one could also make a case for a system that puts some web pages into multiple clusters when they fit more than one.

There are several techniques for flat clustering. The simplest, and the one we used, is K-means. In K-means, the number of clusters is known beforehand, and each data element is represented by a set, or vector, of features; these features could be the word counts of a document, the x and y coordinates of a point, and so on. The centroid of each cluster is a vector with the same number of features as the data elements, representing the average of the data elements within the cluster. The centroids are initially set randomly (though we diverge from this), and each data element is then assigned to the cluster with the closest centroid. The centroids are then recalculated as the average of all the data elements within each cluster, and the data elements are reassigned. This iteration continues until the clusters stop changing.
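To make the procedure concrete, the following is a minimal K-means sketch in Python. It is an illustration, not our project code: the squared Euclidean distance, the purely random seeding, and the iteration cap are all simplifying assumptions.

    import random

    def squared_distance(p, q):
        """Squared Euclidean distance between two equal-length feature vectors."""
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def kmeans(points, k, max_iters=100):
        """Partition points (lists of floats) into k clusters; return (assignments, centroids)."""
        # Seed the centroids with k randomly chosen data elements.
        # (As noted above, our system diverges from purely random initialization.)
        centroids = [list(p) for p in random.sample(points, k)]
        assignments = None
        for _ in range(max_iters):
            # Assignment step: attach every point to its nearest centroid.
            new_assignments = [
                min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                for p in points
            ]
            if new_assignments == assignments:
                break  # clusters stopped changing, so we are done
            assignments = new_assignments
            # Update step: recompute each centroid as the mean of its members.
            for c in range(k):
                members = [p for p, a in zip(points, assignments) if a == c]
                if members:  # keep the old centroid if a cluster empties out
                    centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        return assignments, centroids

For example, kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], k=2) settles into one cluster with centroid (0, 0.5) and another with centroid (10, 10.5) after at most a couple of reassignment rounds.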
Although K-means is the simplest standard clustering algorithm, it is usually effective. Two related clustering techniques are Clustering By Committee (CBC) and average agglomerative clustering. In CBC, the centroids are calculated as the average of only some of the members of a cluster, which lessens the impact of outliers and of data elements that only marginally belong in the cluster. In average agglomerative clustering, every pair of data elements is compared, and the most similar pair is merged into a cluster whose centroid is the average of the two. The process is then repeated, with the centroid of the new cluster standing in for the data elements belonging to it: each point is again compared with every other point or cluster, and the most similar pair forms a new cluster. This continues until the desired number of clusters is reached.

An example of a system that uses techniques like these for document clustering is the Scatter/Gather interface designed at Xerox PARC. Scatter/Gather begins with a small number of very general document clusters. The user chooses a subset of these clusters, whose documents are then rescattered and reassigned to more specific clusters (hence the name). In this way the user can create more and more specific clusters from which to choose text documents. The clustering used in the interface is similar to K-means, though probably more sophisticated. In fact, Scatter/Gather provided some of the basis for our method of feature selection, as we also tried to define features as words that appear in certain documents but not in others.

The other basic non-hierarchical clustering technique is the EM algorithm. Here each cluster is viewed as one Gaussian (a normal distribution, or bell curve) in a mixture, that is, as a cause that helps generate the underlying data. For each object one can compute the probability that each cluster, or Gaussian, generated it. Like all EM algorithms, this is an iterative process: you estimate the parameters, in this case those of the Gaussian mixture, and then use these parameters to infer the hidden structure of the data, in this case which clusters each object belongs to (or partly belongs to). Although the EM algorithm seems quite different from K-means, it is essentially a soft-clustering version of K-means in which objects can belong to multiple clusters (or Gaussians).

The Problem

With this brief overview of clustering, we now present our process of creating a clustering system for Google searches. We wanted to take a list of Google search results and determine, in an unsupervised manner, which sets of documents were related to each other. We built data sets of the first several hundred search-result URLs from various Google searches. Our clustering program takes these URLs as input, attempts to download the referenced documents, and clusters those that download successfully.
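That pipeline might look roughly like the sketch below, again in Python. The fetching code, the tokenizer, and the fixed feature vocabulary are hypothetical placeholders rather than our implementation (choosing good feature words is exactly the feature-selection problem mentioned above), and kmeans refers to the sketch from the Background section.

    import re
    from collections import Counter
    from urllib.request import urlopen

    def fetch(url, timeout=10):
        """Download one page; return its text, or None if the download fails."""
        try:
            return urlopen(url, timeout=timeout).read().decode("utf-8", errors="replace")
        except Exception:
            return None  # unreachable or malformed pages are simply skipped

    def feature_vector(text, vocabulary):
        """Represent a document as word counts over a fixed feature vocabulary."""
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        return [counts[word] for word in vocabulary]

    def cluster_search_results(urls, vocabulary, k):
        """Download the URLs that respond and group the resulting pages into k clusters."""
        downloaded = [(u, t) for u, t in ((u, fetch(u)) for u in urls) if t is not None]
        vectors = [feature_vector(t, vocabulary) for _, t in downloaded]
        labels, _ = kmeans(vectors, k)  # K-means sketch from the Background section
        clusters = {}
        for (url, _), label in zip(downloaded, labels):
            clusters.setdefault(label, []).append(url)
        return clusters

Run over, say, a few hundred "Guinness" result URLs with k=3 and a vocabulary of discriminative words, cluster_search_results would ideally reproduce the beer / world-records / actor split described in the Introduction.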

