Inside Yahoo! Generator

Thai Tran
CS224N Final Project
June 6, 2002

Contents

Introduction
Corpus
Representation of Search Terms
Parsing Web Pages
Similarity Metric
Stop Words and Stop Shingles
Validating the Similarity Measure
Naïve Algorithm
Improving Performance
Summary of Code
Further Performance Improvements
Conclusions
Acknowledgements
References

Introduction

When a user performs a search on Yahoo, the first section of results returned is called the "Inside Yahoo! Matches." This section is designed to highlight the content inside the Yahoo network that is most relevant to the user's search. For example, a search for "blue book" returns a link to the Kelley Blue Book used-car pricing guide in Yahoo! Autos. Currently, an editorial team manually determines which Yahoo property (Autos, Careers, Travel, etc.) should be featured in the Inside Yahoo match and which page on that property the link should point to. The goal of this project is to employ a machine-learning algorithm to automatically determine the contents of the Inside Yahoo match based on previous choices the editorial team has made. More generally, the system determines which other search terms are most similar to the one being examined. For example, if the system learns that "kelley auto guide" is similar to "blue book," then it is likely that the Inside Yahoo matches for the two search terms are also similar.

Corpus

I selected 50 existing Inside Yahoo search terms from each of the following six categories: Autos, Careers, Maps, Personals, Sports, and Travel. To remove any possible bias, I had an automated process select the terms that were most frequently searched for on Yahoo and had high click-through rates. Unfortunately, this meant that the Autos data set was heavily skewed toward search terms relating to the Kelley and Edmunds auto guides, which are very popular among users.

Representation of Search Terms

For each search term, I needed a set of attributes to describe it. I considered using either the contents or the editorial descriptions of the first twenty web site hits in the Yahoo Directory as the representation of the search term. However, there were two problems with this approach. First, the Yahoo Directory contains only millions of web sites, whereas the web as a whole is believed to contain billions of documents. Second, the descriptions of the web sites in the Yahoo Directory have been edited to be consistent and free of spelling and grammatical errors, which prevented me from obtaining a representation for misspelled search terms and for terms expressed differently from the wording the Yahoo editors had chosen. For example, a search for "jamica" (a misspelling of "jamaica") returns zero web site results from the Yahoo Directory. Instead, I chose to use the contents of the first twenty web page matches from Google (which claims to index over 2 billion documents) as the representation of each search term.

Parsing Web Pages

I took the following approach to extracting the words from a web page:

1. Delete all <script>…</script> and <style>…</style> blocks
2. Strip all HTML escape sequences
3. Strip all HTML tags
4. Strip all non-alphanumeric characters
5. Lowercase all characters
6. The remaining sequence of word tokens is the representation of the web page

Additionally, the word tokens were stemmed using Porter's Stemming Algorithm [4]. Haveliwala et al. have shown that stemming with this algorithm improves the performance of the similarity metric used in this project [2].
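As a concrete illustration, here is a minimal Python sketch of the six-step extraction pipeline. It is my own reconstruction under the assumption that simple regular-expression stripping is acceptable; the function name extract_tokens and the NLTK suggestion in the final comment are assumptions, not details from the paper.

    import re

    def extract_tokens(html):
        """Reduce a web page to a lowercase sequence of word tokens,
        following the six steps listed above."""
        # 1. Delete all <script>...</script> and <style>...</style> blocks.
        text = re.sub(r'(?is)<(script|style)\b.*?</\1\s*>', ' ', html)
        # 2. Strip HTML escape sequences such as &amp; or &#160;.
        text = re.sub(r'&#?[A-Za-z0-9]+;', ' ', text)
        # 3. Strip all remaining HTML tags.
        text = re.sub(r'<[^>]*>', ' ', text)
        # 4. Replace all non-alphanumeric characters with spaces.
        text = re.sub(r'[^A-Za-z0-9]+', ' ', text)
        # 5. Lowercase; 6. the remaining whitespace-separated tokens
        # are the representation of the page.
        return text.lower().split()

    # The paper then stems each token with Porter's algorithm [4]; for
    # example, nltk.stem.PorterStemmer().stem(token) would do this
    # (the library choice is an assumption, not the paper's).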
There are several major challenges to determining the contents of web pages:

- Non-understanding of HTML tags: Since I had stripped away all tags from the document, I did not know whether two words that were adjacent in the resulting sequence would actually be collocated when the original document was rendered in a browser.
- Page Not Found / This Document Has Moved: When a page is no longer available, the web server is supposed to return a 404 response via the HTTP protocol, and when a page has moved, the server is supposed to return a 302 response. Our routine for fetching web pages deals correctly with these situations. However, many web sites will either redirect the request to their home page or display an error message while still returning a 200 (OK) response. This causes the contents of unintended pages to be used in our similarity metric.
- Splash pages: Some web sites have a splash page that contains no content and refreshes after some delay to send you to the real content. To solve this problem, we would need to detect the <meta> refresh tag and fetch the contents of the next page.
- Frames: Pages that contain frames have no content themselves. A solution would be to fetch all the sub-frames and combine them to form the document that we save.
- All graphics/Flash: Some web pages contain nothing but graphics or Macromedia Flash objects, so there is no text to extract from them.
- JavaScript/DHTML: Some web pages are dynamically generated from JavaScript code. Without being able to parse and execute JavaScript, we are unable to deduce the contents of the page.

Despite these challenges, I have found that such aberrations are a minority in the data set.

Similarity Metric

I used the metric introduced by Broder et al. [1] to determine the similarity of two documents. Each document is represented as a sequence of word tokens. A contiguous subsequence of tokens in the document is called a shingle. We define the n-shingling S(D, n) as the set of all shingles of length n in document D.

For example, if a document D is the following sequence:

D = (a, rose, is, a, rose, is, a, rose)

then the 4-shingling of D is the following set:

S(D, 4) = { (a, rose, is, a), (rose, is, a, rose), (is, a, rose, is) }

For a given shingle size n, the resemblance r of two documents A and B is defined as the proportion of shingles the two documents share:

r(A, B) = |S(A, n) ∩ S(B, n)| / |S(A, n) ∪ S(B, n)|

Hence, the resemblance ranges from 0 to 1, with 0 being no resemblance and 1 being perfect resemblance. Note that a document A always resembles itself 100%: r(A, A) = 1. However, resemblance is not transitive: it is possible to have r(A, B) = 50% and r(B, C) = 50% but r(A, C) = 0%.

Note: This entire explanation was shamelessly stolen from [1].

Stop Words and Stop Shingles

To decrease the memory requirement and to remove noise from the data set, the most commonly occurring words (the stop words) were removed from the web pages prior to shingling.
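To make the definitions concrete, here is a minimal Python sketch of the shingling, resemblance, and stop-word steps described above. The stop-word list is an illustrative stand-in (the project's actual list is not shown in this preview), and in the project the tokens would already have been stemmed before shingling.

    # Illustrative stop-word list; a stand-in, not the project's list.
    STOP_WORDS = {"the", "a", "an", "and", "of", "is", "to", "in"}

    def remove_stop_words(tokens):
        """Drop the most commonly occurring words prior to shingling."""
        return [t for t in tokens if t not in STOP_WORDS]

    def shingling(tokens, n):
        """S(D, n): the set of all contiguous length-n subsequences."""
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def resemblance(a, b, n):
        """r(A, B) = |S(A,n) ∩ S(B,n)| / |S(A,n) ∪ S(B,n)|."""
        sa, sb = shingling(a, n), shingling(b, n)
        if not (sa or sb):
            return 1.0  # two empty shingle sets: treat as identical
        return len(sa & sb) / len(sa | sb)

    # The example from the text (no stop words removed here, so that
    # the shingles match the ones shown above):
    d = "a rose is a rose is a rose".split()
    assert shingling(d, 4) == {("a", "rose", "is", "a"),
                               ("rose", "is", "a", "rose"),
                               ("is", "a", "rose", "is")}
    assert resemblance(d, d, 4) == 1.0  # a document resembles itself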

