UCSC FDM 20C - Search engines - D86852

Home> Schools> University of California, Santa Cruz> (FDM) > FDM 20C> Search engines

UCSC FDM 20C - Search engines

School name University of California, Santa Cruz

Course Fdm 20c- Introduction to Digital Media

Pages 20

Download Save

Unformatted text preview:

PowerPoint Presentationoutlinewhat do search engines do?crawling/spidering the webdepth-first crawlbreadth-first crawlindexing web pages found by the crawlertokenizationstop wordsstemmingdocument vectorsinverted indexinverted index / dictionaryinverted index / dictionary / exampleinverted index / postings / exampleretrievalbetter retrievalsocial networks and centralitygoogle’s page rank algorithmthe difference is in the detailssearch engines fdm 20c introduction to digital medialecture 04.12.2008warren sack / film & digital media department / university of california, santa cruzoutline•what’s the difference between a web directory and a search engine?–ontologies–clusters•what does a search engine do?–crawling–indexing–retrievingwhat do search engines do?•crawl the web•index the pages found•retrieve pages from the index in response to user queriescrawling/spidering the webdepth-first crawl source: http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/SearchAnimations.htmlbreadth-first crawl source: http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/SearchAnimations.htmlindexing web pages found by the crawler1. tokenization: break the document into words, punctuation, tags, and uris2. stop words: remove any that are too frequently used to be informational (these are called stop words)3. stemming: find the stem of each word4. inverted index: create an index linking each word found to one or more documentstokenization•T’was was a dark and stormy night in the country manor. ==>•T single-quote was a dark and stormy night in the country manor periodstop words•T single-quote was a dark and stormy night in the country manor period==>•T single-quote was a dark and stormy night in the country manor periodstemming•dark stormy night country manor==>•dark stormy night country manordocument vectors•documents (i.e., web pages) are stored as vectors or lists; e.g., country dark manor night storm•note that syntactic structure is lostinverted index•dictionary: make a file listing all of the words found in all of the web pages crawled•postings: for each word, make a list of all of the web pages in which the word was found•encoding: remove any redundancies found in the dictionary and postings files so that they are as small as possibleinverted index / dictionary•create a document vector for each web page•combine all of the vectors together•remove any duplicate words•alphabetize the resulting listinverted index / dictionary / example•webpage 1: T’was a dark and stormy night in the country manor.==> country dark manor night storm•webpage 2: Now is the time for all good men to come to the aid of their country.==> aid all come country good men time •dictionary: aid all come country dark good manor men night storm timeinverted index / postings / example•postings: aid (2) all (2) come (2) country (1&2) dark (1) good (2) manor (1) men (2) night (1) storm (1) time (2)•the postings file just annotates the dictionary file with the documents in which the words appearretrieval•the postings file can support boolean queries•for example, searching for webpages that contain the word “country” returns 1&2.•for example, searching for the webpages that contain the words “country” and “manor” returns webpage 1better retrieval•how should the webpages retrieved be ordered?•e.g., if webpage 1 and webpage 2 both contain the word “country” should 1 be listed before or after 2?social networks and centrality Diane is central; Jane is not. See www.orgnet.com/sna.htmlgoogle’s page rank algorithm•basic idea: if a webpage has a lot of other webpages that link to it, then it is central and probably an important page.•If page A has pages T1...Tn which link to it (i.e., are citations) and C(A) is defined as the number of links going out of page A, then PageRank (PR) of a page A is given as follows:•PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) –where the parameter d is a damping factor which can be set between 0 and 1 and is usually set to 0.85.the difference is in the details•The Boolean approach to information retrieval is the most widely used. Links to documents are returned only if they contain exactly the same words as your query. Even if a site is related in some way, it would never be retrieved unless it contains the user's search string. In order to get better results from this, most search engines finesse things a little. The results from the inverted file are modified by factors such as: where the search term appears in the item - whether the string of characters appears in specially tagged areas such as titles or head, or whether it appears early in the body text of the html document; how close the search words are together, or whether they form an 'exact phrase'; the frequency of occurrence of search terms in the inverted file; whether the search terms appear as keywords in the metatags in the html file's head; how frequently the terms appear in relation to the document's length (greater frequency indicating a probable greater relevance). These techniques gradually move over into semantic analysis. For instance selection by the frequency of occurrence of words in general in which very uncommon words get heavier weight. Matthew

View Full Document


School:
Email:
New Password:
Confirm Password:

UCSC FDM 20C - Search engines

Sign up for free to view:

Please select your school