Unformatted text preview:

Web Search EnginesSearch Engine CharacteristicsWeb Search QueriesDirectories vs. Search EnginesWhat about Ranking?Relevance: Going Beyond IRWeb Search ArchitectureStandard Web Search Engine ArchitectureInverted Indexes the IR WayHow Inverted Files Are CreatedHow Inverted Files are CreatedHow Inverted Files are CreatedSlide 13Slide 14Inverted indexesInverted Indexes for Web Search EnginesFrom description of the FAST search engine, by Knut Risvik http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htmCascading Allocation of CPUsWeb CrawlingWeb CrawlersWeb Crawling AlgorithmWeb Crawling IssuesGoogle: A Case StudyGoogle’s IndexingGoogle’s Inverted IndexSlide 26Link Analysis for Ranking PagesSlide 28PageRankPageRank: User ModelWeb Search StatisticsSearches per DayWeb Search Engine VisitsPercentage of web users who visit the site shownSearch Engine Size (July 2000)Does size matter? You can’t access many hits anyhow.Increasing numbers of indexed pages, self-reportedWeb CoverageSlide 39Directory sizesDatabase Management Systems, R. Ramakrishnan 1Web Search EnginesChapter 27, Part CBased on Larson and Hearst’s slides at UC-Berkeleyhttp:// www.sims.berkeley.edu/courses/is202/f00/Database Management Systems, R. Ramakrishnan 2Search Engine CharacteristicsUnedited – anyone can enter content•Quality issues; SpamVaried information types•Phone book, brochures, catalogs, dissertations, news reports, weather, all in one place!Different kinds of users•Lexis-Nexis: Paying, professional searchers•Online catalogs: Scholars searching scholarly literature•Web: Every type of person with every type of goalScale•Hundreds of millions of searches/day; billions of docsDatabase Management Systems, R. Ramakrishnan 3Web Search QueriesWeb search queries are short:•~2.4 words on average (Aug 2000)•Has increased, was 1.7 (~1997)User Expectations:•Many say “The first item shown should be what I want to see!”•This works if the user has the most popular/common notion in mind, not otherwise.Database Management Systems, R. Ramakrishnan 4Directories vs. Search EnginesDirectories•Hand-selected sites•Search over the contents of the descriptions of the pages•Organized in advance into categoriesSearch Engines•All pages in all sites •Search over the contents of the pages themselves•Organized in response to a query by relevance rankings or other scoresDatabase Management Systems, R. Ramakrishnan 5What about Ranking?Lots of variation here•Often messy; details proprietary and fluctuatingCombining subsets of:•IR-style relevance: Based on term frequencies, proximities, position (e.g., in title), font, etc. •Popularity information •Link analysis informationMost use a variant of vector space ranking to combine these. Here’s how it might work:•Make a vector of weights for each feature•Multiply this by the counts for each featureDatabase Management Systems, R. Ramakrishnan 6Relevance: Going Beyond IRPage “popularity” (e.g., DirectHit)•Frequently visited pages (in general)•Frequently visited pages as a result of a queryLink “co-citation” (e.g., Google)•Which sites are linked to by other sites?•Draws upon sociology research on bibliographic citations to identify “authoritative sources”•Discussed further in Google case studyDatabase Management Systems, R. Ramakrishnan 7Web Search ArchitectureDatabase Management Systems, R. Ramakrishnan 8Standard Web Search Engine Architecturecrawl thewebcreate an invertedindexCheck for duplicates,store the documentsInverted indexSearch engine serversuserqueryShow results To userDocIdsDatabase Management Systems, R. Ramakrishnan 9Inverted Indexes the IR WayDatabase Management Systems, R. Ramakrishnan 10How Inverted Files Are CreatedPeriodically rebuilt, static otherwise.Documents are parsed to extract tokens. These are saved with the Document ID.Now is the timefor all good mento come to the aidof their countryDoc 1It was a dark andstormy night in the country manor. The time was past midnightDoc 2Term Doc #now 1is 1the 1time 1for 1all 1good 1men 1to 1come 1to 1the 1aid 1of 1their 1country 1it 2was 2a 2dark 2and 2stormy 2night 2in 2the 2country 2manor 2the 2time 2was 2past 2midnight 2Database Management Systems, R. Ramakrishnan 11How Inverted Files are CreatedAfter all documents have been parsed the inverted file is sorted alphabetically.Term Doc #a 2aid 1all 1and 2come 1country 1country 2dark 2for 1good 1in 2is 1it 2manor 2men 1midnight 2night 2now 1of 1past 2stormy 2the 1the 1the 2the 2their 1time 1time 2to 1to 1was 2was 2Term Doc #now 1is 1the 1time 1for 1all 1good 1men 1to 1come 1to 1the 1aid 1of 1their 1country 1it 2was 2a 2dark 2and 2stormy 2night 2in 2the 2country 2manor 2the 2time 2was 2past 2midnight 2Database Management Systems, R. Ramakrishnan 12How InvertedFiles are CreatedMultiple term entries for a single document are merged.Within-document term frequency information is compiled.Term Doc # Freqa 2 1aid 1 1all 1 1and 2 1come 1 1country 1 1country 2 1dark 2 1for 1 1good 1 1in 2 1is 1 1it 2 1manor 2 1men 1 1midnight 2 1night 2 1now 1 1of 1 1past 2 1stormy 2 1the 1 2the 2 2their 1 1time 1 1time 2 1to 1 2was 2 2Term Doc #a 2aid 1all 1and 2come 1country 1country 2dark 2for 1good 1in 2is 1it 2manor 2men 1midnight 2night 2now 1of 1past 2stormy 2the 1the 1the 2the 2their 1time 1time 2to 1to 1was 2was 2Database Management Systems, R. Ramakrishnan 13How Inverted Files are CreatedFinally, the file can be split into •A Dictionary or Lexicon file and •A Postings fileDatabase Management Systems, R. Ramakrishnan 14How Inverted Files are CreatedDictionary/Lexicon PostingsTerm Doc # Freqa 2 1aid 1 1all 1 1and 2 1come 1 1country 1 1country 2 1dark 2 1for 1 1good 1 1in 2 1is 1 1it 2 1manor 2 1men 1 1midnight 2 1night 2 1now 1 1of 1 1past 2 1stormy 2 1the 1 2the 2 2their 1 1time 1 1time 2 1to 1 2was 2 2Doc # Freq2 11 11 12 11 11 12 12 11 11 12 11 12 12 11 12 12 11 11 12 12 11 22 21 11 12 11 22 2Term N docs Tot Freqa 1 1aid 1 1all 1 1and 1 1come 1 1country 2 2dark 1 1for 1 1good 1 1in 1 1is 1 1it 1 1manor 1 1men 1 1midnight 1 1night 1 1now 1 1of 1 1past 1 1stormy 1 1the 2 4their 1 1time 2 2to 1 2was 1 2Database Management Systems, R. Ramakrishnan 15Inverted indexesPermit fast search for individual termsFor each term, you get a list consisting of:•document ID •frequency of term in doc (optional)


View Full Document

Mt Holyoke CS 341 - Web Search Engines

Download Web Search Engines
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Web Search Engines and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Web Search Engines 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?