Searching on the WWW: The Google Phenomenon
(Snyder, pp. 119-141)

Searching
- The best place to look for something is where it is likely to be found. This is the key to finding information.
- A lot of information can be found on the WWW, but that is also the flaw of the WWW:
  - Too much information
  - No organization
  - No structure

Organizing Information
- Classification
- Hierarchy: categories with sub-categories
- What are some problems with hierarchies?

Problem with Hierarchies
- Example: I want to find the requirements for the Minor in IS.
- (Diagram: a hierarchy rooted at Siena College, branching into Academics, Financial Aid, and Athletics; Academics branches into Degree Requirements, Forms, and Schools; Schools into Science and Arts; Science into CS; the IS Minor page sits under CS.)
- In a strict hierarchy there is only one path to the page you want, so you must guess the correct category at every level.

Webs or Networks
- Multiple paths to the same information.
- (Same diagram, now with cross-links between branches, so the IS Minor page can be reached along several different paths.)

Problem with Web Networks
- It becomes less important to put things in the correct category, so information architects don't worry too much about classification and organization.
- As different networks of information are connected:
  - Excessive redundant links emerge
  - Different organization strategies clash
  - A mess is created

Search Engines to the Rescue
- An alternative to searching via navigation.
- Timeline: http://www.marketwatch.com/newsimages/misc/search_engines_timeline.pdf

How Search Engines Work
1. Web crawling: a program (a robot) surfs from hyperlink to hyperlink, accumulating pages.
2. Web indexing: each accumulated page is added to a database. The URL of the page is stored, along with each word, its number of occurrences, and sometimes its position.
3. User search: a query actually searches the index database, not the live web.
4. Ranking: sophisticated algorithms are used to retrieve and rank the pages that "match" the user's search.

Step 1: Web Crawling
- The hardest task: 5-10 million new web pages are added to the Internet every day.
- Robots need to know where to start looking.
- You need the help and cooperation of web-page creators.

Step 2: Web Indexing
- Each URL consists of a list of words:
  URL1: word5 word74 word195 word456
  URL2: word7 word82 word135
  URL3: word5 word74 word165 word288
  URL4: word21 word59
  URL5: word25 word74 word188 word432
  URL6: word7 word186 word430
  URL7: word2 word398
  URL8: word34 word39 word84 word193
  ...
- Inverted index: each word consists of a list of URLs:
  word1: URL19 URL39 URL82 URL91
  word2: URL27 URL41 URL66 URL67
  word3: URL49 URL75 URL65
  word4: URL29 URL89
  word5: URL12 URL48 URL66
  word6: URL53 URL73 URL123 URL144
  word7: URL3 URL41 URL77
  ...

Step 3: User Search
- Searching the index database must be quick.
- The database is sorted by keywords (the primary index).
- The English language has about 600,000 words; luckily, only about a tenth of them are widely used.
- The database server needs to store the primary index in memory (RAM).

Step 4: Ranking the Results
- Searches on common words can return millions of pages.
- Ordering, or ranking, becomes more important as the amount of data increases.
- Intuitive measures:
  - Number of occurrences of the search words
  - Whether a search word appears in the title, keywords, etc.
  - The "importance" of the web page
  - User feedback

Search Engine Issues
- Logical statements: AND, OR, etc.
- Phrases: "Grilled Cheese"
- Images (Dali example)
- Dishonesty (XXX example)
- Differences in vocabulary (IBM issue)

Search Engines (Catch-22)
- Search-engine companies make money by placing ads.
- More searches = bigger audience = more $$$ from ads.
- Best thing: get as many people as possible to use your search engine.
- Worst thing: what if everyone exclusively uses search engines to search the WWW?

How Google Became the Best
- The PageRank algorithm (based on the Clever algorithm).
- PageRank is a measure of importance.
- Links from important pages improve your PageRank.
- (Diagram: a small link graph whose pages carry sample PageRank scores of 1.2, 1.6, 2.4, 3.1, 2.0, 4.6, 3.8, and 12.1.)

PageRank continued
- http://www.iprcom.com/papers/pagerank/
- A simplistic explanation:
  - Initially, all pages have the same PageRank.
  - An iterative process increases each page's rank: direct links count first (highly weighted, x1), then one-hop links, then two-hop links, ..., then ten-hop links (very low weighting, x0.001), and so on.
  - The algorithm ignores cycles and does not reward cliques.
  - The iteration continues until the page ranks stabilize (stop increasing).

PageRank Intuition
- ESPN.com is highly ranked because:
  - Several other highly ranked pages point to it.
  - Millions of low-ranked pages point to it.
- Any page connected to, or part of, ESPN.com benefits from this.
- Intuitively, ESPN is an information authority on sports.

PageRank Intuition (continued)
- Breimer.net is poorly ranked because very few pages point to it (none, to be exact).
- The page is not an authority on "Breimer" until other pages acknowledge its existence via a link.
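The crawling step (Step 1 above) can be sketched as a breadth-first traversal of hyperlinks. This is a minimal sketch over a toy in-memory "web": the URLs and links in TOY_WEB are invented for illustration, and a real crawler would fetch pages over HTTP and parse out their links instead of reading a dictionary.

```python
from collections import deque

# A toy "web": each URL maps to the hyperlinks found on that page.
# (These URLs and links are made up for illustration.)
TOY_WEB = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com", "d.com"],
    "d.com": [],
}

def crawl(seed):
    """Breadth-first crawl from a seed URL, visiting each page once."""
    visited = []
    frontier = deque([seed])
    seen = {seed}
    while frontier:
        url = frontier.popleft()
        visited.append(url)          # "accumulate" the page
        for link in TOY_WEB.get(url, []):
            if link not in seen:     # skip pages already queued (ignores cycles)
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl("a.com"))  # → ['a.com', 'b.com', 'c.com', 'd.com']
```

This also shows why robots "need to know where to start looking": a different seed reaches a different portion of the graph.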
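The indexing and lookup steps (Steps 2 and 3 above) can be sketched in a few lines of Python. The page texts in PAGES are made up for illustration, and this toy search requires every query word to appear on a page, i.e. an implicit AND over the query words.

```python
from collections import defaultdict

# Toy documents (URLs and text invented for illustration).
PAGES = {
    "url1": "grilled cheese sandwich recipe",
    "url2": "cheese shop directory",
    "url3": "grilled vegetables recipe",
}

def build_inverted_index(pages):
    """Step 2: map each word to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Step 3: intersect the URL sets of all query words (implicit AND)."""
    result = None
    for word in query.lower().split():
        urls = index.get(word, set())
        result = urls if result is None else result & urls
    return result or set()

index = build_inverted_index(PAGES)
print(sorted(search(index, "grilled cheese")))  # → ['url1']
```

Note that the query never touches the pages themselves, only the inverted index, which is why keeping the index in RAM matters for speed.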
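The intuitive ranking measures from Step 4 (occurrence counts, a bonus for words in the title, and overall page "importance") can be combined into a toy scoring function. The formula and weights here are hypothetical, chosen only to illustrate the idea, not any real engine's ranking.

```python
def score(page_title, page_text, query_words, importance=1.0):
    """Toy ranking score: body word counts, a title bonus, and an
    'importance' multiplier standing in for something like PageRank.
    (All weights are invented for illustration.)"""
    words = page_text.lower().split()
    title = page_title.lower().split()
    s = 0.0
    for w in query_words:
        s += words.count(w)      # number of occurrences in the body
        if w in title:
            s += 5               # words appearing in the title count extra
    return s * importance        # "important" pages rank higher overall

pages = [
    ("Grilled Cheese Recipes", "grilled cheese grilled cheese tips", 1.0),
    ("Cheese History", "cheese was first made long ago", 2.0),
]
query = ["grilled", "cheese"]
ranked = sorted(pages, key=lambda p: score(p[0], p[1], query, p[2]), reverse=True)
print([title for title, _, _ in ranked])  # → ['Grilled Cheese Recipes', 'Cheese History']
```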
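The stabilizing iteration described on the "PageRank continued" slide can be illustrated with the standard power-iteration form of PageRank (using a damping factor rather than the slide's hop-by-hop weighting; the link graph below is a made-up example echoing the ESPN/Breimer intuition).

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank: start all pages equal, let rank flow along
    links each round, and repeat until the ranks stabilize."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # initially all pages equal
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = rank[p] / len(outs)      # split rank among out-links
                for q in outs:
                    new[q] += damping * share
            else:                                # page with no out-links:
                for q in pages:                  # spread its rank over all pages
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Toy link graph: "espn" is pointed to by other pages, "breimer" by none.
links = {
    "espn": ["news"],
    "news": ["espn"],
    "blog": ["espn"],
    "breimer": [],
}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # → espn
```

As the slides' intuition predicts, the page that other pages point to ends up with the highest rank, while the page with no incoming links stays near the minimum.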