Siena CSIS 116 - Searching on the WWW The Google Phenomena

Searching on the WWW
The Google Phenomena
Snyder, pp. 119-141

Searching
- The best place to look for something is where it’s likely to be found
- Key to finding information.

Searching
- A lot of information can be found on the WWW
- But that is the flaw of the WWW:
  - Too much information
  - No organization
  - No structure

Organizing Information
- Classification
- Hierarchy
  - Categories with sub-categories
- What are some problems with hierarchies?

Problem With Hierarchies
- I want to find the requirements for the Minor in IS
[Diagram: site hierarchy with nodes Siena College; Academics, Financial Aid, Athletics; Degree Requirements, Forms, Schools; Science, CS, Arts; IS Minor]

Webs or Networks
- Multiple paths
[Diagram: the same nodes (Siena College; Academics, Financial Aid, Athletics; Degree Requirements, Forms, Schools; Science, CS, Arts; IS Minor) connected as a network]

Problem With Web Networks
- Less important to get things in the correct category
- Information architects don’t worry too much about
  - Classification
  - Organization

Problem With Web Networks (continued)
- As different networks of information are connected
  - Excessive redundant links emerge
  - Different organization strategies clash
  - A mess is created

Search Engines to the Rescue
- Alternative to searching via navigation
- http://www.marketwatch.com/newsimages/misc/search_engines_timeline.pdf

How Search Engines Work
1. Web Crawling: a program (robot) surfs from hyperlink to hyperlink, accumulating pages
2. Web Indexing: each accumulated page is added to a database.
   - The URL of the web page is stored
   - Each word, its occurrences, and sometimes its positions are stored.
3. User Search: actually searches the index database
4. Ranking: sophisticated algorithms are used to retrieve and rank pages that “match” the user search.

Step 1: Web Crawling
- Hardest task.
- 5-10 million new web pages are added to the Internet every day.
- Robots need to know where to start looking
- You need the help and cooperation of web page creators.

Step 2: Web Indexing
- Each URL consists of a list of words
  ...
  URL1: word5, word74, word195, word456
  URL2: word7, word82, word135
  URL3: word5, word74, word165, word288
  URL4: word21, word59
  URL5: word25, word74, word188, word432
  URL6: word7, word186, word430
  URL7: word2, word398
  URL8: word34, word39, word84, word193
  ...

Step 2: Web Indexing (continued)
- Inverted index: each word consists of a list of URLs
  word1: URL19, URL39, URL82, URL91
  word2: URL27, URL41, URL66, URL67
  word3: URL49, URL75, URL65
  word4: URL29, URL89
  word5: URL12, URL48, URL66
  word6: URL53, URL73, URL123, URL144
  word7: URL3, URL41, URL77
  ...

Step 3: User Search
- Searching the index database must be quick.
- The database is sorted by key words (primary index)
- The English language has about 600,000 words
- Luckily, only about a tenth of them are widely used
- The database server needs to store the primary index in memory (RAM).

Step 4: Ranking the results
- Searches on common words can return millions of pages.
- Ordering or ranking becomes more important as the data increases
- Intuitive measures
  - Number of occurrences of search words
  - Search word in title, keyword, etc.
  - “Importance” of web page
  - User feedback.

Search Engine Issues
- Logical statements: AND, OR, etc.
- Phrases: “Grilled Cheese”
- Images: Dali example
- Dishonesty: XXX example
- Differences in vocabulary: IBM issue

Search Engines (Catch 22)
- Search engine companies make money by placing ads.
- More searches = bigger audience = more $$$ from ads
- Best thing: get as many people to use your search engine as possible
- Worst thing: what if everyone exclusively uses search engines to search the WWW?

How Google became the best
- PageRank algorithm (based on the Clever Algorithm)
- PageRank is a measure of importance.
- Links from important pages improve your PageRank
[Figure: example link graph with each page labeled by its PageRank value]

PageRank continued
- http://www.iprcom.com/papers/pagerank/
- Simplistic explanation:
  - Initially all pages have the same PageRank
  - An iterative process increases the page rank of all pages based on
    - direct links first (highly weighted, *1)
    - then, one-hop links
    - then, two-hop links
    - ... then, ten-hop links (very low weighting, *0.001) ...
  - The algorithm ignores cycles
  - The algorithm does not reward cliques
  - Eventually, the page ranks will stabilize (stop increasing)
  - The process repeats until the page ranks stabilize

PageRank intuition
- ESPN.com is highly ranked because
  - Several other highly ranked pages point to it
  - Millions of low-ranked pages point to it
- Any page connected to or part of ESPN.com will benefit from this.
- Intuitively, ESPN is an information authority on sports.

PageRank intuition (continued)
- Breimer.net is poorly ranked because
  - Very few pages point to it. None, to be exact.
- The page is not an authority on “Breimer” until other pages acknowledge its existence via a

