Seek and Ye Shall Find
2/28/2006
COS 116
Instructor: Sanjeev Arora

The continuum of computer “intelligence”

Recap: Binary Representation
Powers of 2:
2^0  2^1  2^2  2^3  2^4  2^5  2^6  2^7  2^8  2^9  2^10
1    2    4    8    16   32   64   128  256  512  1024
2^10 = 1024 ≈ 10^3
Fact: Every positive integer can be uniquely represented as a sum of distinct powers of 2.
Ex: 25 = 16 + 8 + 1 = 1 × 2^4 + 1 × 2^3 + 0 × 2^2 + 0 × 2^1 + 1 × 2^0
[25]_2 = 11001

Misconceptions about Computers
- Just a calculator on steroids
- Just maintains large amount of data
- Just does what programmer tells it
Yes, but …
- Weather Forecast
- Airline Reservation System

Various meanings of search
- Look up “Shirley Tilghman” in online phonebook.
- In consumer database, find “credit-worthy” consumers.
- Find web pages relevant to “computer music.”
- Among all cell phone conversations originating in Country X, identify suspicious ones.
- Search all religion and philosophy books of the world for meaning of life.

These are major scientific problems with many components
- Engineering
- Algorithms
- Statistical Modeling
- Ethics, Policy, Society
- Linguistics

Electronic Phonebook
- ASCII: Agreed-upon convention for representing letters with numbers
- Example: “Tilghman,258-6100” is stored as

  T   i   l   g   h   m   a   n   ,   2   5   8   -   6   1   0   0
  84  105 108 103 104 109 97  110 44  50  53  56  45  54  49  48  48

- Sorted Phonebook = sorted array of numbers
- Use binary search

Rest of the lecture: Web Search

World Wide Web (simplified view)
- URL: Unique address for each document
[Diagram labels: Browser, Web Page, Hyperlink]

Future lecture: Physical infrastructure of the Web
- Routers, gateways, DNS, etc.

Logical Structure of the Web
- Important: This logical structure is created by independent actions of 100s of millions of users
- “Directed graph”: “edges” = link from one node to another

1st step for search engines: create snapshot of the web
Webcrawler: Browser on autopilot
- Maintains array of web pages it has seen
- 2 types of pages: “visited”, “fully explored”
- Do forever
  {
    Pick any webpage marked “visited” from array.
    Mark it “fully explored.”
    Open all its linked pages in browser.
    Save them in array and mark them “visited.”
  }

Feasibility Calculation
- About 15 billion web pages today.
- Say 10 Kilobytes (10,000 bytes) of data per page
- 15 × 10^13 bytes to store the web ≈ 150,000 GB ≈ 500 hard disks (about $150,000)

Searching for “Computer Music”
Ideas?
- Identify all pages that contain “Computer Music.”
- Sort according to number of occurrences of “computer music” in the page.
- Human staff computes answers to all possible questions.

Some pitfalls
- “Spamming” by unscrupulous websites
- Synonymy
- Polysemy

Solution
- IBM’s CLEVER (1996)
- Google’s PAGERANK (1997)
Take advantage of the link structure of the web
Web link confers “approval”

CLEVER
Typically, hubs point to authorities, and good authorities are pointed to by good hubs.
- Hubs: Clearinghouses of information
  e.g. “My favorite computer music links”
- Authorities: Sites that are viewed “with respect” by many
  e.g. New York Times, International Computer Music Association
Circular Definition?
Circular Definition: see Definition, Circular

Breaking Circularity
- Iterative algorithm
- Start with: pages containing “Computer music” and all pages they point to
- At every step each page has: “Hub Score” and “Authority Score” (initially all 1)

Score Calculation
- Do forever
  {
    Next Hub Score for page = sum of current Authority Scores of pages that it links to.
    Next Authority Score for page = sum of current Hub Scores of pages that link to it.
  }
Fact: The scores converge. (Proof uses Linear Algebra, Eigenvalues)
- By-product: Algorithm reveals clusters
  Example: “Abortion” splits into Pro-Choice and Pro-Life clusters
- Data Mining: Process of finding answers that are not in the data and must be inferred.
  Example: “How is a person who shops at Whole Foods & REI likely to vote?”
  Computer models and jurisprudence, Aug 25th 2005 [Fowler and Jeon, ’05]

Concerns
From users:
- Privacy
- Privacy
- Privacy
From computer scientists:
- Formalize privacy
- How to safeguard privacy while allowing legitimate computations

Qs for next time: What is computation? What can computers not do?
Also, 10-min discussion of readings for today’s
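
Sketch: hub and authority scores
The iterative score calculation from the CLEVER slides can be sketched roughly as below. This is a minimal illustration, not IBM's implementation. It assumes the standard hub/authority update (a page's hub score sums the authority scores of the pages it links to; its authority score sums the hub scores of the pages linking to it). The function name `hits_scores`, the fixed iteration count in place of “do forever”, and the rescaling of scores after every step (so the numbers stay bounded) are assumptions added for the example.

```python
import math


def normalize(scores):
    """Rescale scores to unit Euclidean length (an assumption, to keep values bounded)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits_scores(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)

    # Reverse index: for each page, which pages link to it.
    incoming = {p: [] for p in pages}
    for src, targets in links.items():
        for dst in targets:
            incoming[dst].append(src)

    hub = {p: 1.0 for p in pages}    # initially all 1
    auth = {p: 1.0 for p in pages}   # initially all 1
    for _ in range(iterations):
        # Both updates read the *current* scores, as on the slide.
        next_hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        next_auth = {p: sum(hub[q] for q in incoming[p]) for p in pages}
        hub, auth = normalize(next_hub), normalize(next_auth)
    return hub, auth


# Hypothetical toy web: two link collections pointing at three sites.
toy_links = {
    "links_page_a": ["nytimes", "icma"],
    "links_page_b": ["nytimes", "icma", "obscure"],
}
hub, auth = hits_scores(toy_links)
```

On this made-up graph, the two sites cited by both link pages end up with higher authority scores than the site cited by only one, and the link pages are the only ones with nonzero hub scores.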
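
Sketch: webcrawler loop
The webcrawler loop from the “create snapshot of the web” slide can be sketched as follows. This is only an illustration under simplifying assumptions: the web is a small in-memory dictionary (the hypothetical `toy_web`) instead of pages fetched over the network, the function name `crawl` is made up, and the loop stops once no page is left marked “visited” rather than running forever.

```python
def crawl(web, start):
    """web maps each URL to the list of URLs it links to (a stand-in for opening pages)."""
    status = {start: "visited"}   # the "array" of web pages seen so far
    frontier = [start]            # pages currently marked "visited"
    while frontier:
        page = frontier.pop()     # pick any webpage marked "visited"
        status[page] = "fully explored"
        for linked in web.get(page, []):
            if linked not in status:      # save only pages not seen before
                status[linked] = "visited"
                frontier.append(linked)
    return status


# Hypothetical toy web; "e" links in but nothing links to it.
toy_web = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["d"],
    "d": [],
    "e": ["a"],
}
snapshot = crawl(toy_web, "a")
```

Starting from "a", the crawl reaches "b", "c", and "d" but never sees "e", which illustrates that a crawler only finds pages reachable by following links from its starting points.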