Princeton COS 116 - Lecture

This preview shows page 1-2-3-26-27-28 out of 28 pages.


Slide titles:
- Seek and Ye shall Find
- Recap: Binary Representation
- Misconceptions about Computers
- Various meanings of
- These are major scientific problems with many components
- Looking up “Shirley Tilghman” in Electronic Phonebook
- Rest of the lecture: Web Search
- Future lecture: Internet (physical infrastructure underlying Web)
- Logical Structure of the Web
- 1st step for search engines: create snapshot of the web
- First Web Crawler
- Still Feasible Today?
- Searching for “computer music”
- Some pitfalls
- Solution
- CLEVER
- Breaking Circularity
- Score Calculation
- Concerns
- Trends in web search
- Shape of things to come:
- Next Time…

Seek and Ye shall Find
COS 116: 2/21/2008
Sanjeev Arora

The continuum of computer “intelligence”

Recap: Binary Representation
Powers of 2:
2^0  2^1  2^2  2^3  2^4  2^5  2^6  2^7  2^8  2^9  2^10
1    2    4    8    16   32   64   128  256  512  1024
2^10 = 1024 ≈ 10^3
Fact: Every positive integer can be uniquely represented as a sum of distinct powers of 2.
Ex: 25 = 16 + 8 + 1 = 1 x 2^4 + 1 x 2^3 + 0 x 2^2 + 0 x 2^1 + 1 x 2^0
[25]_2 = 11001

Misconceptions about Computers
- Just a calculator on steroids
- Just maintains large amounts of data
- Just does what the programmer tells it
Yes, but …
- Weather Forecast
- Airline Reservation System

Various meanings of
- Look up “Shirley Tilghman” in online phonebook.
- In consumer database, find “credit-worthy” consumers.
- Find web pages relevant to “computer music.”
- Among all cell phone conversations originating in Country X, identify suspicious ones.
- Search all religion and philosophy books of the world for meaning of life.
“Data Mining” “Web Search”

These are major scientific problems with many components
- Engineering
- Algorithms
- Statistical Modeling
- Ethics, Policy, Society
- Linguistics

Discussion Time
How do you solve this task: given a sorted array of n numbers, find if it contains 58780?
Binary search!
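The binary search named on this slide can be sketched in Python as follows. This is a minimal illustration rather than code from the lecture: the function name `contains` and the sample array are assumptions made up for the example.

```python
def contains(sorted_nums, target):
    """Return True if target occurs in sorted_nums (ascending order)."""
    lo, hi = 0, len(sorted_nums) - 1
    while lo <= hi:
        mid = (lo + hi) // 2          # middle of the range still in play
        if sorted_nums[mid] == target:
            return True
        if sorted_nums[mid] < target:
            lo = mid + 1              # target can only be in the upper half
        else:
            hi = mid - 1              # target can only be in the lower half
    return False


# Hypothetical sample data, chosen only for illustration.
numbers = [12, 407, 5031, 58780, 60211, 99999]
print(contains(numbers, 58780))
print(contains(numbers, 58781))
```

Each pass through the loop halves the range, so an array of n numbers needs roughly log2(n) comparisons; since 2^10 ≈ 10^3, a million entries take about 20 steps.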
First thing to check: “Is A[n/2] < 58780?”
(Whatever the answer, you halve the range.)
Question: What if the array of numbers is not sorted?

Looking up “Shirley Tilghman” in Electronic Phonebook
ASCII: Agreed-upon convention for representing letters with numbers
Example:
T    i    l    g    h    m    a    n    ,    2    5    8    -    6    1    0    0
84   105  108  103  104  109  97   110  44   50   53   56   45   54   49   48   48
Sorted Phonebook = sorted array of numbers
Use binary search (prev. slide)
Ideas?

Rest of the lecture: Web Search
Future lecture: Internet (physical infrastructure underlying Web)
- Routers, gateways, DNS, ... (any computer can send a msg to any other)

What is World Wide Web?
- Files residing on “servers” that are connected to internet.
- A file “index.html” in “public_html” directory on some server belonging to PU.
- URL (uniform resource locator): basically an “address”
- “hyperlinks”: URLs of other files; could be on another server.

Logical Structure of the Web
Important: This logical structure is created by independent actions of 100s of millions of users
“Directed graph”
“edges” = link from one node to another

1st step for search engines: create snapshot of the web
Webcrawler: “browser on autopilot”
- Maintains array of web pages it has seen
- 2 types of pages: “visited”, “fully explored”
- Do forever {
    Pick any webpage marked “visited” from array.
    Mark it “fully explored.”
    Open all its linked pages in browser.
    Save them in array and mark them “visited.”
  }
Better: just the pages not “fully explored” yet.

First Web Crawler
From: [email protected] (Brian Pinkerton)
Newsgroups: comp.infosystems.announce
Subject: The WebCrawler Index: A content-based Web index
Date: 11 June 1994 21:33:42 GMT
Organization: University of Washington

The WebCrawler Index is now available for searching! The index is broad: it contains information from as many different servers as possible. It's a great tool for locating several different starting points for exploring by hand. The current index is based on the contents of documents located on nearly 4000 servers, world-wide.
Check it out at: http://www.biotech.washington.edu/WebCrawler/WebQuery.html

Other information is available from there, including a description of the WebCrawler (the robot itself), and a list of the 25 most frequently referenced sites on the Web.

Brian Pinkerton
Dept of Computer Science and Engineering
University of Washington
[http://thinkpink.com/bp/WebCrawler/History.html]

Still Feasible Today?
- About 15 billion web pages today (could be off by 2x).
- Say 10 KB (10,000 bytes) of data per page
- 15 x 10^13 bytes to store the web
  ≈ 150,000 GB
  ≈ 500 hard disks
  ≈ $50,000 in ’07

Searching for “computer music”
Ideas?
- Identify all pages that contain “computer music”.
- Sort according to number of occurrences of “computer music” in the page.
- Human staff computes answers to all possible questions.

Some pitfalls
- “Spamming” by unscrupulous websites
- Synonymy (car, auto, vehicle …)
- Polysemy (jaguar: car or cat?)

Solution
- IBM’s CLEVER (1996)
- Google’s PAGERANK (1997)
Take advantage of the link structure of the web
Web link confers “approval”

CLEVER
Typically authorities point to hubs and hubs point to authorities
Hubs: Clearinghouses of information
- “My favorite computer music links”
Authorities: Sites that are viewed “with respect” by many
- New York Times
- International Computer Music Association
Circular Definition?
Circular Definition: see Definition, Circular

Breaking Circularity
Iterative algorithm
Start with:
- Pages containing “Computer music”
- All pages they point to
At every step each page has:
- “Hub Score”
- “Authority Score”
Initially all 1

Score Calculation
- Do forever {
    Next Hub Score for page = Sum of current Authority Scores of pages it links to.
    Next Authority Score for page = Sum of current Hub Scores of pages that link to it.
  }
Fact: The scores converge. (Proof uses Linear Algebra, Eigenvalues)

Computer models and jurisprudence
Aug 25th 2005 [Fowler and Jeon, ’05]

- By-product of CLEVER algorithm: it reveals clusters
  Example: “Abortion” splits into Pro-Choice and Pro-Life
- Data Mining: Process of finding answers that are not in the data and must be inferred.
  Example: “How is a person who shops at Whole Foods & REI likely to vote?”

Concerns
From users:
- Privacy
- Privacy
- Privacy
From computer scientists:
- Formalize privacy
- How to safeguard privacy while allowing legitimate computations

“Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences” (top prize: $1M)

Trends in web
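The hub and authority iteration from the “Score Calculation” slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the CLEVER implementation: it uses the standard convention that a page's authority score sums the hub scores of pages linking to it, while its hub score sums the authority scores of pages it links to. The function name, the fixed number of rounds standing in for “do forever”, the rescaling step, and the tiny sample graph are all hypothetical choices made for the example.

```python
def hub_authority_scores(links, rounds=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}         # initially all 1
    auth = {p: 1.0 for p in pages}

    for _ in range(rounds):               # stands in for "do forever"
        # Next authority score: sum of current hub scores of pages linking in.
        next_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                next_auth[dst] += hub[src]
        # Next hub score: sum of current authority scores of pages linked to.
        next_hub = {p: sum(auth[dst] for dst in links.get(p, []))
                    for p in pages}

        # Assumption: rescale so the largest score is 1; without some
        # rescaling the raw sums would keep growing from round to round.
        a_max = max(next_auth.values()) or 1.0
        h_max = max(next_hub.values()) or 1.0
        auth = {p: s / a_max for p, s in next_auth.items()}
        hub = {p: s / h_max for p, s in next_hub.items()}
    return hub, auth


# Hypothetical four-page web, for illustration only.
links = {
    "fan_links": ["icma", "nyt"],   # a "my favorite links" style page
    "music_blog": ["icma"],
    "icma": [],
    "nyt": [],
}
hub, auth = hub_authority_scores(links)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

In this toy graph the page with the most outgoing links to well-regarded pages ends up as the top hub, and the page most linked to by good hubs ends up as the top authority.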

