Unformatted text preview:

Lecture 7 April 30 2001 Prabhakar Raghavan Finish up web ranking Peer to peer search Search deployment models Service vs software External vs internal facing search software Review of search topics Increase weights of terms in titles Increase weights of terms in h tags Increase weights of terms near the beginning of the doc its chapters and sections key phrases Tiger image Here is a great picture of a tiger Cool tiger webpage The text in the vicinity of a hyperlink is descriptive of the page it points to When indexing a page also index the anchor text of links pointing to it To weight links in the hubs authorities algorithm from the last lecture Anchor text usually taken to be a window of 6 8 words around a link anchor When indexing a document D include anchor text from links pointing to D Armonk NY based computer giant IBM announced today www ibm com Joe s computer hardware links Compaq HP IBM Big Blue today announced record profits for the quarter Can sometimes have unexpected side effects e g evil empire Can index anchor text with less weight In hub authority link analysis can match anchor text to query then weight link h x a y h x w x y a y h y a x w x y h y x a x y y x x y y x What is w x y Should increase with the number of query terms in anchor text Say 1 number of query terms x Armonk NY based computer giant IBM announced today www ibm com Weight of this link for query Computer is 2 y Recall basic algorithm Iteratively update all h x a x After iteration output pages with highest h scores as top hubs highest a scores as top authorities Now use weights in iteration Raises scores of pages with heavy links Do we still have convergence of scores To what Lots of pages in a site give varying aspects of information on the same topic Treat portions of web sites as a single entity for score computations Links on a page tend to point to the same topics as neighboring links Break pages down into pagelets say separate by tags and compute a hub authority score for each pagelet Ron Fagin s links Logic links Moshe Vardi s logic page International logic symposium Paper on modal logic My favorite football team The 49ers Why the Raiders suck Steve s homepage The NFL homepage The WWW is full of free spirited opinion annotation authority conferral Most other forms of hypertext are far more structured enterprise intranets are regimented and templated very little free form community formation web derived link ranking doesn t quite work Powerful new ideas derived from sociology of web content creation Supplemented by other heuristics Less useful in intranets Challenges from dynamic html Application servers and web content management systems For each query Q keep track of which docs in the results are clicked on On subsequent requests for Q re order docs in results based on click throughs First due to DirectHit AskJeeves j Docs q Queries Bqj number of times doc j clicked through on query q When query q issued again order docs by Bqj values Weighing combining text and click based scores What identifies a query Ferrari Mondial Ferrari Mondial Ferrari mondial ferrari mondial Ferrari Mondial Can use heuristics but search parsing slowed Maintain a term doc popularity matrix C as opposed to query doc popularity initialized to all zeros Each column represents a doc j If doc j clicked on query q update Cj Cj q here q is viewed as a vector On a query q compute its cosine proximity to Cj for all j Combine this with the regular text score Normalization of Cj after updating Boolean operators Why did the user click on the doc Updating live or batch All votes count the same More on this in recommendation systems Time spent viewing page Difficult session management Inconclusive modeling so far Does user back out of page Does user stop searching Does user transact No central index Each node in a network builds and maintains own index Each node has servent software On booting servent pings 4 other hosts Connects to those that respond Initiates propagates and serves requests The ones you connected to last time Random hosts you know of Request suggestions from central or hierarchical nameservers All govern system s shape and efficiency Send your request to your neighbors They send it to their neighbors decrement time to live for query query dies when ttl 0 Send search matches back along requesting path Fresh content no waiting for the next weekly indexing Dynamic content results could be assembled from a database or other repository live pricing inventory information Internet Query interpretation up to servent spamming potential No co ordination in network fragmentation Enterprises security and access control administration distributed replication and caching Intranet vs extranet As a service public e g web search access protected e g proprietary newsfeeds and content As software Outward facing Walmart CDNow Inward facing within an enterprise Ease of maintenance software as well as indices Can tune to platform To date not much proprietary content owners of valuable content don t hand over custody Inward vs outward facing very different characteristics corpus sizes query rates languages and localization security content management Relatively small corpora typically under 1GB Sporadic query rates high peak loads Fairly dynamic corpus item prices in a catalog Product database RDBMS w product info prices descriptions Search engine spiders DB indexes structured unstructured product info Application server content assembly personalization Web server Back end inventory RDBMS to complete the transaction Broker Broker Broker Search servers By documents Each server has a subset of the docs Each has its own dictionary Query sent out to all servers Broker ensures load balancing failover By terms Each server has a subset of the lexicon Query sent to server s with the query term s Partition alphabetically easy query dispatch Partition by hashing uniform spread Query optimization is hard Works best when query terms are uniformly spread across servers Search within an intranet Enterprise portals Enterprise doesn t have to be a for profit company government academe any collaborative group with proprietary information Scale lots of docs geographically distributed over non uniform WAN Multiple languages and character sets Locale modules for stemming thesauri Multiple document repositories Lotus Exchange Documentum Filenet Materialized views of compound documents Multiple formats pdf MS office multiple MIME type attachments Each doc


View Full Document

Stanford CS 347 - Lecture Notes

Download Lecture Notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?