Unformatted text preview:

Topics du jour CS347 Lecture 7 April 30 2001 Prabhakar Raghavan Tag position heuristics Increase weights of terms in titles Increase weights of terms in h tags Increase weights of terms near the beginning of the doc its chapters and sections key phrases Finish up web ranking Peer to peer search Search deployment models Service vs software External vs internal facing search software Review of search topics Anchor text Tiger image Here is a great picture of a tiger Cool tiger webpage The text in the vicinity of a hyperlink is descriptive of the page it points to 1 Two uses of anchor text When indexing a page also index the anchor text of links pointing to it To weight links in the hubs authorities algorithm from the last lecture Anchor text usually taken to be a window of 6 8 words around a link anchor Indexing anchor text When indexing a document D include anchor text from links pointing to D Armonk NY based computer giant IBM announced today www ibm com Joe s computer hardware links Compaq HP IBM Indexing anchor text Can sometimes have unexpected side effects e g evil empire Can index anchor text with less weight Big Blue today announced record profits for the quarter Weighting links In hub authority link analysis can match anchor text to query then weight link h x a y h x w x y a y h y a x w x y h y x a x y y x x y y x 2 Weighting links Weighted hub authority computation What is w x y Should increase with the number of query terms in anchor text Say 1 number of query terms x Armonk NY based computer giant IBM announced today www ibm com y Weight of this link for query Computer is 2 Web sites not pages Lots of pages in a site give varying aspects of information on the same topic Recall basic algorithm Iteratively update all h x a x After iteration output pages with highest h scores as top hubs highest a scores as top authorities Now use weights in iteration Raises scores of pages with heavy links Do we still have convergence of scores To what Link neighborhoods Links on a page tend to point to the same topics as neighboring links Break pages down into pagelets say separate by tags and compute a hub authority score for each pagelet Treat portions of web sites as a single entity for score computations 3 Link neighborhoods Ron Fagin s links Logic links Moshe Vardi s logic page International logic symposium Paper on modal logic My favorite football team The 49ers Why the Raiders suck Steve s homepage The NFL homepage Link analysis search summary Powerful new ideas derived from sociology of web content creation Supplemented by other heuristics Less useful in intranets Challenges from dynamic html Application servers and web content management systems Web vs hypertext search The WWW is full of free spirited opinion annotation authority conferral Most other forms of hypertext are far more structured enterprise intranets are regimented and templated very little free form community formation web derived link ranking doesn t quite work Behavior based ranking For each query Q keep track of which docs in the results are clicked on On subsequent requests for Q re order docs in results based on click throughs First due to DirectHit AskJeeves 4 Query doc popularity matrix B j Docs q Queries Bqj number of times doc j clicked through on query q When query q issued again order docs by Bqj values Issues to consider Weighing combining text and click based scores What identifies a query Ferrari Mondial Ferrari Mondial Ferrari mondial ferrari mondial Ferrari Mondial Can use heuristics but search parsing slowed Vector space implementation Maintain a term doc popularity matrix C as opposed to query doc popularity initialized to all zeros Each column represents a doc j If doc j clicked on query q update Cj Cj q here q is viewed as a vector On a query q compute its cosine proximity to Cj for all j Combine this with the regular text score Issues Normalization of Cj after updating Boolean operators Why did the user click on the doc Updating live or batch All votes count the same More on this in recommendation systems 5 Variants Time spent viewing page Difficult session management Inconclusive modeling so far Does user back out of page Does user stop searching Does user transact Which hosts to connect to The ones you connected to last time Random hosts you know of Request suggestions from central or hierarchical nameservers All govern system s shape and efficiency Peer to peer P2P search No central index Each node in a network builds and maintains own index Each node has servent software On booting servent pings 4 other hosts Connects to those that respond Initiates propagates and serves requests Serving P2P search requests Send your request to your neighbors They send it to their neighbors decrement time to live for query query dies when ttl 0 Send search matches back along requesting path 6 The promise of P2P Fresh content no waiting for the next weekly indexing Dynamic content results could be assembled from a database or other repository live pricing inventory information P2P search issues Internet Query interpretation up to servent spamming potential No co ordination in network fragmentation Enterprises security and access control administration distributed replication and caching Search deployment models As a service Search deployment Intranet vs extranet public e g web search access protected e g proprietary newsfeeds and content As software Outward facing Walmart CDNow Inward facing within an enterprise 7 Service deployment issues Ease of maintenance software as well as indices To date not much proprietary content Can tune to platform owners of valuable content don t hand over custody Outward facing search software Relatively small corpora typically under 1GB Sporadic query rates high peak loads Fairly dynamic corpus item prices in a catalog Software deployment Inward vs outward facing very different characteristics corpus sizes query rates languages and localization security content management Typical eCommerce search setup Product database RDBMS w product info prices descriptions Search engine spiders DB indexes structured unstructured product info Application server content assembly personalization Web server Back end inventory RDBMS to complete the transaction 8 Scaling search servers By documents Broker Broker Partitioning the index Broker Each server has a subset of the docs Each has its own dictionary Query sent out to all servers Broker ensures load balancing failover Search servers


View Full Document

Stanford CS 347 - Lecture Notes

Download Lecture Notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?