Stanford CS 276 - Lecture 15 - Web Search Basics


Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 15: Web search basics

Brief (non-technical) history
- Early keyword-based engines ca. 1995-1997
  - Altavista, Excite, Infoseek, Inktomi, Lycos
- Paid search ranking: Goto (morphed into Overture.com → Yahoo!)
  - Your search ranking depended on how much you paid
  - Auction for keywords: casino was expensive!
- 1998+: Link-based ranking pioneered by Google
  - Blew away all early engines save Inktomi
  - Great user experience in search of a business model
  - Meanwhile Goto/Overture's annual revenues were nearing $1 billion
- Result: Google added paid search "ads" to the side, independent of search results
  - Yahoo followed suit, acquiring Overture (for paid placement) and Inktomi (for search)
- 2005+: Google gains search share, dominating in Europe and very strong in North America
  - 2009: Yahoo! and Microsoft propose combined paid search offering

[Figure: search results page with two labeled regions: Algorithmic results, Paid Search Ads]

Web search basics (Sec. 19.4.1)
[Figure: search engine architecture with labeled components: User, Search, The Web, Web spider, Indexer, Indexes, Ad indexes]

User Needs (Sec. 19.4.1)
- Need [Brod02, RL04]
  - Informational: want to learn about something (~40% / 65%)
  - Navigational: want to go to that page (~25% / 15%)
  - Transactional: want to do something (web-mediated) (~35% / 20%)
    - Access a service
    - Downloads
    - Shop
  - Gray areas
    - Find a good hub
    - Exploratory search "see what's there"

How far do people look for results?
[Figure] (Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)

Users' empirical evaluation of results
- Quality of pages varies widely
  - Relevance is not enough
  - Other desirable qualities (non IR!!)
    - Content: Trustworthy, diverse, non-duplicated, well maintained
    - Web readability: display correctly & fast
    - No annoyances: pop-ups, etc.
- Precision vs. recall
  - On the web, recall seldom matters
- What matters
  - Precision at 1? Precision above the fold?
  - Comprehensiveness: must be able to deal with obscure queries
    - Recall matters when the number of matches is very small
- User perceptions may be unscientific, but are significant over a large aggregate

Users' empirical evaluation of engines
- Relevance and validity of results
- UI: Simple, no clutter, error tolerant
- Trust: Results are objective
- Coverage of topics for polysemic queries
- Pre/Post process tools provided
  - Mitigate user errors (auto spell check, search assist, …)
  - Explicit: Search within results, more like this, refine ...
  - Anticipative: related searches
- Deal with idiosyncrasies
  - Web specific vocabulary
    - Impact on stemming, spell-check, etc.
  - Web addresses typed in the search box
  - "The first, the last, the best and the worst …"

The Web document collection (Sec. 19.2)
- No design/co-ordination
- Distributed content creation, linking, democratization of publishing
- Content includes truth, lies, obsolete information, contradictions …
- Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (Databases) …
- Scale much larger than previous text collections … but corporate records are catching up
- Growth: slowed down from initial "volume doubling every few months" but still expanding
- Content can be dynamically generated
[Figure: The Web]

SPAM (SEARCH ENGINE OPTIMIZATION)

The trouble with paid search ads … (Sec. 19.2.2)
- It costs money. What's the alternative?
- Search Engine Optimization:
  - "Tuning" your web page to rank highly in the algorithmic search results for select keywords
  - Alternative to paying for placement
  - Thus, intrinsically a marketing function
- Performed by companies, webmasters and consultants ("Search engine optimizers") for their clients
- Some perfectly legitimate, some very shady

Search engine optimization (Spam) (Sec. 19.2.2)
- Motives
  - Commercial, political, religious, lobbies
  - Promotion funded by advertising budget
- Operators
  - Contractors (Search Engine Optimizers) for lobbies, companies
  - Web masters
  - Hosting services
- Forums
  - E.g., Web master world (www.webmasterworld.com)
    - Search engine specific tricks
    - Discussions about academic papers

Simplest forms (Sec. 19.2.2)
- First generation engines relied heavily on tf/idf
  - The top-ranked pages for the query maui resort were the ones containing the most maui's and resort's
- SEOs responded with dense repetitions of chosen terms
  - e.g., maui resort maui resort maui resort
  - Often, the repetitions would be in the same color as the background of the web page
    - Repeated terms got indexed by crawlers
    - But not visible to humans on browsers
Pure word density cannot be trusted as an IR signal

Variants of keyword stuffing (Sec. 19.2.2)
- Misleading meta-tags, excessive repetition
- Hidden text with colors, style sheet tricks, etc.
Meta-Tags = "… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …"

Cloaking (Sec. 19.2.2)
- Serve fake content to search engine spider
- DNS cloaking: Switch IP address. Impersonate
[Figure: cloaking decision diagram with labels: "Is this a Search Engine spider?", Y, N, SPAM, Real Doc]

More spam techniques
- Doorway pages
  - Pages optimized for a single keyword that re-direct to the real target page
- Link spamming
  - Mutual admiration societies, hidden links, awards (more on these later)
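
The point of the "Simplest forms" slide, that pure word density cannot be trusted as a ranking signal, can be illustrated with a small sketch. It assumes a toy scorer that sums raw term frequency times a smoothed idf over the query terms. The function name tf_idf_score, the three sample pages, and the smoothing are illustrative assumptions, not the ranking function of any actual engine.

```python
import math
from collections import Counter


def tf_idf_score(query, doc, corpus):
    """Toy score: sum over query terms of raw term count * smoothed idf."""
    doc_terms = Counter(doc.lower().split())
    n_docs = len(corpus)
    score = 0.0
    for term in query.lower().split():
        # Document frequency: number of pages containing the term at least once.
        df = sum(1 for d in corpus if term in d.lower().split())
        if df == 0:
            continue
        # Assumed smoothing (n_docs + 1) keeps idf positive in a tiny corpus.
        idf = math.log((n_docs + 1) / df)
        score += doc_terms[term] * idf
    return score


# Hypothetical pages: one honest, one keyword-stuffed, one off-topic.
honest = "family resort on maui with beach access and a pool"
stuffed = "cheap pills " + "maui resort " * 20
unrelated = "notes on web crawling and index construction"
corpus = [honest, stuffed, unrelated]

# Rank pages for the query used on the slide.
ranked = sorted(
    corpus,
    key=lambda d: tf_idf_score("maui resort", d, corpus),
    reverse=True,
)
```

Under these assumptions the stuffed page ranks first, because its score grows linearly with the number of repetitions while the honest page mentions each query term only once.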

