CS347 Lecture 11 May 16 2001 Prabhakar Raghavan Topics Link based clustering Enumerative clustering trawling Recommendation systems Link based clustering Given docs in hypertext cluster into k groups Back to vector spaces Set up as a vector space with axes for terms as well as for in and out neighbors Example 1 4 d 2 3 Vector of terms in d 5 1 2 3 4 5 1 2 3 4 5 1 1 1 0 0 0 0 0 1 1 In links Out links Clustering Given vector space representation run any of the clustering algorithms from lecture 8 Has been implemented on web search results Other corpora patents citation structures Back up In clustering we partition input docs into clusters In trawling we ll enumerate subsets of the corpus that look related will discard lots of docs Twist will use purely link based cues to decide whether docs are related Trawling enumerative clustering In hyperlinked corpora here the web Look for all occurrences of a linkage pattern Recall from hubs authorities search algorithm AT T Alice Sprint Bob MCI Insights from hubs Hub Authority Link based hypothesis Dense bipartite subgraph web community Communities from cores not easy since web is huge what is a dense subgraph define i j core complete bipartite subgraph with i nodes all of which point to each of j others Fans Centers 2 3 core Random graphs inspiration Every large enough dense bipartite graph almost surely has non trivial core e g large 3 by 10 dense 50 edges almost surely 90 chance non trivial 3 by 3 Approach Find all i j cores 3 i 10 3 j 20 Expand each core into its full community Finding cores SQL solution find all triples of pages such that intersection of their outlinks is at least 3 Too expensive Iterative pruning techniques actually work Initial data preprocessing Crawl then extract links Work with potential fans nodes with j non nepotistic links Eliminate mirrors Represent URLs by 2 32 64 bit hash Can sort URL s by either source or destination using disk run sorting Popular page elimination Don t want popular communities Yahoo Excite DejaNews webrings Popular community has popular page s Define popular page indegree 50 Main requirements Main memory conservation Few disk passes over data Simple iterative pruning Discard all pages of in degree i or outdegree j Repeat Why Reduces to a sequence of sorting operations Why on the edge list Elimination generation pruning x a y z a is part of a 3 3 core if and only if the intersection of inlinks of x y and z is at least 3 pick a node a of degree 3 for each a output neighbors x y z use an index on centers to output in links of x y z intersect to decide if a is a fan at each step either eliminate a page a or generate a core Exercise Work through the details of maintaining the index on centers to speed up eliminationgeneration pruning Results after pruning Elimination generation pruning yields 100K nonoverlapping cores for small i j 5M unpruned edges small enough for post processing by a priori build i 1 j cores from i j cores Exercise Adapt the a priori algorithm to enumerating bipartite cores Results for cores Thousands 100 80 60 40 i 3 5 4 6 20 0 3 5 7 9 Number of cores found by Elimination Generation Thousands 80 i 3 60 40 4 20 0 3 5 7 Number of cores found during postprocessing 9 Sample cores hotels in Costa Rica clipart Turkish student associations oil spills off the coast of Japan Australian fire brigades aviation aircraft vendors guitar manufacturers From cores to communities Use hubs authorities algorithm without text query use fans centers as samples Augment core with all pages pointed to by any fan all pages pointing into these all pages pointing into any center all pages pointed to by any of these Using sample hubs authorities Center Fan Costa Rican hotels and travel The Costa Rica Inte ion on arts busi Informatica Interna rvices in Costa Rica Cocos Island Research Center Aero Costa Rica Hotel Tilawa Home Page COSTA RICA BY INTER MERICA tamarindo com Costa Rica New Page 5 The Costa Rica Internet Directory Costa Rica Zarpe Travel and Casa Maria Si Como No Resort Hotels Villas Apartotel El Sesteo de San Jos Cos Spanish Abroad Inc Home Page Costa Rica s Pura V ry Reservation YELLOW RESPALDO HOTELES Orquide1 Costa Rica Summary Profile COST RICA MANUEL A EPOS VILLA Hotels and Travel in Costa Rica Nosara Hotels Res els Restaurants Costa Rica Travel Tourism Resorts Association Civica de Nosara Untitled http www ca hotels mimos html Costa Rica Healthy t Pura Vida Domestic International Airline HOTELES HOTELS COSTA RICA tourgems Hotel Tilawa Links Costa Rica Hotels T On line Reservations Yellow pages Costa Rica Export INFOHUB Costa Rica Travel Guide Hotel Parador Manuel Antonio Costa Rica Destinations Muslim student orgs USC Muslim Students ation Islamic Server The University of O a Domain Name Change Caltech Muslim Students Home Page Islamic Society of Stanford University University of Texas nformation Center CSUN Muslim Students Association homepage HUDA Islamic Gateway Muslim Students As iversity of Michigan About Islam and Muslims Carnegie Mellon Uni m Students Home Page Bookstore The Onli slamic Books Isl Islamic Texts and R University at Bu University of Warwick Islamic Society Muslim Students Ass at Lehigh University MSA of CSU El Sagrado Cor n Islamic Association Palestine Home Page Kutkut Islam Other MSAs and Organizations Other Resources rel versity at Buffal 777 Huma s Mamalist of Islamic Links Other MSAs ZUBAIR S ISLAM PAGE MIDDLE EAST CONFLICTS Islamic Links at the Arabic Paper Middle East Arab Hot Links MSA National MSAs Home Page Islamic Page Info about Muslims MSA SUNY Buffalo Untitled http www ev mideast islam htm Aalim Fevens Islam Home Page islam Links to MSAs THE ISLAM PAGE Recommendation systems Recommendation Systems Recommend docs to user based on user s context besides the docs content Other applications Re rank search results Locate experts Targeted ads Input Past transactions from users which docs viewed which products purchased pages bookmarked explicit ratings movies books Current context browsing history search es issued Explicit profile info Role in an enterprise Demographic info Interest profiles Example Users Docs viewed U1 d1 d2 U2 U1 viewed d1 d2 d3 U2 views d1 d2 Recommend d3 to U2 d3 Expert finding U1 d1 d2 U2 d3 In an enterprise setting recommend U1 to U2 as an expert Simple Algorithm U d1 d2 U viewed d1 d2 d5 d5 Look at who else viewed d1 d2 or d5 V W Recommend to U the doc s most popular among these users More formally D A U Aij 1 if user i viewed doc j
View Full Document