The influence of search engines on preferential attachmentThe paperBackgroundSlide 4The New ProblemSlide 6The Results in This PaperThe New ModelSlide 9Slide 10The New Model – CommentsSome Notations in the new modelFormal Definition of Process PSlide 14The New Model - CommentsThe simulation resultsSlide 17ResultsTheorem 1InterpretationsThe ProofCoupling Gt and Gt*Analysis of the degree distribution of Gt*Basic Proof to Lemma 2Lemma 3Basic Proof to Lemma 3The celebrity list get fixedList-fixing LemmaProof to Lemma 4Lemma 5Lemma 6Lemma 7Lemma 8Proof of Theorem 1Proof of Theorem 1 cont.Slide 36Slide 37Slide 38Slide 39ConclusionsIs this Model real?CommentsThank you!The influence of search engines on preferential attachmentDan LiCS3150 Spring 2006The paperThe influence of search engines on preferential attachmentSoumen Chakrabarti, Alan Frieze and Juan VeraBackgroundThe evolution of social networks through timeWeb graphModelsPreferential AttachmentCopying ModelBackgroundEvolution of the WebPower-lawPreferential attachment( Barabasi and Albert)Copying ModelThe author of a newborn page u picks a random reference page v from the web, and with some probability, copies out-links from v to u.Power-law: power ~ 2Organic EvolutionNO POWERFUL CENTRAL ENTIRY! 3)degreePr( kkThe New ProblemHow the page authors find existing pages and create links to them? Highly popular search engines limit the attention of the page authors to a small set of celebrity pages.Page authors frequently use search engines to locate pages, and include the HOT pages they visit (with probability p)The New ProblemThe evolution of the Web graph has been influenced permanently and pervasively by the existence of search engines.A search engine ranks a page highly,Authors find the page more often, some of them link to it, raising its in-degree and Pagerank, which leads to a further improvement or entrenchment of its rank.The Results in This PaperThe celebrity nodes eventually accumulate a constant fraction of all links created with high probabilityThe degree of the other nodes still follow a power-law distribution with a steeper power:3121 pPowerThe))1/(21()Pr(degpkkreeThe New ModelModeling how the web graphs evolves if the author use search engine to decide on links that they insert into new pages.How the degree distribution deviates from the traditional modelThe New ModelUndirected Web GraphQuery to the Search Engine is fixedThe search Engine returns a fix number of URLs ordered by their degree at the previous time-stepLimit the analysis to one topic at a time with out loss of generality??Comments: A new page may involve multiple topics at the same time and include different number of links for each topic.The New ModelGrowth process:Generates a sequence of graphs Gt, t =1,2,3,… At time t, the Graph Gt = (Vt, Et) has t vertices and mt edges.Parameters: p: a probabilityN: maximum number of celebrity nodes listed by the search engineThe New Model – Comments Comments:The number of links each new page creates is fixed? Is this real? How does this affect the results?Intuitively, the page author may not have a number in mind of how many links he wants to include, he will only determine whether a link will be included based on the content of that linkSome Notations in the new modelFormal Definition of Process PThe New ModelIn both cases yi is selected by preferential attachment within the target subset of old nodes, i.e. for x in UThe New Model - CommentsThe m random edges may have duplicate vertices. For different i, the same vertex may be selected! When t is smaller than m, we have a lot of loops.Should we not start from one vertex? Instead, we can start from m vertices or N vertices and the initial web graph is created at random.With high probability, the oldest links become celebrity page.What happens in the real world?A page becomes hot not only by random, but also due to its contents, can we model this??The simulation resultsVery different from the standard preferential attachment!The celebrities is far from the Power-Law straight line in log-log plot.As p increases, the power increases as well!P Simulated power Computed powerP = 0 2.8 3P = 0.3 3.96 3.857P = 0.6 5.9 6The celebrities command a constant fraction of the total degree over all nodes, this fraction grows with p.The simulation resultsResultsTheorem 1InterpretationsCelebrities capture a large? (depends on the constant) fraction of links.Non-celebrities follow a power-law degree distribution with a power steeper than in preferential attachment.The ProofThe celebrity list becomes fixed whp after some time tfOnce the celebrity list is fixed, process P looks very similar to an analogous process P*:In each step, P* takes the N oldest vertices as St, instead of the N largest-degree vertices.This is quite reasonable, basically, the oldest vertices have higher degree, since they have longer time to be includedCoupling Gt and Gt*Analysis of the degree distribution of Gt*Basic Proof to Lemma 2Finding recurrence of Finding a similar recurrence:Lemma 3Basic Proof to Lemma 3The celebrity list get fixedWHP, adding m edges to a single non-celebrity will not make it a celebrity.The total degree of celebrities is concentrated to a constant fraction of all edges ever added to the graphList-fixing LemmaProof to Lemma 4Lemma 5Lemma 6With low degree, the celebrity has low degreeLemma 7With low probability, the non-celebrity has high degreeLemma 8With low probability, the gap will keep smallProof of Theorem 1Lst tf to be the last time that St changes in the process PProof of Theorem 1 cont.Proof of Theorem 1 cont.Proof of Theorem 1 cont.Proof of Theorem 1 cont.Proof of Theorem 1 cont.ConclusionsModeling the influence of a search engine within the preferential attachment framework leads to a qualitative change in the familiar power-law degree distribution. Each of a clot of celebrities captures a constant fraction of the total degree of the graph, and the degree of the remaining nodes follow a steeper power law.Is this Model real?The model differs from the reality.Edges are undirected?Outlinks are not modified after creationPages do not dieNo topic-based clusteringCommentsThis model is used on to one topicThere may be
View Full Document