
Know your Neighbors: Web Spam Detection using the Web Topology

Carlos Castillo(1) [email protected]
Debora Donato(1) [email protected]
Aristides Gionis(1) [email protected]
Vanessa Murdock(1) [email protected]
Fabrizio Silvestri(2) [email protected]

(1) Yahoo! Research Barcelona, C/Ocata 1, 08003 Barcelona, Catalunya, SPAIN
(2) ISTI – CNR, Via G. Moruzzi 1, 56124 Pisa, ITALY

ABSTRACT

Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that combines link-based and content-based features, and uses the topology of the Web graph by exploiting the link dependencies among the Web pages. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam, tested on a large and public dataset, using algorithms that can be applied in practice to large-scale Web data.

Categories and Subject Descriptors: H.4.m [Information Systems Applications]: Miscellaneous
General Terms: Algorithms, Measurement.
Keywords: Link spam, Content spam, Web spam

1. INTRODUCTION

Traditional information retrieval algorithms were developed for relatively small and coherent document collections such as newspaper articles or book catalogs in a library. Very little, if any, of the content in such systems could be described as "spam." In comparison to these collections, the Web is massive, changes more rapidly, and is spread over geographically distributed computers [2].
Distinguishing between desirable and undesirable content in such a system presents a significant challenge, as every day more people are using search engines more often.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'07, July 23–27, 2007, Amsterdam, The Netherlands. Copyright 2007 ACM 978-1-59593-597-7/07/0007 ...$5.00.

Search engine spamming, also known as spamdexing, encompasses malicious attempts to influence the outcome of ranking algorithms, for the purpose of getting an undeservedly high rank. Obtaining a higher rank is strongly correlated with traffic, and a higher rank often translates to more revenue. Thus, there is an economic incentive for Web site owners to invest in spamming search engines, instead of investing in improving their Web sites. Spamming the Web is cheap, and in many cases, successful.

Web spam is not a new problem, and is not likely to be solved in the near future. According to Henzinger et al. [17], "Spamming has become so prevalent that every commercial search engine has had to take measures to identify and remove spam. Without such measures, the quality of the rankings suffers severely". Web spam damages the reputation of search engines and weakens the trust of their users [16]. For instance, Eiron et al. [12] ranked 100 million pages using PageRank [23] and found that 11 out of the top 20 were pornographic pages, which achieved such high ranking through link manipulation, indicating that the PageRank algorithm is highly susceptible to spam.
Spamming techniques are so widely known that there have even been spamming competitions (e.g., the contest to rank highest for the query "nigritude ultramarine" [11], among others).

From the perspective of the search engine, even if the spam pages are not ranked sufficiently high to annoy users, there is a cost to crawling, indexing and storing spam pages. Ideally, search engines would like to avoid spam pages altogether before they use resources that might otherwise be spent storing, indexing and ranking legitimate content.

Overview of our approach. We start by building an automatic classifier that combines a set of link-based and content-based features. In general, traditional machine learning methods assume that the data instances are independent. In the case of the Web there are dependencies among pages and hosts. One such dependency is that links are not placed at random and, in general, similar pages tend to be linked together more frequently than dissimilar ones [10]. Such a dependency holds also for spam pages and hosts: spam tends to be clustered on the Web. One explanation for this behavior is that spam pages often adopt link-based rank-boosting techniques such as link farming. These techniques can be as simple as creating a pool of pages linking to a page whose rank is to be raised. In practice spammers use sophisticated structures that are difficult to detect.

We investigate techniques that exploit the connections between spam hosts in order to improve the accuracy of our classifiers.

Figure 1: Graphical depiction of the host graph (undirected), pruned to include only labeled nodes with a connection of over 100 links between them. Black nodes are spam, white nodes are non-spam. Most of the spammers in the larger connected component are clustered together (upper-right end of the center portion). Most of the other connected components are single-class (either only spam nodes, or only non-spam nodes).
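The first of the three topology-based refinements listed in the abstract, clustering the host graph and relabeling each cluster by majority vote, can be sketched in a few lines. The sketch below is illustrative only: it assumes clusters and per-host base-classifier predictions have already been computed, and all names (`relabel_by_cluster`, the host identifiers) are hypothetical rather than taken from the paper.

```python
from collections import Counter

def relabel_by_cluster(predictions, clusters):
    """Assign every host in a cluster the majority label predicted by
    the base classifier; clusters with no clear majority are left as-is."""
    smoothed = dict(predictions)
    for hosts in clusters:
        counts = Counter(predictions[h] for h in hosts)
        ranked = counts.most_common(2) + [(None, 0)]
        top_label, top_count = ranked[0]
        if top_count > ranked[1][1]:  # clear majority in this cluster
            for h in hosts:
                smoothed[h] = top_label
    return smoothed

preds = {"a": "spam", "b": "spam", "c": "nonspam", "d": "nonspam"}
clusters = [["a", "b", "c"], ["d"]]
print(relabel_by_cluster(preds, clusters))
# {'a': 'spam', 'b': 'spam', 'c': 'spam', 'd': 'nonspam'}
```

Here host "c" is outvoted by its two spam cluster-mates, which captures the intuition that spam tends to be clustered on the Web.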
We assume that hosts that are well-linked together are likely to have the same class label (spam or non-spam). More generally, we can assume that two hosts in the same class should be connected by short paths going mostly through hosts in the same class.

Figure 1 shows a visualization of the host graph in the Web spam collection we are using (described in Section 3). An edge between two hosts is shown only if there are at least 100 links between the two hosts. In the figure, black nodes are spam and white nodes are non-spam. The layout of the nodes in the figure was computed using a spring model. For the larger connected component of the graph, we can see that spam
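The neighborhood assumption above can be turned into a simple smoothing step in the spirit of method (ii) from the abstract: propagating predictions to neighboring hosts. The sketch below is a generic averaging scheme over an undirected host graph, not the paper's exact propagation algorithm; the graph, parameter values, and function names are all illustrative assumptions.

```python
def propagate_labels(graph, spam_prob, rounds=2, alpha=0.5):
    """Smooth per-host spam probabilities over an undirected host graph:
    each round mixes a host's own score (weight alpha) with the mean
    score of its neighbors (weight 1 - alpha)."""
    scores = dict(spam_prob)
    for _ in range(rounds):
        updated = {}
        for host, neighbors in graph.items():
            if neighbors:
                neigh_mean = sum(scores[n] for n in neighbors) / len(neighbors)
                updated[host] = alpha * scores[host] + (1 - alpha) * neigh_mean
            else:  # isolated host: nothing to propagate
                updated[host] = scores[host]
        scores = updated
    return scores

hostgraph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
base = {"a": 0.9, "b": 0.8, "c": 0.1}
print(propagate_labels(hostgraph, base))
# host "c" moves toward its high-scoring neighbor "b"
```

After smoothing, a borderline host surrounded by likely-spam neighbors ends up with a higher spam score, while hosts in all-non-spam neighborhoods are pulled down, mirroring the single-class components visible in Figure 1.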

