GT CS 8803 - Automated Classification of Spam Based on Textual Features
School: Georgia Tech


CS 8803 – AIADProf Ling LiuProject Proposal for Automated Classification of Spam Based on Textual FeaturesGopal PaiUnder the supervision of Steve WebbMotivations and ObjectivesSpam, which was until recently a characteristic which was associated only with emails has spilled over to the web. Web Spam is a new problem which is prevalent due to people trying to find ways around search engine algorithms in the aim of getting their website ranked higher on search engines. In many cases, the content on these sites may be inferior and it may either not be of any benefit to the user to visit these sites or it may even be harmful if the site is malicious. This definition is a hazy one and classification of pages as spam may vary from person to person. In this project, we attempt to design a system which classifies a page based on its content into distinct categories of spam and ham. This system if successfully built would also provide a method to generate a heuristic for the quality of a page without having to store huge crawls and process these using multiple servers in real time. Although such a methodology might not be as reliable as a rank got from a regular web crawl, depending on its robustness, it might possibly be put to various other uses which follow.- Email Spam Detection - This can be used in a application which can feed back into a email spam classifier. For every email which has a link in it, a system which uses a heuristic such as the one specified by us can assign a quality metric to the link in the email. Depending on the quality of links in the mail, the mail can be classified as spam / non spam.- Personalized ratings for web pages - This can be used in an application where some knowledge can be gained from the pages that a user likes / dislikes pages once the system has enough data, it can associate ratings automatically for pages that the user browses. The data for the application depends on the user manually classifying a set of pages as spam or ham. 
A personalized search could also be built on this idea: an intermediate analysis layer sits between the normal search results and the user and reorders the results so that those most likely to appeal to the user appear on top.

- Other classification tasks: The technique can be extended to classes beyond spam and ham. For example, detecting whether a page is safe for kids is another problem that could be tackled this way.

Related Work

This project follows the efforts of Fetterly et al. [1] and Drost et al. [2]. In [1], the authors investigated a set of metrics pertaining to web pages and showed that, for those metrics, spam pages follow a definite pattern that is distinct from pages that can be termed ham. They showed that it is possible to detect clear outliers in the plots of these metrics, and that these outliers can be labeled as spam. As metrics, they used features such as hostname length, host-to-machine ratios, out-links from a page, and in-links. They concluded that the graph structure of the web is not the only criterion for classifying pages; simpler features like these can also be used to build a heuristic ranking system. In [2], the authors built a simple system to classify link spam. We attempt to build a more robust system that uses a larger set of features from the web pages and classifies them into spam and ham.

Machine learning has also been used to improve search results by filtering nepotistic links [4], and work on classifying web sites into different categories has been done in [5]. Similar work exists in the field of classifying emails into these categories: tokens are read from the mail, and the mail headers are used as additional features, from which a support vector machine based classifier can be built to segregate messages into distinct categories.
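The token-based SVM approach described above can be sketched as follows. This is an illustrative sketch using scikit-learn rather than any implementation from the cited work, and the tiny inline corpus is hypothetical.

```python
# Sketch of a token-based spam/ham classifier: tokenize each message into
# a bag-of-words vector, then train a linear support vector machine.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical pre-classified training messages.
texts = [
    "cheap pills buy now limited offer",
    "win money fast click here",
    "meeting notes attached for the project review",
    "lecture schedule for the machine learning course",
]
labels = ["spam", "spam", "ham", "ham"]

# Turn each message into token-frequency features.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# A linear-kernel SVM separates the two classes in feature space.
clf = LinearSVC()
clf.fit(X, labels)

# Classify an unseen message.
print(clf.predict(vectorizer.transform(["buy cheap pills now"]))[0])
```

In a real deployment the corpus would be the large pre-classified mail database discussed below, and header-derived features would be appended to the token vectors.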
In email spam classification, black lists and white lists of servers, as well as the frequency of mails from a particular IP, are also used to flag spam. The main problem with such a learning system is that a reasonably large database of previously classified mails is required to build the first model. We define a simple technique for obtaining such a database, define a set of metrics that might be useful, and propose building such a system for web pages instead of emails.

Proposed Work

To start, we need a set of pages to use as a training set for our classifier. For pre-classified spam pages, we use an email spam archive maintained at Georgia Tech: we extract links from these spam emails and assume that they point to low-quality pages in terms of web rank as well. For pre-classified ham pages, we use a list of URLs rated highly by James Caverlee in his Source Rank research. Once we have this list of URLs, we intend to crawl them and save the HTML page corresponding to each.

We have started with a broad list of metrics that would feed as features from these pages into the classifier. These features fall into the following categories:

- Meta data on the page
- Tags on the web page
- Content in the web page
- Whois information of the domain
- Domain information
- In-links to the page

Currently we have about 50 features, not counting the text associated with the page. Our first step will be to extract all of these metrics from the set of pages we have crawled. We will then experimentally determine which metrics best distinguish spam pages from ham pages.
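As a sketch of the extraction step, a few of the simpler page metrics (hostname length, out-link count, word count) can be pulled from a crawled HTML page with the Python standard library. The feature names and the sample page here are illustrative, not the project's actual ~50-feature set.

```python
# Sketch: extract a few simple features from a crawled HTML page, in the
# spirit of the metric categories listed above.
from html.parser import HTMLParser
from urllib.parse import urlparse

class FeatureExtractor(HTMLParser):
    """Counts out-links and visible words while parsing a page."""
    def __init__(self):
        super().__init__()
        self.out_links = 0
        self.words = 0
    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(k == "href" for k, _ in attrs):
            self.out_links += 1
    def handle_data(self, data):
        self.words += len(data.split())

def page_features(url, html):
    parser = FeatureExtractor()
    parser.feed(html)
    host = urlparse(url).hostname or ""
    return {
        "hostname_length": len(host),   # a metric studied by Fetterly et al.
        "out_links": parser.out_links,  # links leaving the page
        "word_count": parser.words,     # raw content volume
    }

html = '<html><body><a href="http://x.example">buy now</a> cheap offer</body></html>'
print(page_features("http://spam-site.example/page", html))
```

Each crawled page would yield one such feature dictionary, and the dictionaries become rows of the training matrix.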
We will treat this set of features as the important ones in our classification phase. Once the main set of features is obtained, we feed the data into a custom classifier written using WEKA [3]. WEKA is a collection of open-source machine learning algorithms for data mining tasks; it contains tools for data pre-processing, classification, regression, clustering, and association rules. We intend to use different classification techniques to build a model around the existing data. Naive Bayes, and support vector machines with hard and soft bounds and with varying kernels, are among the learning techniques whose behavior we intend to study on this data. At this point, new data can be fed into the system
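The planned comparison of classifiers can be sketched as follows; this illustration uses scikit-learn rather than WEKA, and synthetic data stands in for the real page-feature matrix.

```python
# Sketch: compare several classifiers on one feature matrix via
# cross-validation, analogous to the WEKA experiments proposed above.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic two-class data as a stand-in for spam/ham feature vectors.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

models = {
    "naive_bayes": GaussianNB(),
    "svm_linear": SVC(kernel="linear", C=1.0),  # stiffer margin
    "svm_rbf": SVC(kernel="rbf", C=0.1),        # softer margin, RBF kernel
}

scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

The model with the best cross-validated accuracy on the real feature matrix would then be the one used to score new pages.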

