Stanford CS 224N - Detecting Real-Time Messages of Public Interest in Tweets


Detecting Real-Time Messages of Public Interest in Tweets
Divij Gupta, Chanh Nguyen {divijg, chanhn}@cs.stanford.edu
4th June, 2010

1 Introduction

A user base of over 80 million has positioned Twitter as one of the largest micro-blogging platforms in the world, providing a wealth of user-generated content in the form of short 140-character messages known as tweets. With users providing information on almost every aspect of their lives and surroundings through these location-tagged tweets, an interesting proposition is to separate out the information that might be useful to an individual or enterprise. To do so, we want to detect whether a tweet was intended as a public announcement, rather than just a status update or an opinion. For example, given the tweet "Live in Birmingham, this Friday 4th June with The Destroyers - tickets here: http://lnk.ms/9NWV9 http://lnk.ms/9NWV9", we would like to be able to separate it from a barrage of others of the form "I can now add 'freddy' to my extensive list of wrong names i've been called", which are essentially useless for our purposes.

The inspiration for the project is the thought that there is no better place to look for event promotions or updates than a site on which users communicate in real time. Consider a potential application: a user in a new city, say Berlin, looking for information about the best deals on sightseeing tours during that part of the year. Instead of manually searching multiple sites via Google, which may have outdated information, our application could leverage Twitter's public API to search for tweets about deals on sightseeing tours in Berlin, giving the user far more real-time information and a better experience by automating the process.
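To make the distinction between the two example tweets concrete, a crude surface-level heuristic already separates them: broadcast-style tweets tend to carry a URL plus event-flavored vocabulary. The keyword list and rule below are made up for illustration; they are not the features the system actually uses.

```python
import re

# Hypothetical cue words, chosen only for this illustration.
ANNOUNCE_CUES = {"live", "tickets", "today", "tomorrow", "free", "sale"}

def looks_like_announcement(tweet: str) -> bool:
    """Crude heuristic: a URL plus an event-style keyword suggests a broadcast."""
    has_url = bool(re.search(r"https?://\S+", tweet))
    has_cue = any(w in tweet.lower().split() for w in ANNOUNCE_CUES)
    return has_url and has_cue

print(looks_like_announcement(
    "Live in Birmingham, this Friday 4th June with The Destroyers - "
    "tickets here: http://lnk.ms/9NWV9"))  # broadcast-style: True
print(looks_like_announcement(
    "I can now add 'freddy' to my extensive list of wrong names "
    "i've been called"))  # status update: False
```

A rule this simple obviously misses announcements without links and fires on spam, which is why the project turns to learned classifiers over richer NLP features instead.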
One could extend this idea to any type of event one would like to search for, anywhere in the world, whether it has to do with food, music, movies, health, spirituality, etc., as long as people are tweeting about it.

A potential application would then need to constantly analyze a stream of incoming tweets and separate the ones which appear to be event broadcasts from those that are not. Keeping in mind the nature of the problem, we employ a number of machine-learning techniques to identify such tweets, based on features extracted via natural language processing principles as well as the characteristics of publicly-aimed tweets. With positive instances being rare relative to negative instances, we aimed for a high negative accuracy rate, so that most negatives would be rejected, along with a respectable positive accuracy rate, since some tweets turn out to be ambiguous as to whether they were intended as announcements, which might lead to them being rejected by the system. For example, "Back from study!!! Later I want my Macgriddle with Egg meal... Anyone want to join me?" might be an invitation to only the user's group of friends, but could also be an open invitation to anybody who happens to be in the same area.

Although some work has been done in this area (University of Tokyo - Semantic Twitter: Analyzing Tweets for Real-Time Event Notification by Makoto Okazaki and Yutaka Matsuo; the diagram shown below is based on one of their slides), their application queries Twitter for certain search terms, which makes it easier to detect events; ours is based more on a general streaming system, such that in the optimal case we can detect any possible public announcement.

Figure 1. Possible architecture for a future application

2 Resources

Our primary data source was a set of tweets taken from the website of the Stanford class on data mining (cs345a). The dataset consists of about 25GB of tweets from June to December of 2009.
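The two evaluation figures described in the Introduction are per-class accuracies: the positive accuracy rate is the fraction of true positives among actual positives, and the negative accuracy rate is the fraction of true negatives among actual negatives. A minimal sketch (with made-up toy labels, not results from this report):

```python
def per_class_accuracy(labels, preds):
    """Positive accuracy = TP / P; negative accuracy = TN / N.
    Labels and predictions are 1 (announcement) or 0 (not)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    pos = sum(1 for y in labels if y == 1)
    neg = sum(1 for y in labels if y == 0)
    return tp / pos, tn / neg

# Toy illustration: 2 positives, 6 negatives, imperfect predictions.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
preds  = [1, 0, 0, 0, 0, 0, 0, 1]
print(per_class_accuracy(labels, preds))  # (0.5, 0.8333...)
```

Tracking the two rates separately matters here precisely because the classes are so imbalanced: a classifier that rejects everything scores a perfect negative accuracy while being useless.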
Since no tagged dataset was available, we had to manually construct our training and test sets, which took a long time, especially since the positive instances we encountered were very few. We specifically refrained from manually searching Twitter for positive instances, in order to simulate the streaming of data and to avoid biasing our training or test sets.

For every tweet we read, we took one of three possible actions. If the tweet looked like it was meant to reach as many people as possible, such as the description of an event or a question asking for public opinion, we marked it as positive. If not, we marked it as negative. If the tweet was written in a foreign language, was unreadable, or was too ambiguous in terms of intended audience, we excluded it from our training and test sets.

We constructed a training set with 545 positive instances and 1457 negative instances. We went through the first 2700 tweets of the December 2009 data set and classified each tweet in one of the above three ways. The total does not add up to 2700 because at one point we began to skip everything but the positive examples, since they were so rare. In order to find more publicly-aimed tweets, we connected to Twitter with the JTwitter API and searched for tweets using terms such as "sale", "today", "tomorrow", "free", and "anyone." Such searches did not return only positive results: only about 40% we classified as positive, 40% we classified as negative, and the rest we discarded. Adding an equal number of positive and negative examples to the training set helps to prevent bias toward the particular search term.

For the test set we did not include any results obtained by performing a specific search through the Twitter API, because we wanted to test the system's ability to classify every message coming through Twitter.
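The three-way manual labelling pass above can be sketched as a simple loop; `annotate` stands in for the human judgment (here stubbed with a lookup table), and the function and label names are hypothetical, not from the report's code.

```python
# Three possible annotator actions, as described in the text.
POSITIVE, NEGATIVE, SKIP = "pos", "neg", "skip"

def build_sets(tweets, annotate):
    """Split tweets into positive and negative pools; skipped tweets
    (foreign-language, unreadable, or ambiguous audience) are dropped."""
    positives, negatives = [], []
    for t in tweets:
        label = annotate(t)
        if label == POSITIVE:
            positives.append(t)
        elif label == NEGATIVE:
            negatives.append(t)
        # SKIP falls through: the tweet enters neither set.
    return positives, negatives

# Toy usage with a stub annotator backed by a lookup table.
gold = {"t1": POSITIVE, "t2": NEGATIVE, "t3": SKIP}
print(build_sets(["t1", "t2", "t3"], gold.get))  # (['t1'], ['t2'])
```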
Thus, we went through a different data set, from June 2009, started from the beginning, and classified the first 1400 tweets, resulting in a test set of 112 positive and 983 negative tweets.

We employed four classification techniques: the first being Support Vector Machines (SVMs), for which we used the open-source LIBSVM package provided by Chih-Chung Chang and Chih-Jen Lin of National Taiwan University; the second being Logistic Regression; the third being Naive Bayes (in two variations); and the last being Boosted Decision Trees. The code for Logistic Regression and Naive Bayes was taken from the versions written for CS109 and adapted to the current problem. In order to construct the feature vector we made use of WordNet from Princeton University's site as well as the Stanford Part of Speech Tagger provided by the Stanford NLP group. We used an open-source Java implementation of
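Of the four classifiers named above, Naive Bayes is the simplest to show end to end. The following is a generic multinomial Naive Bayes over bag-of-words tokens with Laplace smoothing, written as a minimal sketch; it is not the CS109-adapted implementation the project actually used, and its toy training data is made up.

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes for binary labels 0/1 over token lists."""

    def __init__(self):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.class_counts = Counter()

    def train(self, tweets, labels):
        for tokens, y in zip(tweets, labels):
            self.class_counts[y] += 1
            self.word_counts[y].update(tokens)

    def predict(self, tokens):
        vocab = set(self.word_counts[0]) | set(self.word_counts[1])
        best, best_lp = None, float("-inf")
        for y in (0, 1):
            total = sum(self.word_counts[y].values())
            lp = math.log(self.class_counts[y] / sum(self.class_counts.values()))
            for w in tokens:
                # Laplace smoothing so unseen words do not zero out a class.
                lp += math.log((self.word_counts[y][w] + 1) / (total + len(vocab)))
            if lp > best_lp:
                best, best_lp = y, lp
        return best

# Toy usage: one announcement-style and one status-style token list.
nb = NaiveBayes()
nb.train([["free", "sale", "today"], ["my", "day", "was", "bad"]], [1, 0])
print(nb.predict(["free", "today"]))  # 1
print(nb.predict(["my", "day"]))      # 0
```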

