
becker35-icwsm2011

Beyond Trending Topics: Real-World Event Identification on Twitter

Hila Becker, Columbia [email protected]
Naaman, Rutgers [email protected]
Gravano, Columbia [email protected]

messages on social media sites such as Twitter have emerged as powerful, real-time means of information sharing on the Web. These short messages tend to reflect a variety of events in real time, making Twitter particularly well suited as a source of real-time event content. In this paper, we explore approaches for analyzing the stream of Twitter messages to distinguish between messages about real-world events and non-event messages. Our approach relies on a rich family of aggregate statistics of topically similar message clusters. Large-scale experiments over millions of Twitter messages show the effectiveness of our approach for surfacing real-world event content on Twitter.

1 Introduction

Social media sites (e.g., Twitter, Facebook, and YouTube) have emerged as powerful means of communication for people looking to share and exchange information on a wide variety of real-world events. These events range from popular, widely known ones (e.g., a concert by a popular music band) to smaller scale, local events (e.g., a local social gathering, a protest, or an accident). Short messages posted on social media sites such as Twitter can typically reflect these events as they happen. For this reason, the content of such social media sites is particularly useful for real-time identification of real-world events and their associated user-contributed messages, which is the problem that we address in this paper.

Twitter messages reflect useful event information for a variety of events of different types and scale. These event messages can provide a set of unique perspectives, regardless of the event type (Diakopoulos, Naaman, and Kivran-Swaine 2010; Yardi and boyd 2010), reflecting the points of view of users who are interested or participate in an event.
In particular, for unplanned events (e.g., the Iran election protests, earthquakes), Twitter users sometimes spread news prior to the traditional news media (Kwak et al. 2010; Sakaki, Okazaki, and Matsuo 2010). Even for planned events (e.g., the 2010 Apple Developers conference), Twitter users often post messages in anticipation of the event.

Identifying events in real time on Twitter is a challenging problem, due to the heterogeneity and immense scale of the data. Twitter users post messages with a variety of content types, including personal updates and various bits of information (Naaman, Boase, and Lai 2010). While much of the content on Twitter is not related to any particular real-world event, informative event messages nevertheless abound. As an additional challenge, Twitter messages, by design, contain little textual information, and often exhibit low quality (e.g., with typos and ungrammatical sentences).

Copyright © 2011, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Several research efforts have focused on identifying events in social media in general, and on Twitter in particular (Becker, Naaman, and Gravano 2010; Sakaki, Okazaki, and Matsuo 2010; Sankaranarayanan et al. 2009). Recent work on Twitter has started to process data as a stream, as it is produced, but has mainly focused on identifying events of a particular type (e.g., news events (Sankaranarayanan et al. 2009), earthquakes (Sakaki, Okazaki, and Matsuo 2010)). Other work identifies the first Twitter message associated with an event (Petrović, Osborne, and Lavrenko 2010).

Our focus in this work is on online identification of real-world event content. We identify each event, and its associated Twitter messages, using an online clustering technique that groups together topically similar tweets (Section 3.1). We then compute revealing features for each cluster to help determine which clusters correspond to events (Section 3.2).
We use these features to train a classifier to distinguish between event and non-event clusters (Section 3.3). We validate the effectiveness of our techniques using a dataset of over 2.6 million Twitter messages (Section 4) and then discuss our findings and future work (Section 5).

2 Background and Problem Definition

In this section, we provide an overview of Twitter and then define the problem that we address in this paper.

2.1 Background: Twitter

Twitter is a popular social media site that allows users to post short textual messages, or tweets, which are up to 140 characters long. Twitter users can use a hashtag annotation format (e.g., #sb45) to indicate what their messages are about (e.g., “watching Superbowl 45 #sb45”). In addition, Twitter allows several ways for users to converse and interact by referencing each other in messages using the @ symbol. Twitter currently employs a proprietary algorithm to display trending topics, consisting of terms and phrases that exhibit “trending” behavior. While Twitter’s trending topics sometimes reflect current events (e.g., “world cup”), they often include keywords for popular conversation topics (e.g., “#bieberfever,” “getting ready”), with no discrimination between the different types of content.

2.2 Problem Definition

We now define the notion of real-world event in the context of a Twitter message stream, and provide a definition of the problem that we address in this paper.

The definition of event has received attention across fields, from philosophy (Events 2002) to cognitive psychology (Zacks and Tversky 2001). In information retrieval, the concept of event has prominently been studied for event detection in news (Allan 2002). We borrow from this research to define an event in the context of our work.
Specifically, we define an event as a real-world occurrence $e$ with (1) an associated time period $T_e$ and (2) a time-ordered stream of Twitter messages $M_e$, of substantial volume, discussing the occurrence and published during time $T_e$.

According to this definition, events on Twitter include widely known occurrences such as the presidential inauguration, and also local or community-specific events such as a high-school homecoming game or the ICWSM conference. Non-event content, of course, is prominent on Twitter and similar systems where people share various types of content such as personal updates, random thoughts and musings, opinions, and information (Naaman, Boase, and Lai 2010). As a challenge, non-event content also includes forms of
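
To make the event definition in Section 2.2 concrete, the following is a minimal sketch, not the method developed in this paper. It assumes a hypothetical `Tweet` record, a caller-supplied predicate `discusses` that decides whether a message is about the occurrence, and a hypothetical threshold `min_volume` standing in for "substantial volume"; the definition itself does not fix how either of the last two is determined. Given a time period $T_e$, the function returns the time-ordered stream $M_e$, or `None` when too few messages qualify.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Tweet:
    """Hypothetical message record: a timestamp and the message text."""
    posted_at: datetime
    text: str


def event_stream(
    tweets: List[Tweet],
    period: Tuple[datetime, datetime],
    discusses: Callable[[Tweet], bool],
    min_volume: int,
) -> Optional[List[Tweet]]:
    """Return the message stream M_e for time period T_e, or None.

    `discusses` and `min_volume` are assumptions made for illustration:
    the definition requires messages "discussing the occurrence" and
    "of substantial volume" without saying how either is decided.
    """
    start, end = period
    # Keep messages published during T_e that discuss the occurrence,
    # ordered by publication time.
    stream = sorted(
        (t for t in tweets if start <= t.posted_at <= end and discusses(t)),
        key=lambda t: t.posted_at,
    )
    # Too few messages: under this sketch, no event is reported.
    return stream if len(stream) >= min_volume else None
```

For example, calling `event_stream` with a predicate such as `lambda t: "#sb45" in t.text` and a period covering one evening would keep only the hashtagged messages posted in that window, in time order. A hashtag test is only a stand-in here; deciding which messages discuss the same occurrence is exactly what topical clustering is meant to address.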

