CS224N: Predicting Category Tags by Using N-gram Models

HeeTae Jung, Jonathan Hernandez, YongRim Rhee

June 3, 2009

1 Introduction

The blogosphere is a highly populated medium in which people communicate and share their ideas. Amidst the recent data explosion generated by bloggers, organization of these blogs has become an important task. Accordingly, a number of studies, e.g. [1, 2, 3], have been conducted to predict or suggest category tags for blog entries, since tagging allows ranking and data organization to directly utilize input from end users. We implemented a category suggestion system using N-gram models. We also explore how category expanding and category clustering can affect the performance of the system. Our experimentation reveals that stochastic language models can be used effectively to predict categories.

2 Data

We used the ICWSM weblog data for our experiments. The ICWSM data is categorized into 14 different tier groups based on a ranking measured by an algorithm that originated from TailRank. In the dataset, a blog entry is categorized into a higher tier group when it is measured to be more influential than other blog entries. Entries are categorized into 13 tier groups based on the ranking, and all the data that are not ranked are placed in the non-tier group.

The ICWSM data covers August and September 2008 and is over 30GB compressed. Looking through the data, we noticed that many of the entries were not in English and that there were entries from Craigslist labeled as "classifieds." Many blog entries did not have any categories assigned to them. We decided to focus on entries in English that are actual blog entries, as opposed to classifieds, and that have at least one category assigned to them. However, even limited to this set, the data was still too much to process.

To limit our data, we decided to restrict our results to sites hosted on Wordpress.com, one of the major blog hosting websites. Choosing a major blog hosting site also solved another issue: the data contained a lot of very short entries from MySpace, and limiting it to Wordpress removes these. As Wordpress is one of the larger blog hosting sites, even this restriction left us with too much data, so we further limited our data to tiers 1-3 and to blogs written on and between September 12th and September 18th. One of the reasons we chose these dates is that this was the period when Lehman Brothers filed for bankruptcy (http://en.wikipedia.org/wiki/2008#September). Using these restrictions reduced the number of blog entries to around 20,000.

In our runs, we had five sets of data: one set was all 20,000 entries for training and 2,000 for testing. The other sets were subsets of this first set and contained only those blog entries which had a popular category. Popular categories are determined based on their number of occurrences in our data set.

data set          train data size    test data size
whole data        20,000             2,000
threshold = 50     9,000             1,000
threshold = 100    7,000             1,000
threshold = 150    6,000             1,000
threshold = 200    4,000             1,000

For example, when the threshold was set to 200, only the blog entries that had categories appearing 200 or more times were selected. That left us 5,000 blog entries in total, and we used 4,000 entries for training and 1,000 entries for testing.
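The thresholding step can be made concrete with a minimal Python sketch. It assumes entries are already parsed from the corpus into (text, categories) pairs and that an entry is kept when at least one of its categories meets the threshold; the function and variable names are ours, not the paper's.

from collections import Counter

def filter_by_threshold(entries, threshold):
    # entries: list of (text, categories) pairs parsed from the corpus
    counts = Counter(c for _, cats in entries for c in cats)
    popular = {c for c, n in counts.items() if n >= threshold}
    # keep an entry if any of its categories is popular enough
    return [(text, cats) for text, cats in entries
            if any(c in popular for c in cats)]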
3 Our Approach

3.1 N-gram Models

The Wordpress ICWSM blog corpus contains, in its XML format, a category tag representing the phrases the authors chose to tag their entries with. Using this tag and the sentences in the entry, we create and train three models for extracting categories. First, in our baseline model, we score a word w for a category c as

score(w, c) = (number of times word w is seen in category c) / (total number of words seen in category c).

We then select the categories with the highest scores as suggestions. Though simple and effective, this approach does not take into account that each category could have a distinct language model. The intuition is that certain words may appear more often under certain categories. To build a language model for each of the categories, we build unigram, bigram, and trigram language models for each category during training. We further implemented an interpolated model over all three N-gram models. However, due to the limited computing resources available, we experiment only with the surface likelihood model and the unigram and bigram models. To deal with unseen words and the vocabulary sparsity introduced by the bigram model, we employ absolute discounting to smooth the N-gram language models.

To evaluate the performance of our implementation, we pick the six highest-scoring categories to compare against the testing data. Looking at the data, we saw that, on average, each blog entry has about four categories, and we decided to let our algorithm assign up to 1.5 times the average, which is six.
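As one concrete reading of the per-category models, the sketch below trains a bigram model per category with absolute discounting, backing off to that category's unigram distribution. The discount of 0.75 and the add-one floor on the unigram distribution are our assumptions; the paper does not specify either.

import math
from collections import Counter, defaultdict

class CategoryBigramModel:
    def __init__(self, d=0.75):
        self.d = d                          # absolute discount (assumed value)
        self.uni = Counter()                # unigram counts
        self.bi = Counter()                 # bigram counts
        self.followers = defaultdict(set)   # distinct continuations per history
        self.total = 0

    def train(self, sentences):
        for sent in sentences:              # sent: list of tokens
            toks = ["<s>"] + sent + ["</s>"]
            self.uni.update(toks)
            for w1, w2 in zip(toks, toks[1:]):
                self.bi[(w1, w2)] += 1
                self.followers[w1].add(w2)
        self.total = sum(self.uni.values())

    def p_uni(self, w):
        # add-one floor keeps unseen words at non-zero probability (our choice)
        return (self.uni[w] + 1) / (self.total + len(self.uni) + 1)

    def p_bi(self, w1, w2):
        c1 = self.uni[w1]
        if c1 == 0:                         # unseen history: back off entirely
            return self.p_uni(w2)
        disc = max(self.bi[(w1, w2)] - self.d, 0.0) / c1
        lam = self.d * len(self.followers[w1]) / c1
        return disc + lam * self.p_uni(w2)

    def log_prob(self, sent):
        toks = ["<s>"] + sent + ["</s>"]
        return sum(math.log(self.p_bi(w1, w2))
                   for w1, w2 in zip(toks, toks[1:]))

With one trained model per category, suggesting six tags then amounts to ranking categories by the log-probability their model assigns to the entry (a simplification on our part: this ignores category priors, which the paper does not discuss):

def suggest(models, sent, k=6):
    # models: dict mapping category name -> trained CategoryBigramModel
    ranked = sorted(models, key=lambda c: models[c].log_prob(sent), reverse=True)
    return ranked[:k]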
3.2 Category Expanding

Bloggers assign a number of category tags when they write blog entries. We examined the number of category tags per blog entry, and it turned out that each blog entry was assigned 3.9 category tags on average. Such a group of categories, usually called co-occurring categories, is assumed to jointly represent the content of the blog entry. Ciro, in [4], statistically studied the close relationship among category tags of the same blog entry and revealed that there is a non-trivial relationship among them. Co-occurring tags have also been used in tag clustering [5] and tag visualization systems [6]. In our experiment, we explored whether this co-occurrence information can be used directly to improve the performance of category tag prediction. Basically, we counted how often each category appears with other categories in the same blog posting using the algorithm below, and the resulting map is used directly to expand the list of predicted categories.

for each blog b
    for all possible pairs of categories (c1, c2) of b
        increment count(c1, c2) by 1
    end for
end for

First, our N-gram model predicts the list of six categories with the highest probabilities. Given this list, it is expanded by referring to the map built by the algorithm above. The list of categories can be expanded either by adding the most closely related category, i.e. the category that co-occurred most often with the given category tag, or by adding all related categories, i.e. all the category tags that co-occurred with the given category tag.
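The counting loop and the two expansion strategies can be sketched in Python as follows; the data layout (one list of category tags per entry) and all names here are our assumptions.

from collections import defaultdict, Counter
from itertools import combinations

def build_cooccurrence_map(tag_lists):
    # tag_lists: one list of category tags per blog entry
    cooc = defaultdict(Counter)
    for cats in tag_lists:
        for c1, c2 in combinations(set(cats), 2):
            cooc[c1][c2] += 1
            cooc[c2][c1] += 1
    return cooc

def expand(predicted, cooc, all_related=False):
    # predicted: the six categories suggested by the N-gram model
    expanded = list(predicted)
    for c in predicted:
        neighbors = cooc.get(c)
        if not neighbors:
            continue
        if all_related:
            expanded.extend(neighbors)                       # every co-occurring tag
        else:
            expanded.append(neighbors.most_common(1)[0][0])  # strongest co-occurrence only
    return list(dict.fromkeys(expanded))                     # deduplicate, preserve order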
3.3 Category Clustering