James Mao
Spring 09-10
CS 224N - Natural Language Processing
Final Project

Modification Identification in Recipe Comments

1 Abstract

Recipes are commonly modified for health, taste, and ingredient availability reasons. A summary of the modifications made by other users of a recipe is useful in determining whether a proposed modification will be successful. We present a system for identifying modifications in comments posted on recipe websites. As a secondary goal, our system design process demonstrates how human understanding and machine learning techniques can be combined to rapidly construct a practical application with a minimum of annotated data.

2 Introduction

2.1 Background

There is a proliferation of recipe websites and data available on the internet. While tools exist to assist in the discovery of the best recipes (e.g., recommendation systems based on user ratings and history, a la Netflix), there is no quick way of determining whether a particular recipe will be suitable. Typically, users are forced to skim through a large number of comments, easily exceeding hundreds on the most popular recipes, in order to determine whether they can successfully make a particular substitution (e.g., replacing an ingredient not on hand with one that is available). This is essentially a sentiment classification task ("I think this recipe should be modified"). In this domain, we are helped by the domain-specific jargon [1] but hurt by the subtlety of sentiments [2]. We will attempt to construct a system that collects the domain-specific factors through unsupervised learning and uses a small human-annotated data set to learn the semantics of various sentiments.

2.2 Design Process

Our main design methodology was to quickly build a commercially viable system (i.e., one that can be plugged into an existing website immediately). We do this through a combination of code reuse and smart pipeline development. We try to avoid custom code whenever possible, opting for well-supported libraries instead. Our processing pipeline is designed to be fast and robust, but more importantly, the process used to design the pipeline is meant to be intuitive. That is, we try to demonstrate a method for developing other such classification tools quickly. This was achieved through a combination of supervised machine learning on small datasets and human reasoning. In the end, we hope to have gained insights into the problem rather than just a set of coefficients. As a result, we prefer methods like decision trees and feature-based classifiers over classifiers based on principal component analysis.

3 Data

3.1 Sources

Most recipe websites provide some sort of commenting facility and can be used as the input data. We chose the highest rated recipes of the most popular chefs on the Food Network website (http://www.foodnetwork.com).

3.2 Collection, Extraction, and Annotation

Since our goal is to leverage common libraries (which may be available in different programming languages), we used an intermediate YAML representation for the data. This allows us to decouple the data collection, extraction, and annotation tasks from the classification task.

We created a Ruby-based crawler that extracts a list of the most popular chefs and the raw markup for each of their top recipe pages. We leveraged the mechanize gem (a port of WWW::Mechanize for Ruby) to perform the core crawling task (e.g., following "Next" links to gather recipe listings that are paginated).
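Because the stages exchange data through this intermediate YAML representation, any later stage can be written in whichever language has the best library support. The snippet below is a minimal sketch of how a downstream Python stage might consume one per-recipe YAML file; the key names (title, ingredients, comments, rating, body) mirror the fields the extractor is described as storing, but the exact schema is an assumption, as is the example file name.

```python
import yaml  # PyYAML

def load_recipe(path):
    """Load one per-recipe YAML file written by the Ruby crawler/extractor.

    The key layout used here is assumed; the report only says that the
    title, ingredients, and procedure of each recipe plus the rating,
    title, and body of each comment are stored per recipe.
    """
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

def iter_comment_bodies(recipe):
    """Yield the free-text comment bodies that later feed sentence splitting."""
    for comment in recipe.get("comments", []):
        yield comment["body"]

# Example usage (the file name is hypothetical):
# recipe = load_recipe("recipes/roast-chicken.yaml")
# for body in iter_comment_bodies(recipe):
#     print(body)
```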
Once we have the HTML, we use the Nokogiri gem (an HTML parser with the ability to search using XPath) as the core of our post-processor. We extract the title, ingredients, and procedure for each recipe, and the rating, title, and body for each comment on a recipe, using hand-crafted XPaths. The advantage of an XPath-based extractor, which operates on a parsed tree, over one that uses regular expressions is ease of change. That is, we can quickly adapt our crawler and extractor to changes in the source website or to different websites. Once the data is extracted, it is stored in YAML files by recipe.

The crawler and post-processor are built on top of Resque (a Redis-backed message queue). This allows us to parallelize the operations for performance and fault tolerance (e.g., timeouts may stall a particular fetch, but the overall performance hit due to timeouts is negligible when the fetches are parallelized).

Before annotating, we need to split the comments into sentences. At first we tried our custom regular expression based tokenizer, but found that the Punkt sentence tokenizer [3] works much better (this also fits with our philosophy of using well-maintained libraries whenever possible). There is an implementation of this tokenizer in NLTK (the Natural Language Toolkit for Python). In order to leverage that implementation, we switch to Python for the remainder of our processing flow. We created a console annotation program to annotate each sentence of selected comments after tokenizing.

4 Processing Pipeline

After the annotation process, we extract a variety of features and feed them to a maximum entropy classifier (described in section 7). The complete processing pipeline is illustrated in figure 1.

Figure 1: Our processing pipeline.

5 Testing Methodology

As mentioned in our design process, our goal is to build a useful system using a minimum of human-annotated data. For our data set, we tagged a couple of recipes with over 100 comments, a few recipes with tens of comments, and about a dozen recipes with fewer than 5 comments. From what we observed, this is approximately the same distribution as the entire recipe library we constructed. In particular, we add the recipes with few comments as a way to demonstrate the generality of our system. If we only included the recipes with many comments, our classifier would be rewarded for overfitting. The smaller recipes use a different vocabulary, so including them generally reduces performance, but we wanted to create a scenario representative of the real world.

Because of the small data set, system performance can be heavily dependent on the selection of training and test data. To avoid any semblance of bias, we use repeated random sub-sampling cross validation. Our data set contains approximately 1100 sentences. We randomly choose 250 of these as the test set and use the rest as the training set.

6 Baseline Naive Bayes Classifier

For comparison purposes, we use a naive Bayes classifier that uses the words of a sentence as features.
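To make the evaluation protocol concrete, the sketch below pairs the word-based naive Bayes baseline of section 6 with the repeated random sub-sampling of section 5, using NLTK's classifier API (NLTK is already a dependency via the Punkt tokenizer). The number of repetitions, the annotation loader, and the (sentence, label) pair format are assumptions; the report does not specify them.

```python
import random
import nltk

# nltk.word_tokenize needs the Punkt models; download them once if missing:
# nltk.download("punkt")

def bag_of_words(sentence):
    """Binary word-presence features for one sentence (the section 6 baseline)."""
    return {word.lower(): True for word in nltk.word_tokenize(sentence)}

def random_subsample_eval(labeled_sentences, n_trials=10, test_size=250):
    """Repeated random sub-sampling cross validation (section 5).

    labeled_sentences is a list of (sentence_text, label) pairs, e.g. as
    produced by the console annotation tool; the pair format and the
    number of trials are assumptions.
    """
    featuresets = [(bag_of_words(text), label) for text, label in labeled_sentences]
    accuracies = []
    for _ in range(n_trials):
        random.shuffle(featuresets)
        test_set, train_set = featuresets[:test_size], featuresets[test_size:]
        classifier = nltk.NaiveBayesClassifier.train(train_set)
        accuracies.append(nltk.classify.accuracy(classifier, test_set))
    return sum(accuracies) / len(accuracies)

# Example usage with a hypothetical loader for the ~1100 annotated sentences:
# print(random_subsample_eval(load_annotated_sentences("annotations.yaml")))
```

Averaging accuracy over several random splits smooths out the split-to-split variance that a single 250-sentence test set would otherwise introduce.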

