Stanford CS 224 - Study Notes

CS224N: Investigating SMS Text Normalization using Statistical Machine Translation

Karthik Raghunathan, Stefan Krawczyk
Department of Computer Science, Stanford University
[rkarthik,stefank]@cs.stanford.edu

Abstract

In this project we explore two approaches to SMS text normalization. First we try the dictionary substitution approach used by most websites that provide such a service, and then modify it with our own extension. This is followed by a statistical machine translation (MT) approach using off-the-shelf MT tools. We evaluate the performance of our system on three test sets from different sources and discuss the shortcomings of our system and results.

Index Terms: SMS, English, machine translation, multiple domains

1. Statement

Stefan wrote tools to help with the data preparation and performed the dictionary substitution evaluations. We both more or less equally shared the task of annotating the data. Karthik took care of setting up the MT pipeline, because he had access to a system, and ran the machine translation scripts. We both analyzed the data and equally shared the writing of the report.

2. Introduction

There are many sites [1, 2, 3] on the internet that offer SMS English to English translation services. However, the technology behind these sites is simple: straight dictionary substitution, with no language model or any other mechanism to disambiguate between possible word substitutions.

A reasonable alternative is a noisy channel approach to modeling the translation from SMS English to English. The English signal is sent across a noisy channel, as an SMS, and we then try to recover it using a language model and a translation model. This should give us the ability to disambiguate between ambiguous expansions of an SMS message. For example, in "do u noe how 2?" we need to disambiguate what each of the tokens means, specifically the "2": it could map to "to", "two" or "too".
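The noisy channel intuition above can be sketched as follows. This is a toy illustration, not our system: the candidate dictionary and bigram counts are invented for the "do u noe how 2?" example, standing in for a real translation model and language model.

```python
import itertools

# Hypothetical expansion dictionary: each SMS token maps to candidate
# English words; an ambiguous token like "2" has several candidates.
CANDIDATES = {
    "do": ["do"], "u": ["you"], "noe": ["know"],
    "how": ["how"], "2": ["to", "two", "too"],
}

# Toy bigram counts standing in for a real n-gram language model.
BIGRAMS = {
    ("do", "you"): 90, ("you", "know"): 70,
    ("know", "how"): 50, ("how", "to"): 80,
    ("how", "two"): 1, ("how", "too"): 2,
}

def score(sentence):
    # Sum of add-one-smoothed bigram counts: a stand-in for the
    # language model probability of the candidate English sentence.
    return sum(BIGRAMS.get(pair, 0) + 1
               for pair in zip(sentence, sentence[1:]))

def normalize(sms_tokens):
    # Enumerate every combination of candidate expansions and keep
    # the one the "language model" scores highest.
    options = [CANDIDATES.get(tok, [tok]) for tok in sms_tokens]
    return max(itertools.product(*options), key=score)

print(normalize("do u noe how 2".split()))
# -> ('do', 'you', 'know', 'how', 'to')
```

A real noisy channel system would multiply smoothed language model and translation model probabilities and search the space with a decoder rather than brute-force enumeration, but the disambiguation of "2" to "to" works the same way.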
With machine translation we hope to produce a better alternative for SMS English to English translation, a task known as SMS text normalization in the natural language processing (NLP) community. For this task we use Moses, an open source toolkit for statistical machine translation.

One could argue for a spelling correction approach to the problem, but such systems do not usually consider context and cannot handle forms that are really two words, for example "ru" (are you).

The main contribution of this work is to present, evaluate and contrast the performance of two SMS normalization approaches on different sets of SMS messages from different countries, using off-the-shelf machine translation components.

The rest of the paper is organized as follows. Sections 3 and 4 discuss prior work and corpus preparation. Section 5 explains our approaches and section 6 their evaluation. Section 7 discusses the shortcomings of our results and section 8 concludes.

3. Prior Work

There is some prior work on applying machine translation (MT) to SMS text. Aw et al. [4] use MT in the context of normalizing SMS messages before they are translated into Chinese. They first employ a dictionary substitution approach using frequencies, together with a bigram language model, and compare that to phrase-based machine translation. They show that using MT boosts BLEU scores for SMS English to English translation. Their data consisted of 5000 SMS messages from the NUS corpus [10], which we are also using, so we should be able to obtain comparable results. The corpus does not come with translated SMS messages, so they produced the parallel English text themselves. They did not mention how they handled smiley faces (":)") or punctuation in the messages when translating. They do not use any standard tools for the machine translation task, building these themselves, but they train their n-gram language model on the English Gigaword corpus from the LDC.
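A plain dictionary substitution pass of the kind described above, with no language model behind it, might look like the following sketch; the dictionary entries are illustrative, not drawn from any of the websites or from Aw et al.'s lexicon.

```python
# Minimal dictionary substitution baseline: a fixed token -> replacement
# mapping applied independently to each token, with no context.
# Entries are illustrative examples only.
SMS_DICT = {
    "ru": "are you",   # one SMS token can expand to two English words
    "gr8": "great",
    "l8r": "later",
    "2day": "today",
}

def substitute(message):
    # Replace each token if it appears in the dictionary. Note that an
    # ambiguous token such as "2" could get at most one fixed expansion
    # here, which is exactly the weakness a language model addresses.
    return " ".join(SMS_DICT.get(tok, tok) for tok in message.split())

print(substitute("ru coming 2day"))  # -> "are you coming today"
```

Because the mapping is word-for-word and context-free, multi-word forms like "ru" are easy, but choosing between "to", "two" and "too" for "2" is impossible without a model of the surrounding sentence.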
They do not test on any SMS messages from outside the NUS corpus, and they use the trigram BLEU score for their evaluations.

Choudhury et al. [5] investigate the structure of SMS language, using machine learning techniques to produce a Hidden Markov Model (HMM) of SMS words. They focus on word-level decoding rather than full message translation, e.g. trying to decode SMS words such as 2day (today), fne (phone) and dem (them). The HMM they train allows them to decode unseen words, at which they are relatively successful.

Kobus et al. [6] follow on from Aw et al. and take inspiration from Choudhury by combining machine translation with an automatic speech recognition approach using HMMs. They reason that with the Aw et al. approach "lexical creativity attested in SMS messages can hardly be captured in a static phrase table ... normalized phrases are learned by rote, rather than modeled." So they take the speech-decoding-inspired route of modelling SMS words as an alphabetic/syllabic approximation of the phonetic form, mapping each message to a phone lattice which they then decode.

They use a French corpus of 25000 SMS messages for training and another 11700 to tune their parameters, all lowercased and stripped of punctuation signs. The machine translation system they use is Giza++ [8] along with Moses [9], which we will also be using in our system, alongside the ASR-like system that they built. They show that the two systems separately perform better in different respects, and that when combined they decrease the word error rate by a factor of 1.5, increasing BLEU scores; however, they still get at least one error on 60% of messages.

4. Corpus Preparation

We decided to use the NUS corpus [10] as our main training set, and the smaller HKU [12] and Treasure My Text (TMT) [5] corpora that we found as test sets. All but the TMT corpus had parallel translated text, so we had to tokenize and annotate the data. We gave ourselves about one weekend for this, so we wrote some tools to help with the task.
We created dictionaries and parallel text for about 3000 NUS SMS messages, 408 HKU messages and 427 TMT messages. We decided to strip smileys, keep punctuation, and leave intact words that we thought were names or did not otherwise recognize (for example, the data was from Singapore, so there were some abbreviations we did not know). The corpora were already lowercased, so we did not have to do that ourselves.

To compare with the corpora of the previous work, we draw from the same corpus that Aw et al. used. The parallel text they produced replaced out-of-vocabulary words and non-standard SMS lingo, and removed slang (and local colloquialisms)
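The preprocessing decisions described above (strip smileys, keep punctuation, corpora already lowercased) might be sketched like this; the smiley list is an illustrative subset, not our actual annotation tooling.

```python
import re

# Illustrative subset of smileys; a real pass would use a longer list.
SMILEYS = {":)", ":(", ":-)", ":-(", ";)", ":d"}

def preprocess(line):
    # Strip smileys entirely, but keep punctuation as its own token so
    # the translation model can see (and preserve) it.
    tokens = []
    for tok in line.strip().split():
        if tok in SMILEYS:
            continue  # drop smileys
        # split punctuation off the word while keeping both pieces
        tokens.extend(re.findall(r"\w+|[^\w\s]+", tok))
    return tokens

print(preprocess("haha ok :) see u l8r?"))
# -> ['haha', 'ok', 'see', 'u', 'l8r', '?']
```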

