Context Based Word Prediction for Texting¹ Language

Sachin Agarwal
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
sachina@cs.cmu.edu

Shilpa Arora
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
shilpaa@cs.cmu.edu

¹ SMS/Text language

Abstract

The use of digital mobile phones has led to a tremendous increase in communication using SMS. On a phone keypad, multiple words are mapped to the same numeric code. We propose a Context Based Word Prediction system for SMS messaging in which context is used to predict the most appropriate word for a given code. We extend this system to allow informal words, i.e. short forms of proper English words. The mapping from an informal word to its proper English word is done using Double Metaphone encoding, based on their phonetic similarity. The results show a good improvement over traditional frequency-based word estimation.

1 Introduction

The growth of wireless technology has provided us with many new ways of communication, such as SMS (Short Message Service). SMS messaging can also be used to interact with automated systems, for example to order products and services for mobile phones or to participate in contests. With the tremendous increase in mobile text messaging, there is a need for an efficient text input system. With the limited keys on a mobile phone, multiple letters are mapped to the same key: eight keys (2 to 9) cover the 26 letters of the alphabet. This many-to-one mapping of letters to keys yields the same numeric code for multiple words.

Predictive text systems currently in use rely on frequency-based disambiguation, predicting the most commonly used word over other possible words. T9 (Text on 9 keys) [1], developed by Tegic Communications, is one such predictive text technology, used by LG, Siemens, Nokia, Sony Ericsson and others in their phones. iTap is a similar system developed and used by Motorola in its phones. The T9 system predicts the word for a given numeric code based on frequency alone, which may not give the correct result much of the time. For example, for code 63, two possible words are "me" and "of". Based on a frequency list in which "of" is more likely than "me", T9 will always predict "of" for code 63. So, for a sentence like "Give me a box of chocolate", the prediction would be "Give of a box of chocolate". The sentence itself, however, carries information about the correct word for a given code. Consider the above sentence with blanks: "Give ___ a box ___ chocolate". According to English grammar, it is more likely that "of" follows the noun "box" than "me", i.e. the phrase "box of" is more likely than "box me". The proposed algorithm is an online method that uses this knowledge to predict the correct word for a given code from its preceding context.

An extension of the T9 system, called T12, was proposed by UzZaman et al. [2]. They extend the idea of T9 to what we call informal language, which is used heavily in text messaging. This includes abbreviations, acronyms and short forms of words based on phonetic similarity, e.g. "gr8" for "great". They use the Metaphone encoding [3] technique to find phonetically similar words, and from among those they choose the appropriate word using string matching algorithms such as the edit distance between the word and its normalized form. However, the edit distance measure suggests words such as "create" for the informal word "gr8". In the proposed method, context information is used to choose the appropriate word.
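To make the many-to-one keypad mapping concrete, the following is a minimal Python sketch (our own illustration, not code from any cited system; the function and variable names are hypothetical). It groups dictionary words by their numeric keypad code and shows how "me" and "of" collide on code 63.

    from collections import defaultdict

    # Standard phone keypad: each letter maps to exactly one digit.
    KEYPAD = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
              '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}
    LETTER_TO_DIGIT = {ch: d for d, letters in KEYPAD.items() for ch in letters}

    def word_to_code(word):
        # 'me' -> '63', 'of' -> '63': the mapping is many-to-one.
        return ''.join(LETTER_TO_DIGIT[ch] for ch in word.lower())

    def build_code_to_words(dictionary):
        # Invert the mapping: numeric code -> all dictionary words sharing it.
        code_to_words = defaultdict(list)
        for word in dictionary:
            code_to_words[word_to_code(word)].append(word)
        return code_to_words

    mapper = build_code_to_words(['me', 'of', 'give', 'box', 'chocolate'])
    print(mapper['63'])  # ['me', 'of'] -- frequency alone must break this tie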
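For the T12-style normalization step described above, a standard Levenshtein edit distance can be used to rank phonetically similar candidates. Below is a minimal sketch (our own illustration, not the T12 implementation; T12 compares a word against its normalized form, whereas here, for simplicity, we compare the informal word directly against candidates assumed to come from a Double Metaphone lookup).

    def edit_distance(a, b):
        # Levenshtein distance by dynamic programming over two rows.
        prev = list(range(len(b) + 1))
        for i in range(1, len(a) + 1):
            cur = [i] + [0] * len(b)
            for j in range(1, len(b) + 1):
                cur[j] = min(prev[j] + 1,                           # deletion
                             cur[j - 1] + 1,                        # insertion
                             prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
        return prev[-1]

    # Candidates assumed to be returned by a phonetic (Double Metaphone) lookup.
    # 'great' and 'grate' tie at distance 3: string distance alone cannot use
    # sentence context to break such ties, which motivates our approach.
    candidates = ['great', 'grate', 'create']
    print(sorted(candidates, key=lambda w: edit_distance('gr8', w)))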
2 Problem Statement

Current systems for word prediction in text messaging predict the word for a code based on its frequency, obtained from a large corpus. However, the word at a particular position in a sentence depends on its context, and this intuition motivated us to use machine learning algorithms to predict a word based on its context. The system also takes into consideration the proper English words for the codes corresponding to words in informal language.

Although the method is proposed for a text messaging system, it is applicable in a number of other domains as well. The mixture of informal and formal language discussed here is also used in instant messaging and email. The proposed method can also be used to convert a group of documents written in informal language into formal language. These days even non-personal discussions over email and IM between friends, colleagues and students are conducted in informal language, but if someone were to make formal use of these discussions, our system could automatically perform the conversion or suggest appropriate conversions.

3 Proposed Method

The proposed method uses machine learning algorithms to predict the current word given its code and the previous word's part of speech (POS). The workflow of the system is shown in Figure 1. The algorithm predicts the current word after training a Markov model on the Enron email corpus [4], since short emails closely resemble SMS messages.

[Figure 1: Workflow of the Context Based Word Prediction system for formal language. The input code is passed to a Code-to-Word Mapper, which looks up the valid dictionary words for that code in a dictionary; the Word Predictor then uses the context to output the most probable word.]

The code, the word and its POS are the three random variables in the model. The dependency relationships among these variables can be modeled in different ways; graphical models with different representations of these relationships are discussed below. A bigram model is used to predict the most probable (word, POS) pair given the word's code and the previous word's POS.

I. First Model

[Figure 2: Graphical model used for context-based word prediction.]

In this model, the word depends on its code, and its part of speech depends on the word and on the part of speech of the previous word. Here, C_t denotes the numeric code of the t-th word in a sentence, W_t the t-th word, and S_t the part of speech of the t-th word.

Let W_t, W_{t+1} be a sequence of words, where W_{t+1} is to be predicted and W_t is known. Also, C_{t+1} and S_t are known. We need to learn

P(W_{t+1}, S_{t+1} \mid C_{t+1}, S_t) = \frac{P(W_{t+1}, C_{t+1}, S_{t+1}, S_t)}{P(C_{t+1}, S_t)}.

The joint probability distribution from the graphical model, by the factorization theorem, is

P(W_{t+1}, C_{t+1}, S_{t+1}, S_t) = P(S_{t+1} \mid W_{t+1}, S_t) \, P(W_{t+1} \mid C_{t+1}) \, P(C_{t+1}) \, P(S_t).

Hence,

P(W_{t+1}, S_{t+1} \mid C_{t+1}, S_t) = \frac{P(S_{t+1} \mid W_{t+1}, S_t) \, P(W_{t+1} \mid C_{t+1}) \, P(C_{t+1}) \, P(S_t)}{P(C_{t+1}, S_t)},

where the normalizer marginalizes the joint distribution over the unknowns:

P(C_{t+1}, S_t) = \sum_{W_{t+1}} \sum_{S_{t+1}} P(S_{t+1} \mid W_{t+1}, S_t) \, P(W_{t+1} \mid C_{t+1}) \, P(C_{t+1}) \, P(S_t).
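Since P(C_{t+1}) and P(S_t) are fixed for a given input and P(C_{t+1}, S_t) is a normalizing constant, the most probable (word, POS) pair can be found by maximizing P(S_{t+1} | W_{t+1}, S_t) * P(W_{t+1} | C_{t+1}). The following Python sketch is our own illustration of this argmax, with hypothetical function names and toy probability tables; in the paper these tables would be estimated from the Enron corpus.

    def predict_word(code, prev_pos, code_to_words,
                     p_word_given_code, p_pos_given_word_prevpos):
        # argmax over (word, POS) of P(S_{t+1}|W_{t+1},S_t) * P(W_{t+1}|C_{t+1});
        # P(C_{t+1}), P(S_t) and the denominator are constant for a fixed input.
        best, best_score = None, 0.0
        for word in code_to_words.get(code, []):
            pos_dist = p_pos_given_word_prevpos.get((word, prev_pos), {})
            for pos, p_pos in pos_dist.items():
                score = p_pos * p_word_given_code.get((word, code), 0.0)
                if score > best_score:
                    best, best_score = (word, pos), score
        return best

    code_to_words = {'63': ['me', 'of']}
    p_word_given_code = {('of', '63'): 0.7, ('me', '63'): 0.3}  # toy values
    p_pos_given_word_prevpos = {
        ('of', 'NN'): {'IN': 0.9},   # preposition after a noun: "box of"
        ('me', 'NN'): {'PRP': 0.1},  # pronoun right after a noun is rare
        ('of', 'VB'): {'IN': 0.1},   # preposition right after a verb is rare
        ('me', 'VB'): {'PRP': 0.9},  # pronoun after a verb: "give me"
    }
    print(predict_word('63', 'NN', code_to_words,
                       p_word_given_code, p_pos_given_word_prevpos))
    # -> ('of', 'IN'); with prev_pos='VB' the same code yields ('me', 'PRP')

Note how the same code 63 resolves differently depending on the previous word's POS, which is exactly the behavior that frequency-only prediction cannot provide.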