Context Based Word Prediction for Texting¹ Language

Sachin Agarwal
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
sachina@cs.cmu.edu

Shilpa Arora
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
shilpaa@cs.cmu.edu

Abstract

The use of digital mobile phones has led to a tremendous increase in communication using SMS. On a phone keypad, multiple words map to the same numeric code. We propose a Context Based Word Prediction system for SMS messaging in which context is used to predict the most appropriate word for a given code. We extend this system to allow informal words, i.e., short forms of proper English words. The mapping from an informal word to its proper English words is done using Double Metaphone Encoding, based on their phonetic similarity. The results show a good improvement over traditional frequency-based word estimation.

1 Introduction

The growth of wireless technology has provided many new ways of communication, such as SMS (Short Message Service). SMS messaging can also be used to interact with automated systems, for example to order products and services for mobile phones or to participate in contests. With the tremendous increase in mobile text messaging, there is a need for an efficient text input system. Because a mobile phone has a limited number of keys, multiple letters are mapped to the same number: 8 keys (2 to 9) cover the 26 letters of the alphabet. This many-to-one mapping of letters to numbers assigns the same numeric code to multiple words.

Predictive text systems currently in place use frequency-based disambiguation and rank the most commonly used word above the other possible words. T9 (Text on 9 keys) [1], developed by Tegic Communications, is one such predictive text technology, used by LG, Siemens, Nokia, Sony Ericsson and others in their phones; iTap is a similar system developed and used by Motorola. The T9 system predicts the word for a given numeric code based on frequency alone, which often gives the wrong result. For example, for code 63 two possible words are "me" and "of". Given a frequency list in which "of" is more likely than "me", T9 will always predict "of" for code 63, so a sentence like "Give me a box of chocolate" is rendered "Give of a box of chocolate". Yet the sentence itself gives us information about what the correct word for a given code should be. Consider the above sentence with blanks: "Give _ a box _ chocolate". According to English grammar, it is more likely that "of" follows a noun such as "box" than "me", i.e., the phrase "box of" is more likely than "box me". The proposed algorithm is an online method that uses this knowledge to predict the word for a given code from its preceding context (a small sketch of the keypad mapping and the frequency-only baseline is given at the end of this section).

An extension of the T9 system, called T12, was proposed by UzZaman et al. [2]. It extends the idea of T9 to what we call informal language, which is used heavily in text messaging: abbreviations, acronyms, and short forms of words based on phonetic similarity, e.g., "gr8" for "great". They use the Metaphone Encoding [3] technique to find phonetically similar words, and from among those words they choose the appropriate one using string matching algorithms such as edit distance between the word and its normalized form. However, the edit distance measure suggests words such as "create" for the informal word "gr8". In the proposed method, context information is used instead to choose the appropriate word.

¹ SMS Text language
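The keypad ambiguity and the frequency-only baseline described above can be made concrete with a short sketch. This is a minimal illustration, not T9 itself: the keypad layout is the standard phone assignment, while the vocabulary and its frequency counts are hypothetical stand-ins for a real frequency list.

    # Minimal sketch of the many-to-one keypad mapping and the
    # frequency-only prediction rule. The keypad layout is the standard
    # phone assignment; the word frequencies below are made up.

    KEYPAD = {
        '2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
        '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz',
    }
    LETTER_TO_DIGIT = {ch: d for d, letters in KEYPAD.items() for ch in letters}

    def word_to_code(word):
        """Map a word to its numeric keypad code, e.g. 'of' -> '63'."""
        return ''.join(LETTER_TO_DIGIT[ch] for ch in word.lower())

    # Hypothetical unigram counts standing in for a real frequency list.
    FREQ = {'of': 5000, 'me': 1200, 'give': 250, 'box': 300, 'a': 9000}

    def candidates(code):
        """All dictionary words that share this numeric code."""
        return [w for w in FREQ if word_to_code(w) == code]

    def predict_by_frequency(code):
        """The frequency-only rule the paper argues against."""
        return max(candidates(code), key=FREQ.get)

    print(candidates('63'))            # ['of', 'me'] -- the ambiguity
    print(predict_by_frequency('63'))  # 'of', even where 'me' is intended

Because the rule ignores context, code 63 always resolves to "of", which is exactly the failure on "Give me a box of chocolate" discussed above.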
2 Problem Statement

Current systems for word prediction in text messaging predict the word for a code based on its frequency, estimated from a large corpus. However, the word at a particular position in a sentence depends on its context, and this intuition motivated us to use machine learning algorithms to predict a word based on its context. The system also takes into consideration the proper English words for the codes corresponding to words in informal language.

Although the method has been proposed for a text messaging system, it is applicable in a number of other domains as well. The mixture of informal and formal language discussed here is also used in instant messaging and email. The proposed method can likewise be used to convert a group of documents in informal language into formal language. These days even non-personal discussions over email or IM between friends, colleagues, and students are carried out in informal language; if someone were to make formal use of these discussions, our system could perform the conversion automatically or suggest appropriate conversions.

3 Proposed Method

The proposed method uses machine learning algorithms to predict the current word given its code and the previous word's part of speech (POS). The workflow of the system is shown in Figure 1. The algorithm predicts the current word after training a Markov model on the Enron email corpus [4], since short emails closely resemble SMS messages.

[Figure 1: Workflow for the Context Based Word Prediction system for formal language. A numeric code enters the Code-to-Word Mapper, which uses a dictionary to produce the valid dictionary words for that code; the Word Predictor then uses the context to select the most probable word.]

The code, the word, and its POS are three random variables in the model. The dependency relationships between these variables can be modeled in different ways; graphical models with different representations of this relationship are discussed below. The bigram model is used to predict the most probable (word, POS) pair given the word's code and the previous word's POS.

I. First Model

[Figure 2: Graphical model used for context based word prediction.]

In this model the word depends on its code, and the part of speech depends on the word and on the part of speech of the previous word. Here C_t denotes the numeric code of the t-th word in a sentence, W_t the t-th word, and S_t the part of speech of the t-th word. Let W_t, W_{t+1} be a sequence of words where W_{t+1} is to be predicted and W_t is known; C_{t+1} and S_t are also known. We need to learn

    P(W_{t+1}, S_{t+1} \mid C_{t+1}, S_t) = \frac{P(W_{t+1}, C_{t+1}, S_{t+1}, S_t)}{P(C_{t+1}) \, P(S_t)}

(in this model C_{t+1} and S_t are independent, so P(C_{t+1}, S_t) = P(C_{t+1}) P(S_t)). The joint probability distribution from the graphical model, by the factorization theorem, is

    P(W_{t+1}, C_{t+1}, S_{t+1}, S_t) = P(S_{t+1} \mid W_{t+1}, S_t) \, P(W_{t+1} \mid C_{t+1}) \, P(C_{t+1}) \, P(S_t)

Hence

    P(W_{t+1}, S_{t+1} \mid C_{t+1}, S_t) = P(S_{t+1} \mid W_{t+1}, S_t) \, P(W_{t+1} \mid C_{t+1})

where …
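As a minimal sketch of how this factored score drives prediction: the two conditional probability tables below stand in for estimates that would be learned from a POS-tagged corpus (the paper trains on the Enron email corpus [4]); every probability here is invented purely for illustration. The predictor takes the argmax over candidate (word, POS) pairs of P(S_{t+1} | W_{t+1}, S_t) * P(W_{t+1} | C_{t+1}).

    # A minimal sketch of prediction under the first model, assuming the
    # two conditional probability tables have already been estimated from
    # a POS-tagged corpus. Every probability below is made up.

    from itertools import product

    # P(W_{t+1} | C_{t+1}): word given its numeric code.
    P_WORD_GIVEN_CODE = {
        ('of', '63'): 0.8,
        ('me', '63'): 0.2,
    }

    # P(S_{t+1} | W_{t+1}, S_t): POS of the predicted word given the word
    # and the previous word's POS (Penn Treebank tags).
    P_POS_GIVEN_WORD_PREVPOS = {
        ('IN',  'of', 'NN'): 0.90,  # 'of' after a noun: preposition
        ('IN',  'of', 'VB'): 0.15,
        ('PRP', 'me', 'NN'): 0.10,
        ('PRP', 'me', 'VB'): 0.85,  # 'me' after a verb: pronoun ('give me')
    }

    POS_TAGS = ['IN', 'PRP']

    def predict(code, prev_pos, words):
        """argmax over (word, POS) of
        P(S_{t+1} | W_{t+1}, S_t) * P(W_{t+1} | C_{t+1})."""
        best, best_p = None, 0.0
        for w, s in product(words, POS_TAGS):
            p = (P_POS_GIVEN_WORD_PREVPOS.get((s, w, prev_pos), 0.0)
                 * P_WORD_GIVEN_CODE.get((w, code), 0.0))
            if p > best_p:
                best, best_p = (w, s), p
        return best, best_p

    # 'Give me ...': previous word 'Give' is a verb (VB).
    print(predict('63', 'VB', ['of', 'me']))  # (('me', 'PRP'), 0.17)
    # '... box of ...': previous word 'box' is a noun (NN).
    print(predict('63', 'NN', ['of', 'me']))  # (('of', 'IN'), 0.72)

After a verb the context term overrides the raw code frequency and recovers "me"; after a noun "of" wins, matching the "box of" intuition from the introduction.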

