CMU CS 10701 - agarwal_arora

Context Based Word Prediction for Texting Language¹

Sachin Agarwal ([email protected]) and Shilpa Arora ([email protected])
Language Technologies Institute, School of Computer Science
Carnegie Mellon University, Pittsburgh, PA 15213

¹ Texting language: SMS text language

Abstract

The use of digital mobile phones has led to a tremendous increase in communication using SMS. On a phone keypad, multiple words are mapped to the same numeric code. We propose a Context Based Word Prediction system for SMS messaging in which context is used to predict the most appropriate word for a given code. We extend this system to allow informal words (short forms of proper English words). The mapping from an informal word to its proper English counterparts is done using Double Metaphone encoding, based on their phonetic similarity. The results show a good improvement over traditional frequency-based word prediction.

1 Introduction

The growth of wireless technology has provided us with many new ways of communicating, such as SMS (Short Message Service). SMS messaging can also be used to interact with automated systems, for example to order products and services for mobile phones or to participate in contests. With the tremendous increase in mobile text messaging, there is a need for an efficient text input system. With limited keys on a mobile phone, multiple letters are mapped to the same number (8 keys, 2 to 9, for the 26 letters of the alphabet). This many-to-one mapping of letters to numbers yields the same numeric code for multiple words. Predictive text systems currently in use apply frequency-based disambiguation, predicting the most commonly used word over the other possibilities. T-9 (Text on 9 keys) [1], developed by Tegic Communications, is one such predictive text technology, used by LG, Siemens, Nokia, Sony Ericsson and others in their phones. iTap is a similar system developed and used by Motorola in their phones.
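The many-to-one mapping described above can be sketched as follows (a minimal illustration assuming the standard phone keypad layout; the helper names are our own, not the paper's):

```python
# Standard phone keypad: 8 keys (2-9) cover all 26 letters.
KEYPAD = {
    2: "abc", 3: "def", 4: "ghi", 5: "jkl",
    6: "mno", 7: "pqrs", 8: "tuv", 9: "wxyz",
}
LETTER_TO_DIGIT = {ch: str(d) for d, letters in KEYPAD.items() for ch in letters}

def word_to_code(word):
    """Map a word to its numeric keypad code."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in word.lower())

print(word_to_code("me"))  # 63
print(word_to_code("of"))  # 63 -- the same code, hence the ambiguity
```

Because 'me' and 'of' collide on code 63, a frequency-only predictor must always pick one of them, which motivates the context-based approach below.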
The T-9 system predicts the word for a given numeric code based on frequency alone, which often does not give the correct result. For example, for the code '63', two possible words are 'me' and 'of'. With a frequency list in which 'of' is more likely than 'me', T-9 will always predict 'of' for code '63'. So, for a sentence like 'Give me a box of chocolate', the prediction would be 'Give of a box of chocolate'. The sentence itself, however, gives us information about the correct word for a given code. Consider the above sentence with blanks, "Give _ a box _ chocolate". According to English grammar, 'of' is more likely than 'me' to follow the noun 'box'; that is, the phrase "box of" is more likely than "box me". The proposed algorithm is an online method that uses this knowledge to correctly predict the word for a given code from its preceding context.

An extension of the T-9 system, called T-12, was proposed by UzZaman et al. [2]. They extend the idea of T-9 to what we call informal language, which is used heavily in text messaging. This includes abbreviations, acronyms, and short forms of words based on phonetic similarity (e.g. 'gr8' for 'great'). They use the Metaphone encoding [3] technique to find phonetically similar words, and from among those phonetically similar words they choose the appropriate one using string-matching algorithms such as edit distance between the word and its normalized form. However, the edit-distance measure suggests words such as 'create' for the informal word 'gr8'. In the proposed method, context information is used instead to choose the appropriate word.

2 Problem Statement

Current word-prediction systems for text messaging predict the word for a code based on its frequency, obtained from a huge corpus.
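The 'Give _ a box _ chocolate' example can be made concrete with a toy bigram-count disambiguator (the counts and names here are illustrative assumptions; a real system would estimate them from a corpus):

```python
# Toy bigram counts standing in for corpus-estimated statistics.
BIGRAM_COUNTS = {
    ("give", "me"): 40, ("give", "of"): 2,
    ("box", "of"): 50, ("box", "me"): 1,
}
# Words sharing the ambiguous keypad code '63'.
CODE_TO_WORDS = {"63": ["me", "of"]}

def predict(code, prev_word):
    """Pick the candidate word for `code` most likely to follow `prev_word`."""
    candidates = CODE_TO_WORDS[code]
    return max(candidates, key=lambda w: BIGRAM_COUNTS.get((prev_word, w), 0))

print(predict("63", "give"))  # me
print(predict("63", "box"))   # of
```

Unlike frequency-only T-9, the same code '63' resolves differently depending on the preceding word.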
However, the word at a particular position in a sentence depends on its context, and this intuition motivated us to use machine learning algorithms to predict a word based on its context. The system also takes into consideration the proper English words for the codes corresponding to words in informal language. Although the method has been proposed for a text-messaging system, it is applicable in a number of other domains as well. The mixture of informal and formal language discussed here is also found in instant messaging and email. The proposed method can likewise be used to convert a group of documents written in informal language into formal language. These days, even non-personal discussions over email or IM between friends, colleagues and students are conducted in rather informal language; if someone were to make formal use of these discussions, our system could perform the conversion automatically or suggest appropriate conversions.

3 Proposed Method

The proposed method uses machine learning algorithms to predict the current word given its code and the previous word's part of speech (POS). The workflow of the system is shown in Figure 1. The algorithm predicts the current word after training a Markov model on the Enron email corpus [4], since short emails closely resemble SMS messages.

[Figure 1: Workflow of the Context Based Word Prediction system for formal language. Valid dictionary words and the input code feed a code-to-word mapper; its dictionary words, combined with context, feed the word predictor, which outputs the most probable word.]

The code, the word and its POS are the three random variables in the model. The dependency relationships between these variables can be modeled in different ways; graphical models with different representations of these relationships are discussed below. A bi-gram model is used to predict the most probable word-POS pair given the word's code and the previous word's POS.
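The bi-gram prediction step just described can be sketched as a joint maximization over candidate (word, POS) pairs. All probabilities and tag names below are invented toy values for illustration, not the paper's trained parameters:

```python
# Candidate words for code '63' with their (assumed) POS tags:
# PRP = personal pronoun, IN = preposition (Penn Treebank-style tags).
CODE_TO_WORDS = {"63": {"me": "PRP", "of": "IN"}}

# Toy conditional probabilities standing in for corpus estimates.
P_WORD_GIVEN_CODE = {("me", "63"): 0.3, ("of", "63"): 0.7}
P_POS_GIVEN_PREV = {
    ("PRP", "VB"): 0.5, ("IN", "VB"): 0.1,   # after a verb ('give')
    ("PRP", "NN"): 0.05, ("IN", "NN"): 0.4,  # after a noun ('box')
}

def predict(code, prev_pos):
    """Choose the (word, POS) pair maximizing P(word|code) * P(POS|prev_POS)."""
    word, _pos = max(
        CODE_TO_WORDS[code].items(),
        key=lambda wp: P_WORD_GIVEN_CODE[(wp[0], code)]
                       * P_POS_GIVEN_PREV.get((wp[1], prev_pos), 0.0),
    )
    return word

print(predict("63", "VB"))  # me  (after a verb, as in 'give _')
print(predict("63", "NN"))  # of  (after a noun, as in 'box _')
```

Even though 'of' has the higher unconditional probability (0.7 vs. 0.3), the previous POS flips the prediction to 'me' after a verb, which is exactly the behaviour T-9's frequency list cannot produce.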
I. First Model

[Figure 2: Graphical model used for context-based word prediction.]

In this model, the word depends on its code, and the part of speech depends on the word and on the part of speech of the previous word. Here, C_t refers to the numeric code of the t-th word in a sentence, W_t refers to the t-th word, and S_t refers to the part of speech (POS) of the t-th word.
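Under the dependencies just described (word on code; POS on word and previous POS), the first model's joint distribution would plausibly factorize as follows. This is a reconstruction from the text above, not an equation quoted from the paper:

```latex
P(W_t, S_t \mid C_t, S_{t-1}) \;=\; P(W_t \mid C_t)\, P(S_t \mid W_t, S_{t-1})
```

so the bi-gram prediction step amounts to

```latex
(\hat{W}_t, \hat{S}_t) \;=\; \arg\max_{(w,\,s)} \; P(w \mid C_t)\, P(s \mid w, S_{t-1})
```

i.e. scoring every candidate word for the code jointly with its possible POS tags against the previous word's tag.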

