MIT 6 863J - Words- The Building Blocks of Language - D469978

School name Massachusetts Institute of Technology

Course 6 863j- Natural Language and the Computer Representation of Knowledge

Pages 32

Words: The Building Blocks of Language
    Introduction
    Tokens, Types and Texts
        Extracting Text from Files
        Extracting Text from the Web
        Extracting Text from NLTK Corpora
        Exercises
    Text Processing with Unicode
        What is Unicode?
        Extracting encoded text from files
        Using your local encoding in Python
        Chinese and XML
        Exercises
    Tokenization and Normalization
        Tokenization with Regular Expressions
        Lemmatization and Normalization
        Transforming Lists
        Exercises
    Counting Words: Several Interesting Applications
        Frequency Distributions
        Stylistics
        Aside: Defining Functions
        Lexical Dispersion
        Comparing Word Lengths in Different Languages
        Generating Random Text with Style
        Collocations
        Exercises
    WordNet: An English Lexical Database
        Senses and Synonyms
        The WordNet Hierarchy
        WordNet Similarity
        Exercises
    Conclusion
    Summary
    Further Reading

Chapter 3
Words: The Building Blocks of Language

3.1 Introduction

Language can be divided up into pieces of varying sizes, ranging from morphemes to paragraphs. In this chapter we will focus on words, the most fundamental level for NLP. Just what are words, and how should we represent them in a machine? These questions may seem trivial, but we’ll see that there are some important issues involved in defining and representing words.
Once we’ve tackled them, we’re in a good position to do further processing, such as finding related words and analyzing the style of a text (this chapter), to categorize words (Chapter 4), to group them into phrases (Chapter 7 and Part II), and to do a variety of language engineering tasks (Chapter 5).

In the following sections, we will explore the division of text into words; the distinction between types and tokens; sources of text data including files, the web, and linguistic corpora; accessing these sources using Python and NLTK; stemming and normalization; the WordNet lexical database; and a variety of useful programming tasks involving words.

Note
From this chapter onwards, our program samples will assume you begin your interactive session or your program with:

    import nltk, re, pprint

3.2 Tokens, Types and Texts

In Chapter 1, we showed how a string could be split into a list of words. Once we have derived a list, the len() function will count the number of words it contains:

    >>> sentence = "This is the time -- and this is the record of the time."
    >>> words = sentence.split()
    >>> len(words)
    13

This process of segmenting a string of characters into words is known as tokenization. Tokenization is a prelude to pretty much everything else we might want to do in NLP, since it tells our processing software what our basic units are. We will discuss tokenization in more detail shortly.

We also pointed out that we could compile a list of the unique vocabulary items in a string by using set() to eliminate duplicates:

    >>> len(set(words))
    10

So if we ask how many words there are in sentence, we get different answers depending on whether we count duplicates. Clearly we are using different senses of “word” here. To help distinguish between them, let’s introduce two terms: token and type. A word token is an individual occurrence of a word in a concrete context; it exists in time and space.
A word type is more abstract; it’s what we’re talking about when we say that the three occurrences of 'the' in sentence are “the same word.”

Something similar to a type-token distinction is reflected in the following snippet of Python:

    >>> words[2]
    'the'
    >>> words[2] == words[8]
    True
    >>> words[2] is words[8]
    False
    >>> words[2] is words[2]
    True

The operator == tests whether two expressions are equal, and in this case, it is testing for string-identity. This is the notion of identity that was assumed by our use of set() above. By contrast, the is operator tests whether two objects are stored in the same location of memory, and is therefore analogous to token-identity.

When we used split() to turn a string into a list of words, our tokenization method was to say that any string delimited by whitespace counts as a word token. But this simple approach doesn’t always give the desired results. Also, testing string-identity isn’t a very useful criterion for assigning tokens to types. We therefore need to address two questions in more detail:

    Tokenization: Which substrings of the original text should be treated as word tokens?
    Type definition: How do we decide whether two tokens have the same type?

To see the problems with our first stab at defining tokens and types in sentence, let’s look at the actual tokens we found:

    >>> set(words)
    set(['and', 'this', 'record', 'This', 'of', 'is', '--', 'time.', 'time', 'the'])

Observe that 'time' and 'time.' are incorrectly treated as distinct types since the trailing period has been bundled with the rest of the word. Although '--' is some kind of token, it’s not a word token. Additionally, 'This' and 'this' are incorrectly distinguished from each other, because of a difference in capitalization that should be ignored.

If we turn to languages other than English, tokenizing text is even more challenging. In Chinese text there is no visual representation of word boundaries.
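
A minimal sketch of what missing word boundaries mean in practice, written in Python 3 with only the standard library: whitespace splitting returns the whole string as a single token, and every gap between characters is a possible boundary. The sample string 爱国人 (ai4 guo2 ren2) and the helper name candidate_segmentations are illustrative assumptions, and exhaustive enumeration is shown only to make the ambiguity visible, not as a practical segmentation method.

```python
from itertools import product


def candidate_segmentations(text):
    """Yield every way of cutting text into contiguous, non-empty pieces.

    Hypothetical helper for illustration only; real segmenters do not
    enumerate all 2 ** (len(text) - 1) possibilities.
    """
    if not text:
        return
    gaps = len(text) - 1
    # Each gap between adjacent characters is either a boundary or not.
    for cuts in product([False, True], repeat=gaps):
        pieces, start = [], 0
        for position, is_boundary in enumerate(cuts, start=1):
            if is_boundary:
                pieces.append(text[start:position])
                start = position
        pieces.append(text[start:])
        yield pieces


chinese = "爱国人"

# No whitespace, so split() cannot find any word boundaries.
whitespace_tokens = chinese.split()
print(whitespace_tokens)

# Two gaps between three characters give four candidate segmentations.
all_segmentations = list(candidate_segmentations(chinese))
for pieces in all_segmentations:
    print(pieces)
```

The candidates include both 爱国 + 人 and 爱 + 国人; choosing between them requires knowledge that the characters alone do not supply.
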
Consider the following three-character string: 爱国人 (in pinyin plus tones: ai4 “love” (verb), guo2 “country”, ren2 “person”). This could either be segmented as [爱国]人, “country-loving person” or as 爱[国人], “love country-person.”

The terms token and type can also be applied to other linguistic entities. For example, a sentence token is an individual occurrence of a sentence; but a sentence type is an abstract sentence, without context. If I say the same sentence twice, I have uttered two sentence tokens but only used one sentence type. When the kind of token or type is obvious from context, we will simply use the terms token and type.

To summarize, we cannot just say that two word tokens have the same type if they are the same string of characters. We need to consider a variety of factors in determining what counts as the same word, and we need to be careful in how we identify tokens in the first place.

January 24, 2008 | Bird, Klein & Loper | 3. Words: The Building Blocks of Language | Introduction to Natural Language Processing (DRAFT)

Up till now, we have relied on getting our source texts by defining a string in a fragment of Python code. However, this is impractical for all but the simplest of texts, and makes it hard to present realistic examples. So how do we get larger chunks of text into our programs? In the rest of this section,
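
The two questions raised for the example sentence, which substrings count as word tokens and when two tokens share a type, can be sketched as follows. This is only one possible approach under stated assumptions: tokens are taken to be runs of ASCII letters (the pattern [A-Za-z]+ is an assumption that ignores digits, apostrophes and non-English letters), and types are taken to be lowercased tokens. The function names word_tokens and word_types are hypothetical, and the code is Python 3 using only the standard re module.

```python
import re

sentence = "This is the time -- and this is the record of the time."


def word_tokens(text):
    """Return runs of ASCII letters, dropping punctuation such as '--' and '.'."""
    return re.findall(r"[A-Za-z]+", text)


def word_types(tokens):
    """Treat tokens that differ only in capitalization as the same type."""
    return {token.lower() for token in tokens}


tokens = word_tokens(sentence)
types = word_types(tokens)

print(len(tokens), "tokens:", tokens)
print(len(types), "types:", sorted(types))
```

Compared with the whitespace split, which gave 13 tokens and 10 distinct strings, this sketch yields 12 tokens and 7 types: '--' is no longer counted, 'time.' merges with 'time', and 'This' merges with 'this'.
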
