UMD CMSC 723 - Getting Started on Natural Language Processing with Python - D600877



School name University of Maryland, College Park

Course Cmsc 723- Computational Linguistics I

Pages 16



Getting Started on Natural Language Processing with Python

Nitin [email protected]

(Note: This is a completely revised version of the article that was originally published in ACM Crossroads, Volume 13, Issue 4. Revisions were needed because of major changes to the Natural Language Toolkit project. The code in this version of the article will always conform to the very latest version of NLTK (v2.0b5 as of September 2009). Although the code is always tested, it is possible that a bug or two may have been introduced in the code during the course of this revision. If you find any, please report them to the author. If you are still using version 0.7 of the toolkit for some reason, please refer to http://www.acm.org/crossroads/xrds13-4/natural_language.html).

1 Motivation

The intent of this article is to introduce the readers to the area of Natural Language Processing, commonly referred to as NLP. However, rather than just describing the salient concepts of NLP, this article uses the Python programming language to illustrate them as well. For readers unfamiliar with Python, the article provides a number of references to learn how to program in Python.

2 Introduction

2.1 Natural Language Processing

The term Natural Language Processing encompasses a broad set of techniques for automated generation, manipulation and analysis of natural or human languages. Although most NLP techniques inherit largely from Linguistics and Artificial Intelligence, they are also influenced by relatively newer areas such as Machine Learning, Computational Statistics and Cognitive Science.

Before we see some examples of NLP techniques, it will be useful to introduce some very basic terminology. Please note that as a side effect of keeping things simple, these definitions may not stand up to strict linguistic scrutiny.

• Token: Before any real processing can be done on the input text, it needs to be segmented into linguistic units such as words, punctuation, numbers or alphanumerics.
These units are known as tokens.
• Sentence: An ordered sequence of tokens.
• Tokenization: The process of splitting a sentence into its constituent tokens. For segmented languages such as English, the existence of whitespace makes tokenization relatively easy and uninteresting. However, for languages such as Chinese and Arabic, the task is more difficult since there are no explicit boundaries. Furthermore, almost all characters in such non-segmented languages can exist as one-character words by themselves but can also join together to form multi-character words.
• Corpus: A body of text, usually containing a large number of sentences.
• Part-of-speech (POS) Tag: A word can be classified into one or more of a set of lexical or part-of-speech categories such as Nouns, Verbs, Adjectives and Articles, to name a few. A POS tag is a symbol representing such a lexical category: NN (Noun), VB (Verb), JJ (Adjective), AT (Article). One of the oldest and most commonly used tag sets is the Brown Corpus tag set. We will discuss the Brown Corpus in more detail below.
• Parse Tree: A tree defined over a given sentence that represents the syntactic structure of the sentence as defined by a formal grammar.

Now that we have introduced the basic terminology, let’s look at some common NLP tasks:

• POS Tagging: Given a sentence and a set of POS tags, a common language processing task is to automatically assign a POS tag to each word in the sentence. For example, given the sentence The ball is red, the output of a POS tagger would be The/AT ball/NN is/VB red/JJ. State-of-the-art POS taggers [9] can achieve accuracy as high as 96%. Tagging text with parts-of-speech turns out to be extremely useful for more complicated NLP tasks such as parsing and machine translation, which are discussed below.
• Computational Morphology: Natural languages consist of a very large number of words that are built upon basic building blocks known as morphemes (or stems), the smallest linguistic units possessing meaning.
Computational morphology is concerned with the discovery and analysis of the internal structure of words using computers.
• Parsing: In the parsing task, a parser constructs the parse tree given a sentence. Some parsers assume the existence of a set of grammar rules in order to parse but recent parsers are smart enough to deduce the parse trees directly from the given data using complex statistical models [1]. Most parsers also operate in a supervised setting and require the sentence to be POS-tagged before it can be parsed. Statistical parsing is an area of active research in NLP.
• Machine Translation (MT): In machine translation, the goal is to have the computer translate the given text in one natural language to fluent text in another language without any human in the loop. This is one of the most difficult tasks in NLP and has been tackled in a lot of different ways over the years. Almost all MT approaches use POS tagging and parsing as preliminary steps.

2.2 Python

The Python programming language is a dynamically-typed, object-oriented interpreted language. Although its primary strength lies in the ease with which it allows a programmer to rapidly prototype a project, its powerful and mature set of standard libraries makes it a great fit for large-scale production-level software engineering projects as well. Python has a gentle learning curve and an excellent online learning resource [11].

2.3 Natural Language Toolkit

Although Python already has most of the functionality needed to perform simple NLP tasks, it’s still not powerful enough for most standard NLP tasks. This is where the Natural Language Toolkit (NLTK) comes in [12]. NLTK is a collection of modules and corpora, released under an open-source license, that allows students to learn and conduct research in NLP. The most important advantage of using NLTK is that it is entirely self-contained.
Not only does it provide convenient functions and wrappers that can be used as building blocks for common NLP tasks, it also provides raw and pre-processed versions of standard corpora used in NLP literature and courses.

3 Using NLTK

The NLTK website contains excellent documentation and tutorials for learning to use the toolkit [13]. It would be unfair to the authors, as well as to this publication, to just reproduce their words for the sake of this article. Instead, I will introduce NLTK by showing how to perform four NLP tasks, in increasing order of difficulty. Each task is either an unsolved exercise from the NLTK tutorial or a variant
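
As a rough illustration of the terminology from Section 2.1, the tokenization and POS tagging steps can be sketched in plain Python without NLTK. This is only a toy under stated assumptions, not the article's code and not how NLTK's taggers work: the names `TOY_LEXICON`, `tokenize`, `tag` and `format_tagged` are hypothetical, the lexicon covers only the four words of the example sentence, and the fallback tag for unknown words is a crude guess.

```python
import re

# Hypothetical toy lexicon using the tags quoted in Section 2.1.
# A real tagger would learn this mapping from a tagged corpus.
TOY_LEXICON = {"the": "AT", "ball": "NN", "is": "VB", "red": "JJ"}


def tokenize(sentence):
    """Split a sentence into word tokens and single-character punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def tag(tokens, lexicon=TOY_LEXICON, default="NN"):
    """Look up each token (case-insensitively) in the lexicon.

    Tokens missing from the lexicon get the default tag, which is an
    assumption made for this sketch rather than a linguistic claim.
    """
    return [(token, lexicon.get(token.lower(), default)) for token in tokens]


def format_tagged(tagged):
    """Render (token, tag) pairs in the word/TAG notation used in the text."""
    return " ".join(f"{token}/{pos}" for token, pos in tagged)


tokens = tokenize("The ball is red")
print(tokens)                      # ['The', 'ball', 'is', 'red']
print(format_tagged(tag(tokens)))  # The/AT ball/NN is/VB red/JJ
```

The whitespace-and-punctuation pattern only suits segmented languages such as English; as noted in Section 2.1, languages without explicit word boundaries need a different approach.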
