Getting Started on Natural Language Processing with Python

Nitin Madnani
nmadnani@umiacs.umd.edu

Note: This is a completely revised version of the article that was originally published in ACM Crossroads, Volume 13, Issue 4. Revisions were needed because of major changes to the Natural Language Toolkit project. The code in this version of the article will always conform to the very latest version of NLTK (v2.0b5 as of September 2009). Although the code is always tested, it is possible that a bug or two may have been introduced in the code during the course of this revision. If you find any, please report them to the author. If you are still using version 0.7 of the toolkit for some reason, please refer to http://www.acm.org/crossroads/xrds13-4/natural_language.html.

1 Motivation

The intent of this article is to introduce the readers to the area of Natural Language Processing, commonly referred to as NLP. However, rather than just describing the salient concepts of NLP, this article uses the Python programming language to illustrate them as well. For readers unfamiliar with Python, the article provides a number of references to learn how to program in Python.

2 Introduction

2.1 Natural Language Processing

The term Natural Language Processing encompasses a broad set of techniques for automated generation, manipulation and analysis of natural or human languages. Although most NLP techniques inherit largely from Linguistics and Artificial Intelligence, they are also influenced by relatively newer areas such as Machine Learning, Computational Statistics and Cognitive Science.

Before we see some examples of NLP techniques, it will be useful to introduce some very basic terminology. Please note that as a side effect of keeping things simple, these definitions may not stand up to strict linguistic scrutiny.

Token: Before any real processing can be done on the input text, it needs to be segmented into linguistic units such as words, punctuation, numbers or alphanumerics. These units are known as tokens.
Sentence: An ordered sequence of tokens.
Tokenization: The process of splitting a sentence into its constituent tokens. For segmented languages such as English, the existence of whitespace makes tokenization relatively easier and uninteresting. However, for languages such as Chinese and Arabic, the task is more difficult since there are no explicit boundaries. Furthermore, almost all characters in such non-segmented languages can exist as one-character words by themselves but can also join together to form multi-character words.
Corpus: A body of text, usually containing a large number of sentences.
Part-of-speech (POS) Tag: A word can be classified into one or more of a set of lexical or part-of-speech categories such as Nouns, Verbs, Adjectives and Articles, to name a few. A POS tag is a symbol representing such a lexical category (NN = Noun, VB = Verb, JJ = Adjective, AT = Article). One of the oldest and most commonly used tag sets is the Brown Corpus tag set. We will discuss the Brown Corpus in more detail below.
Parse Tree: A tree defined over a given sentence that represents the syntactic structure of the sentence as defined by a formal grammar.

Now that we have introduced the basic terminology, let's look at some common NLP tasks.

POS Tagging: Given a sentence and a set of POS tags, a common language processing task is to automatically assign a POS tag to each word in the sentence. For example, given the sentence "The ball is red", the output of a POS tagger would be "The/AT ball/NN is/VB red/JJ". State-of-the-art POS taggers [9] can achieve accuracy as high as 96%. Tagging text with parts of speech turns out to be extremely useful for more complicated NLP tasks such as parsing and machine translation, which are discussed below.
Computational Morphology: Natural languages consist of a very large number of words that are built upon basic building blocks known as morphemes (or stems), the smallest linguistic units possessing meaning. Computational morphology is concerned with the discovery and analysis of the internal
structure of words using computers.
Parsing: In the parsing task, a parser constructs the parse tree given a sentence. Some parsers assume the existence of a set of grammar rules in order to parse, but recent parsers are smart enough to deduce the parse trees directly from the given data using complex statistical models [1]. Most parsers also operate in a supervised setting and require the sentence to be POS-tagged before it can be parsed. Statistical parsing is an area of active research in NLP.
Machine Translation (MT): In machine translation, the goal is to have the computer translate the given text in one natural language to fluent text in another language without any human in the loop. This is one of the most difficult tasks in NLP and has been tackled in a lot of different ways over the years. Almost all MT approaches use POS tagging and parsing as preliminary steps.

2.2 Python

The Python programming language is a dynamically-typed, object-oriented, interpreted language. Although its primary strength lies in the ease with which it allows a programmer to rapidly prototype a project, its powerful and mature set of standard libraries make it a great fit for large-scale, production-level software engineering projects as well. Python has a very shallow learning curve and an excellent online learning resource [11].

2.3 Natural Language Toolkit

Although Python already has most of the functionality needed to perform simple NLP tasks, it's still not powerful enough for most standard NLP tasks. This is where the Natural Language Toolkit (NLTK) comes in [12]. NLTK is a collection of modules and corpora, released under an open-source license, that allows students to learn and conduct research in NLP. The most important advantage of using NLTK is that it is entirely self-contained. Not only does it provide convenient functions and wrappers that can be used as building blocks for common NLP tasks, it also provides raw and pre-processed versions of standard corpora used in NLP literature and courses.

3 Using NLTK

The NLTK website contains excellent documentation and tutorials for learning to use the toolkit [13]. It would be unfair to the authors, as well as to this publication, to just reproduce their words for the sake of this article. Instead, I will introduce NLTK by showing how to perform four NLP tasks, in increasing order of difficulty. Each task is either an unsolved exercise from the NLTK tutorial or a variant thereof. Therefore, the solution and analysis of each task represents original content written solely
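
To make the tokenization and POS tagging tasks introduced earlier a little more concrete, here is a minimal sketch in plain Python. It is not NLTK code and not one of the article's four tasks. The function names (tokenize, tag, format_tagged) and the tiny hand-written lexicon are assumptions made up for illustration. Only the four tags named in the text (AT, NN, VB, JJ) are used, and a dictionary lookup stands in for the statistical models that real taggers rely on.

```python
import re

# Hypothetical toy lexicon: lowercase word -> Brown-style tag.
# A real tagger would learn these associations from a tagged corpus.
TOY_LEXICON = {
    "the": "AT",   # article
    "ball": "NN",  # noun
    "is": "VB",    # verb
    "red": "JJ",   # adjective
}


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)


def tag(tokens, lexicon=TOY_LEXICON, default="NN"):
    """Pair each token with a tag found by lookup.

    Unknown tokens get the default tag, which is a crude fallback
    (it will mislabel punctuation and most unseen words).
    """
    return [(tok, lexicon.get(tok.lower(), default)) for tok in tokens]


def format_tagged(tagged):
    """Render (token, tag) pairs in the word/TAG notation used in the text."""
    return " ".join(f"{tok}/{t}" for tok, t in tagged)


tokens = tokenize("The ball is red")
print(tokens)                      # ['The', 'ball', 'is', 'red']
print(format_tagged(tag(tokens)))  # The/AT ball/NN is/VB red/JJ
```

The whitespace-and-punctuation split works here only because English is a segmented language. As the terminology section notes, languages without explicit word boundaries need a more involved tokenizer.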



UMD CMSC 723 - Getting Started on Natural Language Processing with Python
