
CSCI 5832 Natural Language Processing
Lecture 2
Jim Martin
1/18/08

Today 1/17
  - Wrap up last time
  - Knowledge of language
  - Ambiguity
  - Models and algorithms
  - Generative paradigm
  - Finite state methods

Course Material
  We'll be intermingling discussions of:
  - Linguistic topics
      E.g. morphology, syntax, discourse structure
  - Formal systems
      E.g. regular languages, context-free grammars
  - Applications
      E.g. machine translation, information extraction

Linguistics Topics
  - Word-level processing
  - Syntactic processing
  - Lexical and compositional semantics
  - Discourse processing

Topics: Techniques
  - Finite-state methods
  - Context-free methods
  - Augmented grammars
  - Unification
  - Lambda calculus
  - First order logic
  - Probability models
  - Supervised machine learning methods

Topics: Applications
  - Small: spelling correction, hyphenation
  - Medium: word sense disambiguation, named entity recognition, information retrieval
  - Large: question answering, conversational agents, machine translation
  - Stand-alone
  - Enabling applications
  - Funding/Business plans

Just English
  - The examples in this class will for the most part be English.
      Only because it happens to be what I know.
  - This leads to an over-emphasis on certain topics (syntax) to the detriment of others (morphology) due to the properties of English.
  - We'll cover other languages primarily in the context of machine translation.

Commercial World
  - Lots of exciting stuff going on...

Google Translate

Google Translate

Web Q/A

Summarization
  - Current web-based Q/A is limited to returning simple fact-like (factoid) answers (names, dates, places, etc.).
  - Multi-document summarization can be used to address more complex kinds of questions.
      Circa 2002: What's going on with the Hubble?

NewsBlaster Example
  The U.S. orbiter Columbia has touched down at the Kennedy Space Center after an 11-day mission to upgrade the Hubble observatory. The astronauts on Columbia gave the space telescope new solar wings, a better central power unit and the most advanced optical camera. The astronauts added an experimental refrigeration system that will revive a disabled infrared camera. "Unbelievable that we got everything we set out to do accomplished," shuttle commander Scott Altman said. Hubble is scheduled for one more servicing mission in 2004.

Weblog Analytics
  - Textmining weblogs, discussion forums, message boards, user groups, and other forms of user-generated media
      - Product marketing information
      - Political opinion tracking
      - Social network analysis
      - Buzz analysis (what's hot, what topics are people talking about right now)

Web Analytics

Categories of Knowledge
  - Phonology
  - Morphology
  - Syntax
  - Semantics
  - Pragmatics
  - Discourse
  Each kind of knowledge has associated with it an encapsulated set of processes that make use of it. Interfaces are defined that allow the various levels to communicate. This usually leads to a pipeline architecture.

Ambiguity
  - I made her duck.
  - Sources
      - Lexical (syntactic)
          - Part of speech
          - Subcat
      - Lexical (semantic)
      - Syntactic
          - Different parses

Dealing with Ambiguity
  Four possible approaches:
  1. Tightly coupled interaction among processing levels; knowledge from other levels can help decide among choices at ambiguous levels.
  2. Pipeline processing that ignores ambiguity as it occurs and hopes that other levels can eliminate incorrect structures.

Dealing with Ambiguity
  3. Probabilistic approaches based on making the most likely choices.
  4. Don't do anything, maybe it won't matter.
      1. We'll leave when the duck is ready to eat.
      2. The duck is ready to eat now.
      Does the ambiguity matter?

Models and Algorithms
  - By models I mean the formalisms that are used to capture the various kinds of linguistic knowledge we need.
  - Algorithms are then used to manipulate the knowledge representations needed to tackle the task at hand.

Models
  - State machines
  - Rule-based approaches
  - Logical formalisms
  - Probabilistic models

Algorithms
  - Many of the algorithms that we'll study will turn out to be transducers: algorithms that take one kind of structure as input and output another.
  - Unfortunately, ambiguity makes this process difficult. This leads us to employ algorithms that are designed to handle ambiguity of various kinds.

Paradigms
  In particular:
  - State-space search
      To manage the problem of making choices during processing when we lack the information needed to make the right choice.
  - Dynamic programming
      To avoid having to redo work during the course of a state-space search.
      CKY, Earley, Minimum Edit Distance, Viterbi, Baum-Welch
  - Classifiers
      Machine learning based classifiers that are trained to make decisions based on features extracted from the local context.

State Space Search
  - States represent pairings of partially processed inputs with partially constructed representations.
  - Goals are inputs paired with completed representations that satisfy some criteria.
  - As with most interesting problems, the spaces are normally too large to exhaustively explore.
      - We need heuristics to guide the search
      - Criteria to trim the space

Dynamic Programming
  - Don't do the same work over and over.
  - Avoid this by building and making use of solutions to sub-problems that must be invariant across all parts of the space.

Administrative Stuff
  - Mailing list
      - If you're registered, you're on it with your CU account.
      - I sent out mail this morning. Check to see if you've received it.
  - The textbook is now in the bookstore.

First Assignment
  Two parts:
  1. Answer the following question: How many words do you know?
  2. Write a Python program that takes a newspaper article (plain text that I will provide) and returns the number of:
      - Words
      - Sentences
      - Paragraphs

First Assignment Details
  - For the first part I want:
      - An actual number and an explanation of how you arrived at the answer
      - Hardcopy. Bring to class.
  - For the second part, email me your code and your answers to the test text that I will send out shortly before the HW is due.

First Assignment
  - In doing this assignment you should think ahead: having access to the words, sentences and paragraphs will be useful in future assignments.

Getting Going
  - The next two lectures will cover material from Chapters 2 and 3
      - Finite state automata
      - Finite state transducers
      - English morphology

Regular Expressions and Text Searching
  - Everybody does it
      - Emacs, vi, perl, grep, etc.
  - Regular expressions are a compact textual representation of a set of
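
As a rough illustration of regular expressions applied to text, here is a minimal Python sketch of the kind of counting the first assignment describes (words, sentences, and paragraphs in plain text). The function name count_units and all three patterns are assumptions made for illustration; the slides do not define what counts as a word or a sentence, and these naive rules will miscount in cases such as abbreviations ("U.S.") or a final sentence with no closing punctuation.

```python
import re


def count_units(text):
    """Return rough (words, sentences, paragraphs) counts for plain text.

    The definitions below are illustrative assumptions, not the course's.
    """
    # Paragraphs: non-empty blocks separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]

    # Sentences: a run of non-terminator characters followed by ., ! or ?.
    # Abbreviations like "U.S." are over-split, and trailing text without
    # a terminator is not counted.
    sentences = re.findall(r"[^.!?]+[.!?]+", text)

    # Words: runs of letters or digits, allowing internal apostrophes and
    # hyphens so that "We'll" and "over-emphasis" each count once.
    words = re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)

    return len(words), len(sentences), len(paragraphs)
```

For example, count_units("I made her duck. The duck is ready to eat now!\n\nDoes the ambiguity matter?") returns (15, 3, 2). Keeping the three patterns separate makes it easy to tighten one definition (say, sentence boundaries) without touching the others.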



CU-Boulder CSCI 5832 - Lecture 2
