CSCI 5832 Natural Language Processing
Jim Martin
Lecture 22
4/10/08

Today 4/10
- More on IE (Chapter 22)

IE Overview
- Named entity recognition and classification
- Coreference analysis
- Temporal and numerical expression analysis
- Event detection and classification
- Relation extraction
- Template analysis

IE Overview (continued)
- In case it doesn't become totally obvious: this chapter is just a series of reuses of existing techniques to solve specific problems
  - Partial parsing/chunking
  - Cascades
  - ML sequence labeling
  - Classification (ambiguity resolution)

NER
- Find and classify all the named entities in a text.
- What's a named entity?
  - A mention of an entity using its name: Kansas Jayhawks
  - This is a subset of the possible mentions: Kansas Jayhawks, the team, it, they
- Find means identify the exact span of the mention.
- Classify means determine the category of the entity being referred to.

NE Types

NE Types (continued)

Ambiguity

NER Approaches
- As with partial parsing and chunking, there are two basic approaches (and hybrids)
  - Rule-based (regular expressions)
    - Lists of names
    - Patterns to match things that look like names
    - Patterns to match the environments that classes of names tend to occur in
  - ML-based approaches
    - Get annotated training data
    - Extract features
    - Train systems to replicate the annotation

ML Approach

Encoding for Sequence Labeling
- We can use the same IOB encoding here that we used for chunking.
- For N classes we have 2*N + 1 tags
  - An I and a B for each class, and an O for outside any class
- Each token in a text gets a tag.

NER Features

NER as Sequence Labeling

NER Evaluation
- As with chunking, it is a bad idea to evaluate sequence labelers at the tag level.
  - Most labels are O, so just guessing O gives a learning algorithm a lot of credit.
- So we need to evaluate P/R/F at the entity level.
- But we may not care equally about all kinds of entities, so we might weight them differently in the evaluation routine.

Relations
- Once you have captured the entities in a text, you might want to ascertain how they relate to one another.
  - Here we're just talking about explicitly stated relations.

Information Extraction
CHICAGO (AP) Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.

Relation Types
- As with named entities, the list of relations is application specific. For generic news texts...

Relations
- By relation we really mean sets of tuples.
  - Think about populating a database.

Relation Analysis
- As with semantic role labeling, we can divide this task into two parts
  - Determining if 2 entities are related
  - And if they are, classifying the relation
- The reason for doing this is two-fold
  - Cutting down on training time for classification by eliminating most pairs
  - Producing separate feature sets that are appropriate for each task

Relation Analysis (continued)
- Let's just worry about named entities within the same sentence.
  - We'll come back to this when we discuss coreference next week.

Features
- We can group the features (for both tasks) into three categories
  - Features of the named entities involved
  - Features derived from the words between and around the named entities
  - Features derived from the syntactic environment that governs the two entities

Features (continued)
- Features of the entities
  - Their types
    - Concatenation of the types
  - Headwords of the entities (George Washington Bridge)
  - Words in the entities
- Features between and around
  - Particular positions to the left and right of the entities (+/- 1, 2, 3)
  - Bag of words between
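
The entity features and the between/around features listed above can be sketched in a few lines of Python. This is only an illustration, not the lecture's implementation: the `Mention` class, the `pair_features` function, the feature-name strings, and the `ORG`/`PER` type labels are all assumptions made for this example, and taking the last token of a mention as its headword is a rough heuristic. The sentence is the American Airlines one from the news text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    start: int   # index of the first token of the mention
    end: int     # index one past the last token
    etype: str   # entity type label (hypothetical tag set, e.g. "ORG")


def pair_features(tokens, m1, m2, window=3):
    """Binary features for a candidate entity pair; m1 must precede m2."""
    feats = {}
    # Features of the entities: types and their concatenation.
    feats["type1=" + m1.etype] = 1
    feats["type2=" + m2.etype] = 1
    feats["types=" + m1.etype + "-" + m2.etype] = 1
    # Headwords (assumption: the last token of the mention is its head).
    feats["head1=" + tokens[m1.end - 1]] = 1
    feats["head2=" + tokens[m2.end - 1]] = 1
    # Words inside each entity.
    for w in tokens[m1.start:m1.end]:
        feats["in1=" + w] = 1
    for w in tokens[m2.start:m2.end]:
        feats["in2=" + w] = 1
    # Bag of words between the two mentions.
    for w in tokens[m1.end:m2.start]:
        feats["between=" + w] = 1
    # Words at fixed offsets to the left of m1 and to the right of m2.
    for k in range(1, window + 1):
        left = m1.start - k
        right = m2.end - 1 + k
        if left >= 0:
            feats[f"left-{k}=" + tokens[left]] = 1
        if right < len(tokens):
            feats[f"right+{k}=" + tokens[right]] = 1
    return feats


tokens = ("American Airlines , a unit of AMR , immediately matched "
          "the move , spokesman Tim Wagner said .").split()
airline = Mention(0, 2, "ORG")   # "American Airlines"
person = Mention(14, 16, "PER")  # "Tim Wagner"
feats = pair_features(tokens, airline, person)
```

The same feature dictionary could feed both the "are they related?" filter and the relation classifier, although the slides suggest that each task may want its own feature set.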
Features (continued)
- Syntactic environment
  - Constituent path through the tree from one to the other
  - Base syntactic chunk sequence from one to the other
  - Dependency path

Example
- For the following example, we're interested in the possible relation between American Airlines and Tim Wagner.
  - American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.

Bootstrapping Approaches
- What if you don't have enough annotated text to train on?
  - But you might have some seed tuples
  - Or you might have some patterns that work pretty well
- Can you use those seeds to do something useful?
  - Co-training and active learning use the seeds to train classifiers to tag more data to train better classifiers
  - Bootstrapping tries to learn directly (populate a relation) through direct use of the seeds

Bootstrapping Example: Seed Tuple
- <Mark Twain, Elmira> Seed tuple
  - Grep (google)
  - "Mark Twain is buried in Elmira, NY."
    - X is buried in Y
  - "The grave of Mark Twain is in Elmira"
    - The grave of X is in Y
  - "Elmira is Mark Twain's final resting place"
    - Y is X's final resting place
- Use those patterns to grep for new tuples that you don't already know

Bootstrapping Relations

Template Filling
- For stories/texts with stereotypical sequences of events, participants, props, etc.
  - Represent these facts as slots and slot-fillers (templates, frames, scripts, schemas)
- Evoke the right template
- Identify the story elements that fill each slot

Airline Example

Template Filling (continued)
- Two approaches
  - Cascades of transducers
    - À la FASTUS
  - Supervised ML as sequence labeling
    - Two approaches
      - One seq classifier per slot
      - One big sequence classifier

Bioinformatic NLP
- An example domain
  - Very important
  - Practitioners care about the technology
    - They have problems they're trying to solve
  - Lots and lots of text available
  - Lots of interesting problems

Lots of Text

Problem Areas
- Mainly variants of NER and relation analysis
  - NER
    - Detecting and classifying named entities
    - And also normalization: mapping that named entity to a particular entity in some external database or ontology
  - Relation analysis
    - How various biological entities interact

Bio NER
- Large number of fairly specific types
- Wide (really wide) variation in the naming of entities
  - Gene names: White, insulin, BRCA1, ether a go-go, breast cancer associated 1, etc.

Bio NER Types

Bio Relations
- Combination of IE and SRL-style relation analysis

Bioinformatic IE
- Much work in NLP
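
The normalization step described under Problem Areas (mapping a detected name to one entry in an external database) might look like the following minimal sketch. Everything here is assumed for illustration: the `SYNONYMS` table is hand-built, the `GENE:000x` identifiers are placeholders rather than real database accessions, and grouping "BRCA1" with "breast cancer associated 1" as one entry is an assumption based on the gene-name examples above. A real system would draw its synonym lists from the target database or ontology.

```python
import re

# Hypothetical synonym table: canonical surface form -> placeholder ID.
SYNONYMS = {
    "brca1": "GENE:0001",
    "breast cancer associated 1": "GENE:0001",
    "ether a go go": "GENE:0002",
    "insulin": "GENE:0003",
    "white": "GENE:0004",
}


def canonical(mention):
    """Lowercase, turn hyphens into spaces, and collapse whitespace."""
    text = mention.lower().replace("-", " ")
    return re.sub(r"\s+", " ", text).strip()


def normalize(mention):
    """Return the placeholder ID for a mention, or None if it is unknown."""
    return SYNONYMS.get(canonical(mention))


same_entry = normalize("BRCA1") == normalize("Breast cancer-associated 1")
```

Exact lookup after light string cleanup only handles spelling and punctuation variants; the wide variation in naming noted for Bio NER is what makes the real task hard.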