CU-Boulder CSCI 5417 - Lecture 22

CSCI 5417 Information Retrieval Systems
Jim Martin, Lecture 22, 11/10/2011

Today (11/10)
- Wrap up information extraction
- Start sentiment analysis

IE vs. IR
- Operationally, what IR usually comes down to is the retrieval of documents, not the retrieval of information. It's up to a human to extract the needed information out of the text.
- IE is an attempt to automate the extraction of limited kinds of information from free texts.
- These days it's often called text analytics.
- It sits somewhere between NLP and IR.

Information Extraction
- So what is it exactly?
- Figure out the entities (the players, props, instruments, locations, etc. in a text).
- Figure out how they're related to each other and to other entities.
- Figure out what they're all up to: what events they're taking part in.
- Extract information about sentiment and opinion.
- And do each of those tasks in a robust, loosely coupled, data-driven manner.

IE Details
- We'll run through two generic applications in more detail: NER and relations.
- Most other applications are variants on these two.

NER
- Find and classify all the named entities in a text.
- What's a named entity? A mention of an entity using its name, e.g. Kansas Jayhawks.
- This is a subset of the possible mentions: Kansas, Jayhawks, the team, it, they.
- Find means identify the exact span of the mention.
- Classify means determine the category of the entity being referred to.

NE Types / Ambiguity
(Three figure slides: tables of named-entity types and examples of ambiguous names; their content is not reproduced in this text version.)

NER Approaches
- As with many tasks in IE, there are two basic approaches (and hybrids).
- Rule-based (regular expressions):
  - Lists of names
  - Patterns to match things that look like names
  - Patterns to match the environments that classes of names tend to occur in
- ML-based approaches:
  - Get annotated training data
  - Extract features
  - Train systems to replicate the annotation

ML Approach
(Figure slide; its content is not reproduced in this text version.)

Data Encoding for Sequence Labeling
- In NER, we're dealing with spans of text that have been labeled as belonging to some class. So we need to encode the class, the start of the span, and the end of the span in a way that is amenable to supervised ML classifiers.
- That is: here's an object represented as a vector of feature/value pairs, and here's the class that goes along with that vector.
- The trick with sequences is to come up with an encoding that plays well with the typical classifier. The popular solution is to treat the problem as word-by-word tagging: learn to assign a single tag to each word in a sequence.
- The tags are the classifier output; the input is some representation of the word in context.
- The tag sequence captures the class, the span start, and the span end.

IOB Encoding
- A popular way to do this is IOB encoding. Ignoring classes, every word gets a tag of I (inside), O (outside), or B (begins).

  American Airlines, a unit AMR, immediately matched the move, spokesman Tim Wagner said.

  American/B Airlines/I a/O unit/O AMR/B immediately/O matched/O the/O move/O spokesman/O Tim/B Wagner/I said/O
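As a concrete illustration of the word-by-word tagging idea, here is a minimal Python sketch (not from the lecture) that turns labeled token spans into IOB tags. The function name spans_to_iob and the (start, end) span format are assumptions made for illustration only.

  # Minimal sketch (not from the lecture): convert labeled token spans to IOB tags.
  # The span format (start, end) with end exclusive is an illustrative assumption.

  def spans_to_iob(tokens, spans):
      """tokens: list of words; spans: (start, end) token indices, one per mention.
      Returns one IOB tag per token."""
      tags = ["O"] * len(tokens)            # everything starts as Outside
      for start, end in spans:
          tags[start] = "B"                 # first token of the mention Begins it
          for i in range(start + 1, end):
              tags[i] = "I"                 # remaining tokens are Inside
      return tags

  tokens = ["American", "Airlines", "a", "unit", "AMR", "immediately",
            "matched", "the", "move", "spokesman", "Tim", "Wagner", "said"]
  spans = [(0, 2), (4, 5), (10, 12)]        # American Airlines, AMR, Tim Wagner
  for tok, tag in zip(tokens, spans_to_iob(tokens, spans)):
      print(tok, tag)

With classes added, the B and I tags simply pick up a class suffix (B_org, I_per, and so on), which is where the next slide picks up.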
IOB Encoding with Classes
- If we're trying to capture locations, persons, and organizations, we have 3 classes. So we create 3 kinds of B and 3 kinds of I, and leave O as is. That gives us 7 tags.

  American/B_org Airlines/I_org a/O unit/O AMR/B_org immediately/O matched/O the/O move/O spokesman/O Tim/B_per Wagner/I_per said/O

- In general, for N classes, we wind up with 2*N + 1 tags.

Training
- Those tags are now the target classifier outputs. We have one object to be classified for each position (token) in the text.
- The features associated with each position are based on facts about the word at that position and facts extracted from a window surrounding that position.

NER Features
- The word itself
- Word class
- Grammatical chunk
- Capitalization

NER as Sequence Labeling
(Figure slide; its content is not reproduced in this text version.)

Relations
- Once you have captured the entities in a text, you might want to ascertain how they relate to one another.
- Here we're only talking about explicitly stated relations.

Example text:
  CHICAGO (AP) -- Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.

Relation Types
- As with named entities, the list of relations is application specific. For generic news texts...
(Figure slide with a table of generic relation types; not reproduced in this text version.)

Relations
- By relation we really mean sets of tuples. Think about populating a database.

Relation Analysis
- We can divide this task into two parts: determining whether two entities are related at all, and, if they are, classifying the relation (see the sketch at the end of these notes).
- The reason for doing this is two-fold:
  - Cutting down on training time for classification by eliminating most pairs
  - Producing separate feature sets that are appropriate for each task

Features
- We can group the features (for both tasks) into three categories:
  - Features of the named entities involved
  - Features derived from the words between and around the named entities
  - Features derived from the syntactic environment that governs the two entities

Features of the entities:
  - Their types
  - Concatenation of the types
  - Headwords of the entities (e.g., George Washington Bridge)
  - Words in ... (preview truncated)
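As referenced above, here is a minimal Python sketch of the two-stage relation analysis: stage 1 decides whether a pair of mentions is related at all, stage 2 classifies the relation for the pairs that survive. The feature names follow the slides (entity types, their concatenation, headwords, words between the mentions), but the Mention structure and the two classifier callables are illustrative assumptions, not part of the lecture.

  # Minimal sketch (not from the lecture) of two-stage relation analysis.
  # The Mention tuple and the stand-in classifiers are illustrative assumptions.

  from collections import namedtuple
  from itertools import combinations

  # A mention: its entity type (PER, ORG, ...), its headword, and its token span.
  Mention = namedtuple("Mention", "etype head start end")

  def pair_features(m1, m2, tokens):
      """Features for a candidate pair, following the slides: the entity types,
      their concatenation, the headwords, and the words between the mentions."""
      between = tokens[m1.end:m2.start]
      return {
          "type1": m1.etype,
          "type2": m2.etype,
          "type_pair": m1.etype + "-" + m2.etype,
          "head1": m1.head,
          "head2": m2.head,
          "words_between": " ".join(between),
      }

  def extract_relations(tokens, mentions, is_related, relation_type):
      """is_related and relation_type stand in for trained classifiers."""
      relations = []
      for m1, m2 in combinations(mentions, 2):
          feats = pair_features(m1, m2, tokens)
          if is_related(feats):                                # stage 1: filter
              relations.append((m1.head, relation_type(feats), m2.head))  # stage 2: classify
      return relations

  # Toy usage with hand-written stand-ins for the two classifiers.
  tokens = "American Airlines a unit AMR immediately matched the move".split()
  mentions = [Mention("ORG", "Airlines", 0, 2), Mention("ORG", "AMR", 4, 5)]
  is_related = lambda f: "unit" in f["words_between"]
  relation_type = lambda f: "part-of"
  print(extract_relations(tokens, mentions, is_related, relation_type))

In a real system the two stand-in functions would be trained classifiers over the three feature groups listed above; the point here is only the two-stage control flow, which keeps the expensive relation classifier off the overwhelming majority of unrelated pairs.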

