UMass Amherst CMPSCI 591N - Computational Linguistics

Collocations
Lecture #12
Computational Linguistics, CMPSCI 591N, Spring 2006
University of Massachusetts Amherst
Andrew McCallum

Words and their meaning
• Word disambiguation
  – one word, multiple meanings
• Word clustering
  – multiple words, “same” meaning
• Collocations – this lecture
  – multiple words together, with a different meaning than the sum of their parts
  – Simple measures on text, yielding interesting insights into language, meaning, and culture.

Today’s Main Points
• What is a collocation?
• Why do people care?
• Three ways of finding them automatically.

Collocations
• An expression consisting of two or more words that corresponds to some conventional way of saying things.
• Characterized by limited compositionality.
  – compositional: the meaning of an expression can be predicted from the meaning of its parts.
  – “strong tea”, “rich in calcium”
  – “weapons of mass destruction”
  – “kick the bucket”, “hear it through the grapevine”

Collocations are important for…
• Terminology extraction
  – finding special phrases in technical domains
• Natural language generation
  – to make the output sound natural
• Computational lexicography
  – to automatically identify phrases to be listed in a dictionary
• Parsing
  – to give preference to parses containing natural collocations
• The study of social phenomena
  – like the reinforcement of cultural stereotypes through language (Stubbs 1996)

Contextual Theory of Meaning
• In contrast with “structural linguistics”, which emphasizes abstractions and properties of sentences,
• the Contextual Theory of Meaning emphasizes the importance of context:
  – context of the social setting (not an idealized speaker)
  – context of the discourse (not a sentence in isolation)
  – context of the surrounding words
• Firth: “a word is characterized by the company it keeps”
• Example [Halliday]
  – “strong tea”, coffee, cigarettes
  – “powerful drugs”, heroin, cocaine
  – Important for idiomatically correct
English, but also for the social implications of language use.

Method #1: Frequency
Count   Bigram
80871   of the
58841   in the
26430   to the
21842   on the
21839   for the
18568   and the
16121   that the
15630   at the
15494   to be
13899   in a
13689   of a
13361   by the
13183   with the
12622   from the
11428   New York
10007   he said

Method #1: Frequency with POS Filter
Tag patterns: A N, N N, A A N, A N N, N A N, N N N, N P N
Count   Bigram           POS
11487   New York         A N
 7261   United States    A N
 5412   Los Angeles      N N
 3301   last year        A N
 3191   Saudi Arabia     N N
 2699   last week        A N
 2514   vice president   A N
 2378   Persian Gulf     A N
 2161   San Francisco    N N
 2106   President Bush   N N
 2001   Middle East      A N
 1942   Saddam Hussein   N N
 1867   Soviet Union     A N
 1850   White House      A N
 1633   United Nations   A N
 1328   oil prices       N N
 1210   next year        A N
 1074   chief executive  A N
 1073   real estate      A N

Method #2: Mean and Variance
• Some collocations consist not of adjacent words, but of words in a more flexible distance relationship
  – she knocked on his door
  – they knocked at the door
  – 100 women knocked on Donaldson’s door
  – a man knocked on the metal front door
• Not a constant distance relationship
• But enough evidence that “knock” is a better choice than “hit”, “punch”, etc.

Method #2: Mean and Variance
• To ask about the relationship between “stocks” and “crash”, gather many such pairs and calculate the mean and variance of their offsets.
Sentence: Stocks crash as rescue plan teeters.
Time-shifted bigrams:
Offset:   1               2               3
          stocks crash    stocks as       stocks rescue
          crash as        crash rescue    crash plan
          as rescue       as plan         as teeters
          ...

Method #2: Mean and Variance
[Histogram: position of “strong” relative to “opposition” (mean = -1.15, deviation = 0.67)]

Method #2: Mean and Variance
[Histogram: position of “strong” relative to “support” (mean = -1.45, deviation = 1.07)]
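The mean-and-variance method above can be sketched in a few lines of Python: collect the signed offsets at which two words co-occur within a small window, then take the mean and standard deviation of those offsets. This is a minimal sketch, not code from the lecture; the toy corpus, the window size, and the use of the sample standard deviation are all illustrative assumptions.

```python
from statistics import mean, stdev

def pair_offsets(tokens, w1, w2, window=4):
    """Signed offsets (position of w2 minus position of w1) for every
    co-occurrence of w1 and w2 within +/- window tokens of each other."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    offsets = []
    for i in pos1:
        for j in pos2:
            if 0 < abs(j - i) <= window:
                offsets.append(j - i)
    return offsets

# Toy corpus built from the "knocked ... door" examples on the slide
# (hypothetical data, only for illustration).
tokens = ("she knocked on his door they knocked at the door "
          "a man knocked on the metal front door").split()

offs = pair_offsets(tokens, "knocked", "door", window=5)
print(offs)                 # [3, -2, 3, -3, 5]
print(mean(offs))           # 1.2
print(round(stdev(offs), 2))
```

A tight deviation (as with “New York” on the next slides) signals a rigid collocation; a large deviation (as with “strong … for”) signals words that merely tend to occur near each other.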
Method #2: Mean and Variance
[Histogram: position of “strong” relative to “for” (mean = -1.12, deviation = 2.15)]

Method #2: Mean and Variance
dev    mean   count   Word 1       Word 2
0.43   0.97   11657   New          York
0.48   1.83      24   previous     games
0.15   2.98      46   minus        points
0.49   3.87     131   hundreds     dollars
4.03   0.44      36   editorial    Atlanta
4.03   0.00      78   ring         New
3.96   0.19     119   point        hundredth
3.96   0.29     106   subscribers  by

Method #3: Likelihood Ratios
• Determine which of two probabilistic models is more appropriate for the data.
  – H1 = hypothesis of model 1
  – H2 = hypothesis of model 2
• Hypothesis 1: p(w2|w1) = p = p(w2|~w1)
• Hypothesis 2: p(w2|w1) = p1 ≠ p2 = p(w2|~w1)
• Data
  – N = total count of all words
  – c1 = count of word 1
  – c2 = count of word 2
  – c12 = count of the bigram word1 word2

Method #3: Likelihood Ratios
• Determine which of two probabilistic models is more appropriate for the data.
                                        H1                    H2
P(w2|w1)                                p = c2/N              p1 = c12/c1
P(w2|~w1)                               p = c2/N              p2 = (c2-c12)/(N-c1)
c12 out of c1 bigrams are w1 w2         b(c12; c1, p)         b(c12; c1, p1)
c2-c12 out of N-c1 bigrams are ~w1 w2   b(c2-c12; N-c1, p)    b(c2-c12; N-c1, p2)

Method #3: Likelihood Ratio example data
-2 log λ     c1     c2   c12   w1            w2
    1291   12593    932   150  most          powerful
      99     379    932    10  politically   powerful
      82     932    934    10  powerful      computers
      80     932   3424    13  powerful      force
      57     932    291     6  powerful      symbol
      51     932     40     4  powerful      lobbies
      51     171    932     5  economically  powerful
      51     932     43     4  powerful      magnet
      50    4458    932    10  less          powerful
      50    6252    932    11  very          powerful
      49     932   2064     8  powerful      position
      48     932    591     6  powerful      machines
      47     932   2339     8  powerful      computer
      43     932    396     5  powerful      magnets

Collocation studies helping lexicography
• We want to help dictionary writers bring out the differences between “strong” and “powerful”
  – understand the meaning of a word by the company it keeps.
• Church and Hanks (1989), through statistical analysis, concluded that it is a matter of intrinsic vs. extrinsic quality.
• “strong” support from a demographic group means committed, but may not have capability.
• A “powerful” supporter is one who actually has the capability to change things.
• But there are also additional subtleties, which help us analyze cultural attitudes
  – “strong tea” versus “powerful drugs”

Method #1: “strong” versus “powerful”
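The -2 log λ scores in the example table can be reproduced from the four counts on the slide. The sketch below implements the likelihood ratio of Method #3 under the binomial model shown above; the corpus size N = 14,307,668 is an assumption, taken from the New York Times corpus in Manning and Schütze's textbook treatment of this same example, since the slide does not state N.

```python
from math import log

def log_l(k, n, x):
    # Log of the binomial likelihood b(k; n, x), dropping the binomial
    # coefficient, which cancels in the ratio. Assumes 0 < x < 1.
    return k * log(x) + (n - k) * log(1 - x)

def neg2_log_lambda(c1, c2, c12, N):
    # H1 (independence): p(w2|w1) = p(w2|~w1) = p
    # H2 (dependence):   p(w2|w1) = p1 != p2 = p(w2|~w1)
    p = c2 / N
    p1 = c12 / c1
    p2 = (c2 - c12) / (N - c1)
    return 2 * (log_l(c12, c1, p1) + log_l(c2 - c12, N - c1, p2)
                - log_l(c12, c1, p) - log_l(c2 - c12, N - c1, p))

# "most powerful" row of the table: c1 = 12593, c2 = 932, c12 = 150
score = neg2_log_lambda(12593, 932, 150, N=14_307_668)
print(round(score))  # close to the 1291 listed on the slide
```

With the assumed N, the "most powerful" and "powerful computers" rows come out close to the tabled values of 1291 and 82, which is some evidence the counts are drawn from that corpus.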

