CORNELL CS 674 - Study Notes

Comparing a Linguistic and a Stochastic Tagger
Christer Samuelsson and Atro Voutilainen

Overview
- Compares an HMM tagger to EngCG-2: the HMM is statistical, EngCG is based on hand-coded linguistic rules
- An attempt to allay fears of bias in previous EngCG results
- The original EngCG work reported 99.7% correct analyses (with some small ambiguity remaining), and the validity of these results was questioned. Skeptics said:
  - Even human linguists can only agree about 97% of the time; how can a machine reach 99% accuracy?
  - The test corpus may be biased toward high performance for EngCG
  - The EngCG tag set may be so basic that it makes POS tagging easy
  - The low error rate may be due to high remaining ambiguity

Background: How Does EngCG Work?
- Sequentially applied modules:
  - Morphological analyser: assigns all possible POS tags to each word; heuristics determine the possible tags of unseen words. For example, "free" receives five readings:

        "<free>"
          "free" A ABS
          "free" <SVO> V SUBJUNCTIVE VFIN
          "free" <SVO> V IMP VFIN
          "free" <SVO> V INF
          "free" <SVO> V PRES -SG3 VFIN

  - Disambiguator: removes illegitimate analyses; can leave ambiguity!
  - Optionally, application-specific heuristics / statistical disambiguators for still-ambiguous words

Background: The Disambiguator
- Multiple passes (5 subgrammars)
- Starts with very reliable rules, e.g.

        REMOVE (V)(-1C DET) ;

  which discards a verb reading when the word immediately to the left is unambiguously ("C", careful mode) a determiner (a toy illustration follows below)
- Proceeds to rough heuristics; error rates increase to 10%-30% in the final two subgrammars

  Subgrammar   # Rules   % Extra Senses Removed   % Correct Remain
  1             2967            91.70                  99.88
  2              158            92.87                  99.86
  3              374            94.42                  99.85
  4               71            95.74                  99.71
  5               44            96.54                  99.55

  (read cumulatively: each row gives the figures after applying subgrammars 1 through n)
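To make the rule format concrete, below is a minimal Python sketch of how such a REMOVE rule might be applied. It is an illustration under simplifying assumptions, not EngCG's actual engine: the token representation and the function name are invented here, and the real condition language is far richer (barriers, clause boundaries, rule ordering across subgrammars).

    # Toy illustration of the Constraint Grammar rule REMOVE (V)(-1C DET):
    # discard verb readings when the word immediately to the left is
    # unambiguously ("C", careful mode) a determiner. Simplified sketch,
    # not EngCG's real rule engine.

    def apply_remove_v_after_det(sentence):
        for i in range(1, len(sentence)):
            left = sentence[i - 1]["readings"]
            cur = sentence[i]["readings"]
            # -1C DET: the previous token has exactly one reading, and it is DET
            if len(left) == 1 and "DET" in next(iter(left)):
                non_verb = {r for r in cur if "V" not in r.split()}
                if non_verb:  # a rule never removes the last surviving reading
                    sentence[i]["readings"] = non_verb
        return sentence

    # "the free ...": 'free' starts out five-ways ambiguous (see above)
    sent = [
        {"word": "the",  "readings": {"DET"}},
        {"word": "free", "readings": {"A ABS", "V SUBJUNCTIVE VFIN",
                                      "V IMP VFIN", "V INF",
                                      "V PRES -SG3 VFIN"}},
    ]
    apply_remove_v_after_det(sent)
    print(sent[1]["readings"])  # {'A ABS'}: only the adjective reading survives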
Issue One: Maximum Accuracy
- Samuelsson and Voutilainen believe inter-linguist agreement can approach 100%
- In creating the benchmark corpus, agreement between the two experts was 99.3% before corrections; after correction of simple errors it reached 99.96%
- Two features of their setup help:
  - The EngCG tag set avoids semantically motivated tags
  - The linguists have a "Grammarian's Manual" covering the most common ambiguous cases and their correct resolution
- Statistical tests show, with 95% confidence, that human evaluators agree more than 99.2% of the time on average under these conditions

Issue Two: Bias in Corpora
- Because the paper's focus is an unbiased comparison, the methods used to create the corpora are especially important
- Two corpora were used:
  - Training corpus: a 357,000-word sample from the Brown corpus, used to train the HMM
  - Test corpus: a 55,000-word sample of journalistic, scientific, and manual texts, with no subject overlap with the training corpus. Does this help EngCG?

The Training Corpus
- Annotated with EngCG tags: the first pass was the original EngCG algorithm, and remaining ambiguities were resolved by an expert
- Used in testing EngCG-2, and continually improved as new rules were tested and deployed
- Does this lead to a bias favoring EngCG?
  - If the corpus is tagged by EngCG, that sets an upper bound on how well the HMM can perform: imagine EngCG were only 50% accurate; the HMM could then never do better than 50%
  - However, this is standard practice in NLP, and given many iterations of testing and correction, most incorrect classifications were most likely weeded out

The Test Corpus
- First analyzed using only the morphological analyzer, then independently disambiguated by two linguists
- Agreement reached 99.96%: after clerical errors were corrected, the only disagreements were 21 words (out of 55,000) that were genuinely ambiguous at the meaning level
- The final "consensus corpus" was made from one of the two disambiguated versions

Issue Three: Simple Tagset
- The claim: EngCG performs so well only because its tagset is so simple that annotating corpora is trivial
- While tagsets can't be compared directly, their relative "difficulty" can be: train the same algorithm with two different tagsets and compare error rates
- Here, the HMM's performance with the EngCG tagset was compared to its performance with more common tagsets, and was found to be similar

Issue Four: Ambiguity / Accuracy Tradeoff
- The claim: EngCG performs so well only because of the ambiguity that remains in its POS assignments
- This can't be disproven without forcing EngCG to disambiguate fully, so rather than removing ambiguity from EngCG, the authors allowed it in the HMM:
  - When annotating with the HMM, tags with probabilities over a certain threshold are assigned to the word in addition to the most probable tag (a sketch of this thresholding appears at the end of these notes)
  - Varying the threshold varies the allowable ambiguity, so the HMM's ambiguity can be set equal to EngCG's
- Issues? The HMM was not designed to work this way and may not take advantage of the allowed ambiguity as much as EngCG does

Experiment
- First, test the HMM on the Brown corpus at various training-set sizes:
  - Hold back 35,000 words from the training corpus
  - Train the HMM on successively larger chunks of the remaining words, evaluating on the held-back subset
- Main experiment:
  - HMM: train on the full 357,000 words, then test on the 55,000-word test corpus at varying levels of allowable ambiguity
  - EngCG: run on the entire test corpus at varying levels of ambiguity (varying the number of subgrammars used)
  - Compare the HMM and EngCG at the same ambiguity levels

Results: HMM Testing
- Learning curve of the HMM with respect to training-set size
- The paper states the curve "has leveled off at 322,000 words, indicating that little is to be gained from further training"
- Has it? Remember "Scaling to Very Very Large Corpora for Natural Language Disambiguation" (Banko and Brill 2001): learning curves that look flat at hundreds of thousands of words can keep improving with orders of magnitude more data

Results: Algorithm Comparison
- EngCG dominates at comparable ambiguity levels: its error rate ranges from 8.6 to 28 times smaller than the HMM's
- However, the HMM's accuracy here is also about 1% lower than when it is trained and tested on subsets of the Brown corpus alone
- This indicates that training the HMM on a larger corpus, and/or one that includes documents similar to the benchmark corpus, could improve its performance

Discussion
- Caveats of EngCG:
  - Vastly more work to create. However, Chanod and Tapanainen (1995) suggest that, given a limited amount of time to build both an HMM and a constraint-based system, the constraint-based system still outperforms the HMM
  - Does not disambiguate fully, and is therefore unsuitable for some tasks. Could be corrected for by using ...
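As an end note, here is a minimal Python sketch of the ambiguity thresholding described under Issue Four and used in the experiment. It is a sketch under assumptions, not the paper's implementation: the function name is invented, the per-tag probabilities are assumed to come from the HMM (e.g., forward-backward posteriors), and the numbers are illustrative.

    # Sketch of threshold-based ambiguous tagging: besides the single most
    # probable tag, keep every tag whose probability clears a threshold.
    # Lowering the threshold admits more ambiguity; raising it toward 1.0
    # forces a single tag. Illustrative only; not the paper's code.

    def tags_above_threshold(tag_probs, threshold):
        # tag_probs: hypothetical dict of tag -> P(tag | sentence) for one
        # word, e.g. posteriors from an HMM's forward-backward pass
        best = max(tag_probs, key=tag_probs.get)
        kept = {tag for tag, p in tag_probs.items() if p >= threshold}
        kept.add(best)  # the most probable tag is always assigned
        return kept

    # Invented numbers for one word, just to show the mechanism:
    tag_probs = {"NN": 0.55, "VB": 0.30, "JJ": 0.15}
    print(tags_above_threshold(tag_probs, 0.25))  # {'NN', 'VB'}: ambiguous
    print(tags_above_threshold(tag_probs, 0.60))  # {'NN'}: fully disambiguated

Sweeping the threshold until the average number of tags per word matches EngCG's residual ambiguity is what lets the two systems be compared at equal ambiguity levels.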

