CORNELL CS 674 - Leg Segmentation

This preview shows page 1-2 out of 5 pages.

Mostly-Unsupervised Statistical Segmentation of Japanese: Applications to Kanji
Rie Kubota and Lillian Lee
Cornell University

Japanese NLP
• Words/characters are unspaced, so segmentation is an essential first step
• Current methods employ:
– Pre-existing lexicon
– Pre-existing grammar
– Pre-segmented data
• English parallel: “theyouthevent”

Japanese Language
• 3 types of characters
– kanji, hiragana, katakana
– Are used within the same document, sentence, etc. (helps find <60% word boundaries)
– The latter 2 often represent sounds (like English characters)

Kanji
• Are often:
– Domain terms or proper nouns (unknown word problem, important for IR)
– Compound nouns (POS doesn’t help)
• >3 characters are often >1 word

What’s Coming in this Paper?
• Use of statistical analysis only, no language
• No rules specific to Japanese
• Requires very few (>=5) labeled training examples
• Requires large amounts of unsegmented data
• For long kanji strings, performance rivals current morphological models

How it Works
• Calculates n-gram frequency over the training corpus
• Is [#(s_i) > #(t_j)] ?

How it Works (N=4)
A B C D W X Y Z
• Is [#(s_i) > #(t_j)] ?
• There are 5 4-grams in this sequence. With grouping, there are 2 x 3 = 6 greater-than expressions to evaluate

How it Works
• Select which integers n ∈ N to use for the n-gram calculations, do the math, then determine word boundaries.

Experimental Methods
• Data from 150 MB of Nikkei newswire, 1993
• Pick 5 held-out sets. Each…
– 500 randomly chosen kanji sequences of length >= 10 (12 on avg)
• Annotate held-out sets. Divide each into a parameter-training (50) and test (450) set
[Figure: split of each 500-sequence held-out set into 450 test and 50 parameter-training sequences]

Segmenting Rules
• Word level – 1 word: (prefix+word+suffix)
• Morpheme level – 3 words: (prefix)(word)(suffix)
• 3 people had 98.42% agreement, all disagreement at morpheme level

Methods
• Morphological algorithms to compare to:
– have access to lexicons of size 115,000 and 231,000
– used training data by adding it to their lexicons
• Parameters for the current method
– N: subsets of {2, 3, 4, 5, 6} (power set)
– l = 0.05k for 0 <= k <= 20

Evaluation
• Precision: “percentage of proposed brackets that exactly match word-level brackets in the annotation” = (# brackets right)/(# brackets proposed)
• Recall: “percentage of word-level annotation brackets that are proposed by the algorithm” = (# brackets right)/(# actual brackets)
• F-measure = 2PR / (P + R)

Segmentation Results
[Results chart not preserved in the extracted text]

Incompatible? Use New Metrics
• Crossing Bracket – “a proposed bracket that overlaps but is not contained within an annotation bracket”
• Morpheme Dividing Bracket – “subdivides a morpheme level annotation bracket”
• Compatible Brackets – neither of the above
• All-Compatible Brackets – fraction of sequences in which all proposed brackets are compatible

Results with New Metrics
[Results chart not preserved in the extracted text]

Discussion – Manual Effort
• Required annotation
– only the 50-sequence parameter-training sets (42 min)
– other methods require 1000-190,000 sentences
• Authors had some success with as few as 5 sequences (4 min)

My Thoughts
• Purely statistical models are new
• This could work for other languages (Chinese), but would it do English well?
• The ‘>’ heuristic: “conjecture that using absolute differences may have an adverse effect”

Summary
• Purely Statistical Model
– No lexicon or grammar
• Good Performance
– Almost as good as, if not better than, other systems
• New
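
The "How it Works" slides only outline the counting test, so the following is a minimal Python sketch of one way to read it, not the authors' implementation. For a gap in a sequence, it compares the count of each n-gram that ends or starts at the gap (s_i) with the count of each n-gram that straddles the gap (t_j), and reports the fraction of comparisons where #(s_i) > #(t_j). The function names, the toy corpus, the handling of sequence edges, and the plain averaging over n are all assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(corpus, n_values):
    """Count character n-grams of every order in n_values over unsegmented text."""
    counts = Counter()
    for line in corpus:
        for n in n_values:
            for i in range(len(line) - n + 1):
                counts[line[i:i + n]] += 1
    return counts


def vote(seq, k, n, counts):
    """Fraction of #(s_i) > #(t_j) comparisons that hold at the gap before seq[k]."""
    # s_i: the n-grams that end or start exactly at the gap (at most 2).
    sides = []
    if k - n >= 0:
        sides.append(seq[k - n:k])
    if k + n <= len(seq):
        sides.append(seq[k:k + n])
    # t_j: the n-grams that straddle the gap (at most n - 1).
    straddlers = [
        seq[k - j:k - j + n]
        for j in range(1, n)
        if k - j >= 0 and k - j + n <= len(seq)
    ]
    pairs = [(s, t) for s in sides for t in straddlers]
    if not pairs:
        return 0.0  # assumption: no available comparison means no support
    wins = sum(counts[s] > counts[t] for s, t in pairs)
    return wins / len(pairs)


def average_vote(seq, k, n_values, counts):
    """Assumed combination step: the plain mean of vote() over the chosen orders."""
    return sum(vote(seq, k, n, counts) for n in n_values) / len(n_values)


# Toy corpus (made up): ABCD and WXYZ recur, the straddling 4-grams do not.
corpus = ["ABCD", "ABCD", "WXYZ", "WXYZ", "ABCDWXYZ"]
counts = ngram_counts(corpus, [4])
seq = "ABCDWXYZ"
print(vote(seq, 4, 4, counts))  # gap between D and W: all 2 x 3 = 6 comparisons hold
print(vote(seq, 3, 4, counts))  # gap between C and D: none of the 3 available hold
```

With N=4 and the gap between D and W, the sketch evaluates exactly the 2 x 3 = 6 comparisons mentioned on the slide. How the resulting scores are turned into word boundaries is not spelled out in the preview text, so that step is left out here.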
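
The precision, recall, and F-measure definitions on the "Evaluation" slide can be checked with a short sketch. It assumes that a segmentation is given as a list of words and that a bracket is the (start, end) character span of one word; the function names and the example strings are hypothetical.

```python
def brackets(words):
    """Turn a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def precision_recall_f(proposed_words, gold_words):
    """Bracket-level precision, recall, and F-measure as defined on the slide."""
    proposed, gold = brackets(proposed_words), brackets(gold_words)
    right = len(proposed & gold)  # proposed brackets that exactly match the annotation
    p = right / len(proposed) if proposed else 0.0
    r = right / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Made-up example: the last annotated word is wrongly split in two.
gold = ["AB", "CD", "EFG"]
proposed = ["AB", "CD", "E", "FG"]
p, r, f = precision_recall_f(proposed, gold)
print(p, r, f)
```

In this example two of the four proposed brackets match, so precision is 2/4, recall is 2/3, and the F-measure is 4/7.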

