Resource Acquisition for Syntax-based MT from Parsed Parallel data


Alon Lavie, Alok Parlikar and Vamshi Ambati
Language Technologies Institute
Carnegie Mellon University
SSST-2, June 20, 2008

Research Goals
• Long-term research agenda (since 2000) focused on developing a unified framework for MT that addresses the fundamental weaknesses of previous approaches:
– Representation: explore richer formalisms that can capture complex divergences between languages
– Ability to handle morphologically complex languages
– Methods for automatically acquiring MT resources from available data and combining them with manual resources
– Ability to address both resource-rich and resource-poor scenarios
• Main research funding sources: NSF (AVENUE and LETRAS projects) and DARPA (GALE)

CMU Statistical Transfer (Stat-XFER) MT Approach
• Integrate the major strengths of rule-based and statistical MT within a common framework:
– Linguistically rich formalism that can express complex and abstract compositional transfer rules
– Rules can be written by human experts and also acquired automatically from data
– Easy integration of morphological analyzers and generators
– Word and syntactic-phrase correspondences can be automatically acquired from parallel text
– Search-based decoding from statistical MT adapted to find the best translation within the search space: multi-feature scoring, beam search, parameter optimization, etc.
– Framework suitable for both resource-rich and resource-poor language scenarios

Stat-XFER MT Systems
• General Stat-XFER framework under development for the past seven years
• Systems so far:
– Chinese-to-English
– Hebrew-to-English
– Urdu-to-English
– German-to-English
– French-to-English
– Hindi-to-English
– Dutch-to-English
– Mapudungun-to-Spanish
• In progress or planned:
– Arabic-to-English
– Brazilian Portuguese-to-English
– Inupiaq-to-English
– Hebrew-to-Arabic
– Quechua-to-Spanish
– Turkish-to-English

Stat-XFER Framework
[Diagram. Components: Source Input, Preprocessing, Morphology, Transfer Engine, Transfer Rules, Bilingual Lexicon, Translation Lattice, Second-Stage Decoder, Language Model, Weighted Features, Target Output]

[Diagram: worked Hebrew-to-English example. Components: Source Input, Preprocessing, Morphology, Transfer Engine, Transfer Rules, Translation Lexicon, Translation Output Lattice, Decoder, Language Model + Additional Features, English Output]

Source Input: בשורה הבאה

Transfer Rules:
{NP1,3}
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
(
(X3::Y1)
(X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1)
)

Translation Lexicon:
N::N |: ["$WR"] -> ["BULL"]
((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "BULL"))
N::N |: ["$WRH"] -> ["LINE"]
((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "LINE"))

Translation Output Lattice:
(0 1 "IN" @PREP)
(1 1 "THE" @DET)
(2 2 "LINE" @N)
(1 2 "THE LINE" @NP)
(0 2 "IN LINE" @PP)
(0 4 "IN THE NEXT LINE" @PP)

English Output: in the next line

Transfer Rule Formalism
A transfer rule consists of:
• Type information
• Part-of-speech/constituent information
• Alignments
• x-side constraints
• y-side constraints
• xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))

;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1)
(X1::Y3)
(X2::Y4)
(X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)

MT Resource Acquisition in Resource-rich Scenarios
• Scenario: significant amounts of sentence-level parallel text are available
– Parallel sentences can be word-aligned and parsed (at least on one side, ideally on both sides)
• Goal: acquire both broad-coverage translation lexicons and transfer rule grammars automatically from the data
• Syntax-based translation lexicons:
– Broad-coverage constituent-level translation equivalents at all levels of syntactic granularity
– Can serve as the elementary building blocks for transfer trees constructed at runtime using the transfer rules

Acquisition Process
• Automatic process for extracting syntax-driven rules and lexicons from sentence-parallel data:
1. Word-align the parallel corpus (GIZA++)
2. Parse the sentences independently for both languages
3. Run our new PFA Constituent Aligner over the parsed sentence pairs
4. Extract all aligned constituents from the parallel trees
5. Extract all derived synchronous transfer rules from the constituent-aligned parallel trees
6. Construct a “data-base” of all extracted parallel constituents and synchronous rules with their frequencies, and model them statistically (assign them maximum-likelihood probabilities)

PFA Constituent Node Aligner
• Input: a bilingual pair of parsed and word-aligned sentences
• Goal: find all sub-sentential constituent alignments between the two trees that are translation equivalents of each other
• Equivalence constraint: a pair of constituents <S,T> is considered translation equivalent if:
– All words in the yield of <S> are aligned only to words in the yield of <T> (and vice versa)
– If <S> has a sub-constituent <S1> that is aligned to <T1>, then <T1> must be a sub-constituent of <T> (and vice versa)
• The algorithm is a bottom-up process starting from the word level, marking nodes that satisfy the constraints

PFA Node Alignment Algorithm
Each node stores a value. All nodes are initialized with the value 1. Each word-to-word alignment is assigned a unique prime number.

For every word-to-word alignment, we do the following:
• Let p be the unique prime value assigned to the alignment.
• Let ws and wt be the aligned words on the source and target side.
• Assign the value p to the POS nodes corresponding to the words ws and wt.
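The initialization steps above can be sketched in Python. The preview ends before the remaining steps of the algorithm, so everything past the prime assignment is an assumption made for illustration: the sketch multiplies primes when a word takes part in several alignments, sets each internal node's value to the product of its children's values, and treats two nodes with the same value (other than 1) as aligned. Under those assumptions, equal values mean that both yields are covered by exactly the same set of word alignments, which matches the first equivalence constraint. The `Node` class, the function names, and the toy trees are hypothetical and do not come from the slides.

```python
from math import prod


class Node:
    """Parse-tree node; `word` is set only on POS (pre-terminal) nodes."""

    def __init__(self, label, children=None, word=None):
        self.label = label
        self.children = children or []
        self.word = word
        self.value = 1  # every node is initialized with the value 1


def primes():
    """Yield 2, 3, 5, ... by trial division (adequate for sentence-sized input)."""
    found = []
    n = 2
    while True:
        if all(n % p for p in found):
            found.append(n)
            yield n
        n += 1


def pos_nodes(tree):
    """Return the POS nodes of a tree in left-to-right order."""
    if tree.word is not None:
        return [tree]
    return [leaf for child in tree.children for leaf in pos_nodes(child)]


def all_nodes(tree):
    """Return every node of a tree in pre-order."""
    return [tree] + [n for child in tree.children for n in all_nodes(child)]


def assign_primes(src_tree, tgt_tree, word_alignments):
    """Give each word alignment a unique prime and mark both POS nodes with it.

    ASSUMPTION: a word with several alignments accumulates the product
    of their primes.
    """
    src_leaves, tgt_leaves = pos_nodes(src_tree), pos_nodes(tgt_tree)
    for (i, j), p in zip(word_alignments, primes()):
        src_leaves[i].value *= p
        tgt_leaves[j].value *= p


def propagate(node):
    """ASSUMPTION: an internal node's value is the product of its children's."""
    if node.children:
        node.value = prod(propagate(child) for child in node.children)
    return node.value


def align_nodes(src_tree, tgt_tree):
    """ASSUMPTION: nodes with equal values other than 1 are aligned."""
    return [
        (s.label, t.label)
        for s in all_nodes(src_tree)
        for t in all_nodes(tgt_tree)
        if s.value == t.value and s.value != 1
    ]


# Hypothetical toy sentence pair; target words are placeholders.
src = Node("S", [
    Node("NP", [Node("DET", word="the"), Node("N", word="dog")]),
    Node("VP", [Node("V", word="barks")]),
])
tgt = Node("S", [
    Node("VP", [Node("V", word="t_barks")]),
    Node("NP", [Node("N", word="t_dog")]),
])
# (source index, target index): the/t_dog, dog/t_dog, barks/t_barks
alignments = [(0, 1), (1, 1), (2, 0)]

assign_primes(src, tgt, alignments)
propagate(src)
propagate(tgt)
pairs = align_nodes(src, tgt)
print(pairs)
```

In this toy pair the source NP and the target NP end up with the same value because both cover the first two alignments, even though the constituents appear in a different order on the two sides. The source DET and N nodes stay unaligned individually, since neither one alone accounts for everything the target word is aligned to.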

