Resource Acquisition for Syntax-based MT from Parsed Parallel data




Alon Lavie, Alok Parlikar and Vamshi Ambati
Language Technologies Institute, Carnegie Mellon University
SSST-2, June 20, 2008

Research Goals
- Long-term research agenda (since 2000) focused on developing a unified framework for MT that addresses the core weaknesses of previous approaches:
  - Representation: explore richer formalisms that can capture complex divergences between languages
  - Ability to handle morphologically complex languages
  - Methods for automatically acquiring MT resources from available data and combining them with manual resources
  - Ability to address both resource-rich and resource-poor scenarios
- Main research funding sources: NSF (AVENUE and LETRAS projects) and DARPA (GALE)

CMU Statistical Transfer (Stat-XFER) MT Approach
- Integrate the major strengths of rule-based and statistical MT within a common framework:
  - Linguistically rich formalism that can express complex and abstract compositional transfer rules
  - Rules can be written by human experts and also acquired automatically from data
  - Easy integration of morphological analyzers and generators
  - Word and syntactic-phrase correspondences can be automatically acquired from parallel text
  - Search-based decoding from statistical MT, adapted to find the best translation within the search space: multi-feature scoring, beam search, parameter optimization, etc.
  - Framework suitable for both resource-rich and resource-poor language scenarios

Stat-XFER MT Systems
- General Stat-XFER framework under development for the past seven years
- Systems so far: Chinese-to-English, Hebrew-to-English, Urdu-to-English, German-to-English, French-to-English, Hindi-to-English, Dutch-to-English, Mapudungun-to-Spanish
- In progress or planned: Arabic-to-English, Brazilian Portuguese-to-English, Inupiaq-to-English, Hebrew-to-Arabic, Quechua-to-Spanish, Turkish-to-English

Stat-XFER Framework
[Figure: pipeline diagram. Components: Source Input, Preprocessing, Morphology, Transfer Rules, Bilingual Lexicon, Transfer Engine, Translation Lattice, Language Model, Weighted Features, Second-Stage Decoder, Target Output.]

[Figure: worked example of the same pipeline. A sample NP transfer rule with agreement constraints on definiteness, status, number and gender; translation lexicon entries N -> N for WR -> BULL and WRH -> LINE; translation output lattice entries (0 1 IN PREP), (1 1 THE DET), (2 2 LINE N), (1 2 THE LINE NP), (0 2 IN LINE PP), (0 4 IN THE NEXT LINE PP); English output: "in the next line".]

Transfer Rule Formalism
Example rule:
  SL: the old man
  TL: ha ish ha zaqen
- Type information: NP -> NP
- Part-of-speech/constituent information: DET ADJ N -> DET N DET ADJ
- Alignments: X1-Y1, X1-Y3, X2-Y4, X3-Y2
- x-side constraints: X1 AGR = 3-SING; X1 DEF = DEF; X3 AGR = 3-SING; X3 COUNT
- y-side constraints: Y1 DEF = DEF; Y3 DEF = DEF; Y2 AGR = 3-SING; Y2 GENDER = Y4 GENDER
- xy-constraints, e.g. Y1 AGR = X1 AGR

MT Resource Acquisition in Resource-rich Scenarios
- Scenario: significant amounts of parallel text at sentence level are available
  - Parallel sentences can be word-aligned and parsed (at least on one side, ideally on both sides)
- Goal: acquire both broad-coverage translation lexicons and transfer rule grammars automatically from the data
- Syntax-based translation lexicons:
  - Broad-coverage constituent-level translation equivalents at all levels of syntactic granularity
  - Can serve as the elementary building blocks for transfer trees constructed at runtime using the transfer rules

Acquisition Process
Automatic process for extracting syntax-driven rules and lexicons from sentence-parallel data:
1. Word-align the parallel corpus (GIZA++)
2. Parse the sentences independently for both languages
3. Run our new PFA Constituent Aligner over the parsed sentence pairs
4. Extract all aligned constituents from the parallel trees
5. Extract all derived synchronous transfer rules from the constituent-aligned parallel trees
6. Construct a database of all extracted parallel constituents and synchronous rules with their frequencies, and model them statistically (assign them maximum-likelihood probabilities)

PFA Constituent Node Aligner
- Input: a bilingual pair of parsed and word-aligned sentences
- Goal: find all sub-sentential constituent alignments between the two trees that are translation equivalents of each other
- Equivalence constraint: a pair of constituents (S, T) are considered translation equivalents if:
  - All words in the yield of S are aligned only to words in the yield of T (and vice versa)
  - If S has a sub-constituent S1 that is aligned to T1, then T1 must be a sub-constituent of T (and vice versa)
- The algorithm is a bottom-up process starting from the word level, marking nodes that satisfy the constraints

PFA Node Alignment Algorithm
- Each node stores a value. All nodes are initialized with the value 1.
- Each word-to-word alignment is assigned a unique prime number.
- For every word-to-word alignment:
  - Let p be the unique prime value assigned to the alignment
  - Let ws and wt be the aligned words on the source and target side
  - Assign the value p to the POS nodes corresponding to the words ws and wt
  - Example: "Australia" gets value 2, "is" gets value 3
- One-to-many alignments are treated as multiple one-to-one alignments, and all of these alignments are given the same prime value.
  - Example: "North Korea" is just one word on the Chinese side. That word is assigned the value 25, which is the product 5 x 5.
- Once all the lexical items have values, propagate the values up the tree:
  - Work bottom-up
  - A node updates its value to the product of the values of its children
  - Values can become large
- Once all nodes have values, they can be aligned as follows:
  - If a node on the Chinese side has the same value as a node on the English side, align them
  - If two nodes on one side have equal values, take the node at the lowest level in the tree
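
The node alignment steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names Node, primes, pos_nodes, assign_primes, propagate, lowest_by_value and pfa_align are hypothetical, the toy tree shapes and the zh_* placeholder tokens are made up for the example, and skipping nodes whose value is still 1 (nodes that cover only unaligned words) is an assumption the slides do not state.

```python
from itertools import count
from math import prod


class Node:
    """Parse-tree node; `word` is set only on POS (pre-terminal) nodes."""
    def __init__(self, label, children=(), word=None):
        self.label, self.children, self.word = label, list(children), word
        self.value = 1  # all nodes are initialized with the value 1


def primes():
    """Yield 2, 3, 5, ... by trial division (enough for one sentence)."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n


def pos_nodes(node):
    """Return the POS nodes of a tree, left to right."""
    if node.word is not None:
        return [node]
    return [leaf for child in node.children for leaf in pos_nodes(child)]


def assign_primes(src_leaves, tgt_leaves, links):
    """Multiply one prime per link into both POS nodes of that link.
    Links that share a word reuse the same prime (one-to-many case)."""
    parent = {}

    def find(x):  # tiny union-find that groups links sharing a word
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x

    for i, j in links:
        parent[find(("s", i))] = find(("t", j))
    prime_of, gen = {}, primes()
    for i, j in links:
        root = find(("s", i))
        if root not in prime_of:
            prime_of[root] = next(gen)
        src_leaves[i].value *= prime_of[root]
        tgt_leaves[j].value *= prime_of[root]


def propagate(node):
    """Bottom-up: an internal node's value is the product of its children's."""
    if node.children:
        node.value = prod(propagate(child) for child in node.children)
    return node.value


def lowest_by_value(tree):
    """Map each value > 1 to the first node reached in post-order, so a
    child always wins over a parent that carries the same value."""
    table = {}

    def visit(node):
        for child in node.children:
            visit(child)
        if node.value > 1:  # assumption: value 1 means nothing to align
            table.setdefault(node.value, node)

    visit(tree)
    return table


def pfa_align(src_tree, tgt_tree, links):
    """Return (source node, target node) pairs whose values are equal."""
    assign_primes(pos_nodes(src_tree), pos_nodes(tgt_tree), links)
    propagate(src_tree)
    propagate(tgt_tree)
    src, tgt = lowest_by_value(src_tree), lowest_by_value(tgt_tree)
    return [(src[v], tgt[v]) for v in src if v in tgt]


# Toy input: tree shapes and the zh_* placeholder tokens are illustrative only.
english = Node("S", [
    Node("NP", [Node("NNP", word="Australia")]),
    Node("VP", [Node("VBZ", word="is"),
                Node("NP", [Node("NNP", word="North"),
                            Node("NNP", word="Korea")])]),
])
chinese = Node("S", [
    Node("NP", [Node("NR", word="zh_australia")]),
    Node("VP", [Node("VC", word="zh_is"),
                Node("NP", [Node("NR", word="zh_north_korea")])]),
])
links = [(0, 0), (1, 1), (2, 2), (3, 2)]  # (English index, Chinese index)
pairs = pfa_align(english, chinese, links)
for s, t in pairs:
    print(s.label, s.value, "<->", t.label, t.value)
```

In this toy run, "North" and "Korea" each receive the prime 5 and the single Chinese word receives 5 x 5 = 25, so the English NP over "North Korea" (value 25) matches the Chinese node for that word, while neither English word matches anything on its own. Python integers have arbitrary precision, which sidesteps the remark that values can become large; a fixed-width integer type would need a different representation, such as a sorted list of prime factors.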

