Machine Translation
Discriminative Word Alignment

Stephan Vogel
Spring Semester 2011

Generative Alignment Models

- Generative word alignment models: model $P(f, a \mid e)$; the translation probability sums over all alignments, $P(f \mid e) = \sum_a P(f, a \mid e)$
  - The alignment $a$ is a hidden variable
  - The actual word alignment is not observed
- Well-known examples: IBM models 1-5, HMM, ITG
- They model lexical association, distortion, fertility
- It is difficult to incorporate additional information:
  - POS of words (used in the distortion model, not as direct link features)
  - Manual dictionary
  - Syntax information
  - …

Discriminative Word Alignment

- Model the alignment directly: $p(a \mid f, e)$
- Find the alignment that maximizes $p(a \mid f, e)$
- Well-suited framework: maximum entropy
  - Set of feature functions $h_m(a, f, e)$, $m = 1, \dots, M$
  - Set of model parameters (feature weights) $c_m$, $m = 1, \dots, M$
- Decision rule: $\hat{a} = \operatorname{argmax}_a \; p(a \mid f, e) = \operatorname{argmax}_a \sum_{m=1}^{M} c_m h_m(a, f, e)$

Tasks

- Modeling: design feature functions which capture cross-lingual divergences
- Search: find the alignment with the highest probability
- Training: find the optimal feature weights
  - Minimize alignment errors given some gold-standard alignments
  - Notice: the alignments are no longer hidden! This is supervised training, i.e. we evaluate against a gold standard
- Notice: feature functions may themselves result from some training procedure
  - E.g. use a statistical dictionary resulting from an IBM-model alignment, trained on a large corpus
  - Here we now have an additional training step, on a small (hand-aligned) corpus
  - (Similar to MERT for the decoder)

2005 – Year of DWA

- Yang Liu, Qun Liu, and Shouxun Lin. 2005. Log-linear Models for Word Alignment.
- Abraham Ittycheriah and Salim Roukos. 2005. A Maximum Entropy Word Aligner for Arabic-English Machine Translation.
- Ben Taskar, Simon Lacoste-Julien, and Dan Klein. 2005. A Discriminative Matching Approach to Word Alignment.
- Robert C. Moore. 2005. A Discriminative Framework for Bilingual Word Alignment.
- Necip Fazil Ayan, Bonnie J. Dorr, and Christof Monz. 2005. NeurAlign: Combining Word Alignments Using Neural Networks.

Yang Liu et al. 2005

- Start out with features used in generative alignment
- Lexicons, e.g. IBM1
  - Use both directions, $p(f_j \mid e_i)$ and $p(e_i \mid f_j)$, and/or a symmetric model
- Fertility model: $p(\phi_i \mid e_i)$

More Features

- Cross count: number of crossings in the alignment
- Neighbor count: number of links in the immediate neighborhood
- Exact match: number of src/tgt pairs where src = tgt
- Linked word count: total number of links (to influence alignment density)
- Link types: how many 1-1, 1-m, m-1, n-m alignments
- Sibling distance: if a word is aligned to multiple words, add the distance between these aligned words
- Link co-occurrence count: given multiple alignments (e.g. Viterbi alignments from the IBM models), count how often links co-occur

Search

- Greedy search based on the gain from adding a link
- For each of the features the gain can be calculated, e.g. for the IBM1 lexicon feature
- Algorithm (a runnable sketch follows below):
  - Start with the empty alignment
  - Loop until there is no additional gain:
    - Loop over all (j, i) not in the set: if gain(j, i) > best_gain, store as (j′, i′)
    - Set link (j′, i′)
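The loop above can be written down directly. Below is a minimal Python sketch, not the authors' implementation: it assumes the log-linear score from the decision rule earlier (feature functions h and weights c), and it recomputes the score from scratch inside gain() for clarity, whereas an efficient aligner would compute per-feature gains incrementally.

    # Minimal sketch of the greedy link-addition search described above.
    # Assumption (not from the slides): the score is a plain weighted
    # feature sum, recomputed from scratch for every candidate link.

    def score(links, f_words, e_words, features, weights):
        """Log-linear score: sum_m c_m * h_m(a, f, e)."""
        return sum(w * h(links, f_words, e_words)
                   for h, w in zip(features, weights))

    def gain(links, j, i, f_words, e_words, features, weights):
        """Score change from adding link (j, i) to the current alignment."""
        return (score(links | {(j, i)}, f_words, e_words, features, weights)
                - score(links, f_words, e_words, features, weights))

    def greedy_align(f_words, e_words, features, weights):
        links = set()                          # start with the empty alignment
        while True:
            best_gain, best_link = 0.0, None
            for j in range(len(f_words)):      # loop over all (j, i) ...
                for i in range(len(e_words)):  # ... not yet in the set
                    if (j, i) in links:
                        continue
                    g = gain(links, j, i, f_words, e_words, features, weights)
                    if g > best_gain:
                        best_gain, best_link = g, (j, i)
            if best_link is None:              # no additional gain: stop
                return links
            links.add(best_link)               # set link (j', i')

For example, the "linked word count" feature from the previous slide is simply `lambda links, f, e: len(links)`; with a negative weight it keeps the alignment from becoming too dense.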
Moore 2005

- Log-likelihood-based model
  - Measures word association strength
  - Values can get large
- Conditional-link-probability-based model
  - Estimated probability of two words being linked
  - Uses a simpler alignment model to establish the links
  - Adds simple smoothing
- Additional features: one-to-one, one-to-many, non-monotonicity

Training

- Finding the optimal alignment is non-trivial
  - Adding a link can affect the non-monotonicity and one-to-many features
  - Dynamic programming does not work
  - Beam search could be used; requires pruning
- Parameter optimization
  - Modified version of averaged perceptron learning: each weight is updated in proportion to $h_i(a_{\mathrm{ref}}, f, e) - h_i(a, f, e)$, the feature difference between the reference alignment and the model's alignment

Modeling Alignment with CRF

- A CRF is an undirected graphical model
  - Each vertex (node) represents a random variable whose distribution is to be inferred
  - Each edge represents a dependency between two random variables
  - The distribution of each discrete random variable $Y$ in the graph is conditioned on an input sequence $X$
  - Cliques: sets of fully connected nodes in the graph
- In our case:
  - Features derived from the source and target words are the input sequence $X$
  - Alignment links are the random variables $Y$
- Different ways to model alignment:
  - Blunsom & Cohn (2006): many-to-one word alignments, where each source word is aligned with zero or one target words (→ asymmetric)
  - Niehues & Vogel (2008): model not a sequence but the entire alignment matrix (→ symmetric)

Modeling Alignment Matrix

- Random variables $y_{ji}$ for all possible alignment links
- Two values, 0/1: the word in position j is not linked/linked to the word in position i
- Represented as nodes in a graph

Modeling Alignment Matrix (cont.)

- Factored nodes $x$ representing features (observables)
- Linked to the random variables
- Define a potential for each $y_{ji}$

Probability of Alignment

$$p(y \mid x) = \frac{1}{Z(x)} \prod_{c \in F} \Phi_c(V_c) = \frac{1}{Z(x)} \exp\Big( \sum_{c \in F} \lambda_c \cdot f_c(V_c) \Big)$$

where
- $F$ — the set of factored nodes
- $V_c$ — a set of connected nodes (a clique)
- $f_c(V_c)$ — a feature vector, $\lambda_c$ — a weight vector
- $\Phi_c(V_c) = \exp(\lambda_c \cdot f_c(V_c))$ — the potential function
- $Z(x)$ — the normalization constant (partition function)

A toy computation of this probability is sketched below.
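To make the formula concrete, here is a brute-force sketch, purely illustrative and not the Niehues & Vogel implementation: it assumes only local one-link cliques (each factored node touching a single $y_{ji}$), a single invented weight, and invented feature values. Enumerating all $2^{J \cdot I}$ matrices is exponential, which is why real models need approximate inference; J = I = 2 keeps this toy tractable.

    # Brute-force CRF probability over a tiny alignment matrix,
    # assuming one local clique per cell (j, i). All numbers invented.
    import itertools
    import math

    J, I = 2, 2                      # toy source / target lengths
    lam = 1.5                        # weight (a single scalar here)
    # invented local feature value f_c for y_ji = 1, e.g. a lexical score
    feat = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}

    def unnorm(y):
        """exp( sum_c lambda_c * f_c(V_c) ) for one matrix assignment y."""
        return math.exp(sum(lam * feat[ji] for ji, v in y.items() if v == 1))

    # all 2^(J*I) assignments of the binary variables y_ji
    cells = list(itertools.product(range(J), range(I)))
    matrices = [dict(zip(cells, bits))
                for bits in itertools.product([0, 1], repeat=len(cells))]

    Z = sum(unnorm(y) for y in matrices)      # partition function Z(x)
    best = max(matrices, key=unnorm)          # most probable matrix
    print("p(best | x) =", unnorm(best) / Z)
    print("best links:", [ji for ji, v in best.items() if v == 1])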
Features

- Local features, e.g. lexical, POS, … (sketched below)
- Fertility features
- First-order features: capturing the relation between links
- Phrase features: interaction between word and phrase alignment

Local Features

- Local information about the link probability
- Features derived from positions j and i only
- Factored node connected to only one random variable
- Features: lexical probabilities, also …
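As an illustration of such local features, a minimal sketch; all names here (the lexicon dictionaries, the POS lists, the 1e-9 floor) are assumptions for the example, not taken from the slides.

    # Sketch of local feature functions for one link (j, i), assuming
    # precomputed IBM1-style lexicons as dict-of-dict lookups and POS tags.
    import math

    def local_features(j, i, f_words, e_words, p_f_given_e, p_e_given_f,
                       f_pos, e_pos):
        """Features for a single link (j, i): depend on j and i only."""
        f_w, e_w = f_words[j], e_words[i]
        return {
            "lex_f_given_e": math.log(p_f_given_e.get(e_w, {}).get(f_w, 1e-9)),
            "lex_e_given_f": math.log(p_e_given_f.get(f_w, {}).get(e_w, 1e-9)),
            "exact_match":   1.0 if f_w == e_w else 0.0,   # src = tgt
            "pos_match":     1.0 if f_pos[j] == e_pos[i] else 0.0,
        }

Because these features look at a single cell of the matrix, each corresponding factored node connects to exactly one random variable $y_{ji}$, as the slide states.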