Machine Translation: Decoder for Phrase-Based SMT
Stephan Vogel
Spring Semester 2011

Contents:
- Decoder
- Recombination of hypotheses (with examples)
- Pruning: which hypotheses to compare, how many to keep
- Additive beam, multiplicative beam
- Pruning and optimization
- Efficiency: naïve way, early termination, 'cube' pruning
- Effect of recombination and pruning; number of hypotheses versus NIST
- N-best list generation: storing multiple backpointers, calculating the true score, problems with n-best generation
- Rest-cost estimation: translation models, language models, distance-based DM, lexicalized DM; effect of rest-cost estimation
- Summary

Decoder
- Decoding issues (previous session)
- Two-step decoding:
  - Generation of the translation lattice
  - Best-path search, with limited word reordering
- Specific issues:
  - Recombination of hypotheses
  - Pruning
  - N-best list generation
  - Future cost estimation

Recombination of Hypotheses
- Recombination: of two hypotheses, keep only the better one if no future information can switch their current ranking
- Notice: this depends on the models
  - The model score may depend only on the current partial translation and the extension, e.g. the LM
  - The model score may depend on global features known only at the sentence end, e.g.
the sentence length model
- The models define equivalence classes for the hypotheses
- Expand only the best hypothesis in each equivalence class

Recombination of Hypotheses: Example
- n-gram LM
- Hypotheses:
  - H1: I would like to go
  - H2: I would not like to go
- Assume as possible expansions: to the movies | to the cinema | and watch a film
- The LM score of each expansion is identical for H1 and H2 under bi-, tri-, and four-gram LMs
- E.g. the 3-gram LM score of Expansion 1 is: -log p( to | to go ) - log p( the | go to ) - log p( movies | to the )
- Therefore: Cost(H1) < Cost(H2) => Cost(H1+E) < Cost(H2+E) for all possible expansions E

Recombination of Hypotheses: Example 2
- Sentence length model p( I | J )
- Hypotheses:
  - H1: I would like to go
  - H2: I would not like to go
- Assume as possible expansions: to the movies | to the cinema | and watch a film
- Length(H1) = 5, Length(H2) = 6; for identical expansions the lengths will remain different
- Situation at the sentence end:
  - It is possible that -log P( len(H1 + E) | J ) > -log P( len(H2 + E) | J )
  - Then it is possible that TotalCost(H1 + E) > TotalCost(H2 + E), i.e.
a reranking of the hypotheses
- Therefore: H2 cannot be recombined into H1

Recombination: Keep 'em Around
- Expand only the best hypothesis
- Store pointers to the recombined hypotheses for n-best list generation
- [Figure: the best hypothesis h_b keeps pointers to recombined hypotheses h_r; cost improves upward ("better"), coverage increases to the right]

Recombination of Hypotheses
- Typical features for recombination of partial hypotheses:
  - LM history
  - Positions of the covered source words - some translations are more expensive
  - Number of generated words on the target side - for the sentence length model
- Often only the number of covered source words is considered, rather than the actual positions
  - This fits the typical organization of the decoder: hypotheses are stored according to the number of covered source words
  - Hypotheses are then recombined which are not strictly comparable; use the future cost estimate to lessen the impact
- Overall: a trade-off between speed and 'correctness' of the search
  - Ideally: only compare (and recombine) hypotheses if all models used in the search see them as equivalent
  - Realistically: use fewer, coarser equivalence classes by 'forgetting' some of the models (they still add to the scores)

Pruning
- Even after recombination there are too many hypotheses; remove the bad ones and keep only the best
- In recombination we compared hypotheses which are equivalent under the models
- Now we need to compare hypotheses which are not strictly equivalent under the models
- We risk removing hypotheses which would have won the race in the long run, i.e.
we introduce errors into the search
- Search errors vs. model errors:
  - Model errors: our models give higher probability to a worse translation
  - Search errors: our decoder loses translations with higher probability

Pruning: Which Hyps to Compare?
- Which hypotheses are we comparing?
- How many should we keep?
- [Figure: recombination merges equivalent hypotheses; pruning then discards low-scoring ones]

Pruning: Which Hyps to Compare?
- A coarser equivalence relation => drop at least one of the models, or replace it by a simpler one, e.g.:
  - Recombination according to the translated positions and LM state; pruning according to the number of translated positions and LM state
  - Recombination according to the number of translated positions and LM state; pruning according to the number of translated positions OR the LM state
  - Recombination with a 5-gram LM; pruning with a 3-gram LM
- Question: which is the more important feature?
  - Which leads to more search errors?
  - How much loss in translation quality? (Quality is more important than speed in most applications!)
- There is not one correct answer - it depends on the other components of the system
- Ideally, the decoder allows different recombination and pruning settings

How Many Hyps to Keep?
- Beam search: keep hypothesis h if Cost(h) < Cost(h_best) + const
- If the models separate the alternatives well -> keep few hypotheses
- If the models do not separate the alternatives -> keep many hypotheses
- [Figure: cost over the number of translated words; bad hypotheses outside the beam are pruned]

Additive Beam
- Is an additive constant (in the log domain) the right thing to do?
- Hypotheses may spread more and more
- [Figure: cost over the number of translated words; fewer and fewer hypotheses fall inside a constant-width beam]

Multiplicative Beam
- Beam search: keep hypothesis h if Cost(h) < Cost(h_best) * const
- [Figure: cost over the number of translated words; the beam opens up and covers more hypotheses]

Pruning and Optimization
- Each feature has a feature weight; optimization works by adjusting the feature weights
- This can result in compressing or spreading the scores
- This actually happened in our first MERT implementation: higher and higher
feature weights
  => hypotheses spreading further and further apart
  => fewer and fewer hypotheses inside the beam
  => a lower and lower BLEU score
- Two-pronged repair:
  - Normalizing the feature weights
  - Not proper beam pruning, but restricting ...
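The recombination scheme described in these slides (keep only the best hypothesis in each equivalence class, but store pointers to the recombined ones for later n-best extraction) can be sketched as follows. This is a minimal illustration, not the actual decoder: the `Hypothesis` fields and the choice of (coverage, LM history) as the equivalence key are assumptions for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    coverage: frozenset   # positions of covered source words
    lm_history: tuple     # last n-1 target words (n-gram LM state)
    cost: float           # accumulated negative log-probability
    recombined: list = field(default_factory=list)  # losers, kept for n-best

def recombine(stack):
    """Keep only the cheapest hypothesis per equivalence class.

    Two hypotheses are treated as equivalent when no future model
    score can switch their ranking: same covered source positions
    and same LM state.
    """
    best = {}
    for hyp in stack:
        key = (hyp.coverage, hyp.lm_history)
        if key not in best:
            best[key] = hyp
        else:
            winner, loser = sorted((best[key], hyp), key=lambda h: h.cost)
            winner.recombined.append(loser)  # pointer for n-best generation
            best[key] = winner
    return list(best.values())
```

Note that only the winner is expanded further; the `recombined` pointers are what make n-best list generation possible later, as the slide "Recombination: Keep 'em Around" indicates.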
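The additive and multiplicative beam criteria from the slides above can be contrasted in a short sketch. The function names and the example costs are made up for illustration; only the two inequalities come from the slides.

```python
def prune_additive(costs, beam):
    """Additive beam: keep h if Cost(h) < Cost(h_best) + beam.

    The beam has constant width in the log domain, so as scores
    spread apart, fewer and fewer hypotheses survive.
    """
    best = min(costs)
    return [c for c in costs if c < best + beam]

def prune_multiplicative(costs, beam):
    """Multiplicative beam: keep h if Cost(h) < Cost(h_best) * beam.

    The beam opens up as costs grow, covering more hypotheses.
    """
    best = min(costs)
    return [c for c in costs if c < best * beam]

costs = [10.0, 11.5, 14.0, 25.0]
print(prune_additive(costs, 2.0))        # → [10.0, 11.5]
print(prune_multiplicative(costs, 1.5))  # → [10.0, 11.5, 14.0]
```

The contrast illustrates the MERT failure mode the slides describe: when growing feature weights spread the costs apart, a fixed additive beam keeps ever fewer hypotheses, while a multiplicative beam widens with the scores.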