Machine Translation: Decoder for Phrase-Based SMT
Stephan Vogel
Spring Semester 2011

Contents:
- Decoder
- Recombination of hypotheses (with examples)
- Pruning: which hypotheses to compare, how many to keep
- Additive beam, multiplicative beam
- Pruning and optimization
- Efficiency: naïve way, early termination, 'cube' pruning
- Effect of recombination and pruning; number of hypotheses versus NIST
- N-best list generation: storing multiple backpointers, calculating the true score, problems with n-best generation
- Rest-cost estimation: translation models, language models, distance-based DM, lexicalized DM; effect of rest-cost estimation
- Summary

Decoder
- Decoding issues (previous session)
- Two-step decoding:
  - Generation of the translation lattice
  - Best-path search, with limited word reordering
- Specific issues:
  - Recombination of hypotheses
  - Pruning
  - N-best list generation
  - Future cost estimation

Recombination of Hypotheses
- Recombination: of two hypotheses, keep only the better one if no future information can switch their current ranking
- Notice: this depends on the models
  - The model score may depend only on the current partial translation and the extension, e.g. the LM
  - The model score may depend on global features known only at the sentence end, e.g.
the sentence length model
- The models define equivalence classes for the hypotheses
- Expand only the best hypothesis in each equivalence class

Recombination of Hypotheses: Example
- n-gram LM
- Hypotheses:
  - H1: I would like to go
  - H2: I would not like to go
- Assume as possible expansions: to the movies | to the cinema | and watch a film
- The LM score of each expansion is identical for H1 and H2 under bi-, tri-, and four-gram LMs
- E.g. the 3-gram LM score of Expansion 1 is: -log p( to | to go ) - log p( the | go to ) - log p( movies | to the )
- Therefore: Cost(H1) < Cost(H2) => Cost(H1+E) < Cost(H2+E) for all possible expansions E

Recombination of Hypotheses: Example 2
- Sentence length model p( I | J )
- Hypotheses:
  - H1: I would like to go
  - H2: I would not like to go
- Assume as possible expansions: to the movies | to the cinema | and watch a film
- Length(H1) = 5, Length(H2) = 6; for identical expansions the lengths will remain different
- Situation at the sentence end:
  - It is possible that -log P( len(H1 + E) | J ) > -log P( len(H2 + E) | J )
  - Then it is possible that TotalCost(H1 + E) > TotalCost(H2 + E), i.e.
a reranking of the hypotheses
- Therefore: H2 cannot be recombined into H1

Recombination: Keep 'em Around
- Expand only the best hypothesis
- Store pointers to the recombined hypotheses for n-best list generation
- [Figure: the best hypothesis h_b keeps pointers to recombined hypotheses h_r; cost improves upward ("better"), coverage increases to the right]

Recombination of Hypotheses
- Typical features for recombination of partial hypotheses:
  - LM history
  - Positions of the covered source words - some translations are more expensive
  - Number of generated words on the target side - for the sentence length model
- Often only the number of covered source words is considered, rather than the actual positions
  - This fits the typical organization of the decoder: hypotheses are stored according to the number of covered source words
  - Hypotheses are then recombined which are not strictly comparable; use the future cost estimate to lessen the impact
- Overall: a trade-off between speed and 'correctness' of the search
  - Ideally: only compare (and recombine) hypotheses if all models used in the search see them as equivalent
  - Realistically: use fewer, coarser equivalence classes by 'forgetting' some of the models (they still add to the scores)

Pruning
- Even after recombination there are too many hypotheses; remove the bad ones and keep only the best
- In recombination we compared hypotheses which are equivalent under the models
- Now we need to compare hypotheses which are not strictly equivalent under the models
- We risk removing hypotheses which would have won the race in the long run, i.e.
we introduce errors into the search
- Search errors vs. model errors:
  - Model errors: our models give higher probability to a worse translation
  - Search errors: our decoder loses translations with higher probability

Pruning: Which Hyps to Compare?
- Which hypotheses are we comparing?
- How many should we keep?
- [Figure: recombination merges equivalent hypotheses; pruning then discards low-scoring ones]

Pruning: Which Hyps to Compare?
- A coarser equivalence relation => drop at least one of the models, or replace it by a simpler one, e.g.:
  - Recombination according to the translated positions and LM state; pruning according to the number of translated positions and LM state
  - Recombination according to the number of translated positions and LM state; pruning according to the number of translated positions OR the LM state
  - Recombination with a 5-gram LM; pruning with a 3-gram LM
- Question: which is the more important feature?
  - Which leads to more search errors?
  - How much loss in translation quality? (Quality is more important than speed in most applications!)
- There is not one correct answer - it depends on the other components of the system
- Ideally, the decoder allows different recombination and pruning settings

How Many Hyps to Keep?
- Beam search: keep hypothesis h if Cost(h) < Cost(h_best) + const
- If the models separate the alternatives well -> keep few hypotheses
- If the models do not separate the alternatives -> keep many hypotheses
- [Figure: cost over the number of translated words; bad hypotheses outside the beam are pruned]

Additive Beam
- Is an additive constant (in the log domain) the right thing to do?
- Hypotheses may spread more and more
- [Figure: cost over the number of translated words; fewer and fewer hypotheses fall inside a constant-width beam]

Multiplicative Beam
- Beam search: keep hypothesis h if Cost(h) < Cost(h_best) * const
- [Figure: cost over the number of translated words; the beam opens up and covers more hypotheses]

Pruning and Optimization
- Each feature has a feature weight; optimization works by adjusting the feature weights
- This can result in compressing or spreading the scores
- This actually happened in our first MERT implementation: higher and higher
feature weights
  => hypotheses spreading further and further apart
  => fewer and fewer hypotheses inside the beam
  => a lower and lower BLEU score
- Two-pronged repair:
  - Normalizing the feature weights
  - Not proper beam pruning, but restricting ...
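The recombination scheme described in these slides (keep only the best hypothesis in each equivalence class, but store pointers to the recombined ones for later n-best extraction) can be sketched as follows. This is a minimal illustration, not the actual decoder: the `Hypothesis` fields and the choice of (coverage, LM history) as the equivalence key are assumptions for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    coverage: frozenset   # positions of covered source words
    lm_history: tuple     # last n-1 target words (n-gram LM state)
    cost: float           # accumulated negative log-probability
    recombined: list = field(default_factory=list)  # losers, kept for n-best

def recombine(stack):
    """Keep only the cheapest hypothesis per equivalence class.

    Two hypotheses are treated as equivalent when no future model
    score can switch their ranking: same covered source positions
    and same LM state.
    """
    best = {}
    for hyp in stack:
        key = (hyp.coverage, hyp.lm_history)
        if key not in best:
            best[key] = hyp
        else:
            winner, loser = sorted((best[key], hyp), key=lambda h: h.cost)
            winner.recombined.append(loser)  # pointer for n-best generation
            best[key] = winner
    return list(best.values())
```

Note that only the winner is expanded further; the `recombined` pointers are what make n-best list generation possible later, as the slide "Recombination: Keep 'em Around" indicates.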
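The additive and multiplicative beam criteria from the slides above can be contrasted in a short sketch. The function names and the example costs are made up for illustration; only the two inequalities come from the slides.

```python
def prune_additive(costs, beam):
    """Additive beam: keep h if Cost(h) < Cost(h_best) + beam.

    The beam has constant width in the log domain, so as scores
    spread apart, fewer and fewer hypotheses survive.
    """
    best = min(costs)
    return [c for c in costs if c < best + beam]

def prune_multiplicative(costs, beam):
    """Multiplicative beam: keep h if Cost(h) < Cost(h_best) * beam.

    The beam opens up as costs grow, covering more hypotheses.
    """
    best = min(costs)
    return [c for c in costs if c < best * beam]

costs = [10.0, 11.5, 14.0, 25.0]
print(prune_additive(costs, 2.0))        # → [10.0, 11.5]
print(prune_multiplicative(costs, 1.5))  # → [10.0, 11.5, 14.0]
```

The contrast illustrates the MERT failure mode the slides describe: when growing feature weights spread the costs apart, a fixed additive beam keeps ever fewer hypotheses, while a multiplicative beam widens with the scores.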