Machine Translation: Minimum Error Rate Training
Stephan Vogel
Spring Semester 2011

Contents: Overview; Tuning the SMT System; Problems; Brute Force Approach – Manual Tuning; Automatic Tuning; Automatic Tuning on N-best Lists; Simplex (Nelder-Mead); Demo; Expansion and Contraction; Changing the Simplex; Powell Line Search; Minimum Error Training; Iterate Decoding - Optimization; Avoiding Local Minima; Random Restarts; Optimizing NOT Towards References; Optimizing Towards Different Metrics; Generalization to Other Test Sets; Large Weight = Important Feature?; Open Issues; Summary

Overview
- Optimization approaches
  - Simplex
  - Minimum error rate training (MER)
- Avoiding local minima
- Additional considerations
  - Tuning towards different metrics
  - Tuning on different development sets

Tuning the SMT System
- We use different models in the SMT system
  - The models make simplifications
  - They are trained on different amounts of data
  => The models have different levels of reliability, and their scores have different ranges
  => Give a different weight to each model:
     Q = c1*Q1 + c2*Q2 + … + cn*Qn
- Find optimal scaling factors (feature weights) c1 … cn
  - Optimal means: highest score for the chosen evaluation metric M,
    i.e. find (c1, …, cn) such that M(argmin_e Q(e, f)) is high
  - The metric M is our objective function

Problems
- The surface of the objective function is not nice
  - Not convex -> local minima (actually, many local minima)
  - Not differentiable -> gradient descent methods are not readily applicable
- There may be dangerous areas ('boundary cliffs') where a small change has a big effect
  - Example: tune on a dev set with short reference translations
  - Optimization leads towards short translations
  - A new test set has long reference translations
  - The translations are now too short -> length penalty

Brute Force Approach – Manual Tuning
- Decode with different scaling factors
  - Get a feeling for the range of good values
  - Get a feeling for the importance of the models
    - The LM is typically the most important
    - Sentence length (word count feature) balances the shortening effect of the LM
    - Word reordering is more or less effective depending on the language
  - Narrow down the range in which the scaling factors are tested
- Essentially multi-linear optimization
- Works well for a small number of models
- Time consuming (CPU-wise) if decoding takes a long time

Automatic Tuning
- Many algorithms for finding (near) optimal solutions are available
  - Simplex
  - Powell (line search)
  - MIRA (Margin Infused Relaxed Algorithm)
  - Specially designed minimum error training (Och 2003)
  - Genetic algorithms
- Note: the models themselves are not improved, only their combination
- Note: some parameters change the performance of the decoder but are not part of Q
  - Number of alternative translations
  - Beam size
  - Word reordering restrictions

Automatic Tuning on N-best Lists
- Optimization algorithms need many iterations – too expensive to run full translations
  => Use n-best lists, e.g. 1000 translations for each of 500 source sentences
- Changing the scaling factors results in re-ranking the n-best lists
- Evaluate the new 1-best translations
- Apply any of the standard optimization techniques
- Advantage: much faster
  - The counts (e.g. n-gram matches) can be pre-calculated for each translation to speed up evaluation (see the sketch below)
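To make the re-ranking step concrete, here is a minimal Python sketch. The data layout (per-sentence lists of hypotheses, each carrying a vector of model scores and a pre-computed error count) and all function and field names are illustrative assumptions, not prescribed by the slides; Q is treated as a cost to be minimized, matching the argmin formulation above.

```python
# Minimal sketch of n-best list re-ranking under new feature weights.
# Assumed (hypothetical) data layout: nbest_lists is a list over source
# sentences; each entry is a list of hypotheses, each a dict with
#   "scores": [Q1, ..., Qn]  - individual model scores
#   "error":  float          - pre-computed error count of this hypothesis

def total_score(hyp, weights):
    """Q = c1*Q1 + ... + cn*Qn (treated as a cost, lower is better)."""
    return sum(c * q for c, q in zip(weights, hyp["scores"]))

def rerank_1best(nbest_lists, weights):
    """Return the model-best hypothesis of each sentence under these weights."""
    return [min(hyps, key=lambda h: total_score(h, weights)) for hyps in nbest_lists]

def corpus_error(nbest_lists, weights):
    """Objective function: total error of the 1-best translations."""
    return sum(h["error"] for h in rerank_1best(nbest_lists, weights))
```

Here corpus_error plays the role of the objective f in the optimizers below: it is cheap to evaluate because full decoding is replaced by re-ranking a fixed n-best list.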
Simplex (Nelder-Mead)
- Start with n+1 random configurations
- Get the 1-best translation for each configuration -> objective function
- Sort the points xk according to the objective function:
  f(x1) < f(x2) < … < f(xn+1)
- Calculate x0 as the center of gravity of x1 … xn
- Replace the worst point with a point reflected through the centroid:
  xr = x0 + r * (x0 – xn+1)

Demo
- Obviously, we need to change the size of the simplex to enforce convergence
- We also want to adjust the step size
  - If the new point is the best point – increase the step size
  - If the new point is worse than x1 … xn – decrease the step size

Expansion and Contraction (the update rules are spelled out in the code sketch below)
- Reflection:
  - Calculate xr = x0 + r * (x0 – xn+1)
  - If f(x1) <= f(xr) < f(xn), replace xn+1 with xr; next iteration
- Expansion:
  - If the reflected point is better than the best, i.e. f(xr) < f(x1):
    - Calculate xe = x0 + e * (x0 – xn+1)
    - If f(xe) < f(xr), replace xn+1 with xe, else replace xn+1 with xr
    - Next iteration
  - Else contract
- Contraction:
  - The reflected point has f(xr) >= f(xn)
  - Calculate xc = xn+1 + c * (x0 – xn+1)
  - If f(xc) <= f(xn+1), replace xn+1 with xc, else shrink
- Shrinking:
  - For all xk, k = 2 … n+1: xk = x1 + s * (xk – x1)
  - Next iteration

Changing the Simplex
[Figure: the four simplex updates – reflection of xn+1 through the centroid x0, expansion further along the same direction, contraction towards x0, and shrinking of all points towards x1.]
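The update rules above translate directly into code. Below is a compact Python sketch under the same assumptions as before (f is the objective, e.g. corpus_error over a fixed n-best list); the coefficient values, iteration count, and random initialization are illustrative choices, not from the slides.

```python
import random

def nelder_mead(f, dim, iters=200, r=1.0, e=2.0, c=0.5, s=0.5):
    """Sketch of Nelder-Mead following the slides' update rules.
    f: objective to minimize over weight vectors; dim: number of weights;
    r, e, c, s: reflection, expansion, contraction, shrink coefficients."""
    # n+1 random starting configurations
    pts = [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim + 1)]
    for _ in range(iters):
        pts.sort(key=f)                       # f(x1) <= ... <= f(x_{n+1})
        best, worst = pts[0], pts[-1]
        # centroid x0 of the n best points
        x0 = [sum(p[i] for p in pts[:-1]) / dim for i in range(dim)]
        # reflection: xr = x0 + r * (x0 - x_{n+1})
        xr = [x0[i] + r * (x0[i] - worst[i]) for i in range(dim)]
        if f(best) <= f(xr) < f(pts[-2]):
            pts[-1] = xr                      # accept the reflected point
        elif f(xr) < f(best):
            # expansion: xe = x0 + e * (x0 - x_{n+1})
            xe = [x0[i] + e * (x0[i] - worst[i]) for i in range(dim)]
            pts[-1] = xe if f(xe) < f(xr) else xr
        else:
            # contraction: xc = x_{n+1} + c * (x0 - x_{n+1})
            xc = [worst[i] + c * (x0[i] - worst[i]) for i in range(dim)]
            if f(xc) <= f(worst):
                pts[-1] = xc
            else:
                # shrink all points towards x1
                pts = [best] + [[best[i] + s * (p[i] - best[i]) for i in range(dim)]
                                for p in pts[1:]]
    return min(pts, key=f)
```

A typical call would be nelder_mead(lambda w: corpus_error(nbest_lists, w), dim=n_features).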
Powell Line Search
- Select directions in the search space, then:
  - Loop until convergence
    - Loop over the directions d
      - Perform a line search along direction d until convergence
- Many variants
  - Selecting the directions
    - Easiest is to use the model scores (one direction per feature weight)
    - Or combine multiple scores
  - Step size in the line search
- MER (Och 2003) is line search along the models with a smart selection of steps

Minimum Error Training
- For each hypothesis we have the total model score
  Q = Σ_k ck*Qk
- Select one model k:
  Q = ck*Qk + Σ_{n≠k} cn*Qn = ck*Qk + QRest
- As a function of ck, the total model score of each hypothesis is a line: the individual model score Qk gives the slope, QRest the offset
[Figure: total model score of one hypothesis (WER = 8) plotted against ck; the line has slope Qk and offset QRest.]

Minimum Error Training (cont.)
- Source sentence 1
- Depending on the scaling factor ck, different hypotheses are in 1-best position
- Set ck so that the metric-best hypothesis is also model-best
[Figure: model-score lines over ck for hypotheses h11 (WER = 8), h12 (WER = 5), h13 (WER = 4); as ck increases, the model-best hypothesis changes from h11 to h12 to h13, i.e. the error of the 1-best translation steps through 8, 5, 4.]

Minimum Error Training (cont.)
- Select a minimum number of evaluation points (see the line-search sketch below):
  - Calculate the intersection points
  - Keep an intersection point only if the intersecting hypotheses are minimal at that point
  - Choose the evaluation points between the intersection points
[Figure: the same three hypothesis lines as above, with the intersection points marked.]

Minimum Error Training (cont.)
- Source sentence 1, now with different error scores
- The optimization would find a different ck
=> Different metrics lead to different scaling factors
[Figure: the same lines, but now h11: WER = 8, h12: WER = 2, h13: WER = 4, so the error of the 1-best translation steps through 8, 2, 4.]

Minimum Error Training (cont.)
- Sentence 2: the best ck lies in a different range
- No single ck is optimal for all sentences
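Here is a simplified Python sketch of this line search along one weight ck, reusing corpus_error and the assumed data layout from the first sketch. For clarity it enumerates all pairwise line intersections rather than constructing the envelope of minimal lines as in Och (2003), and the search interval is an illustrative assumption.

```python
def mert_sweep(nbest_lists, weights, k, lo=-5.0, hi=5.0):
    """Sketch of Och-style line search on weight c_k (other weights fixed).
    Each hypothesis's score is linear in c_k: Q = c_k * Q_k + Q_rest,
    so the 1-best translation changes only at intersections of these lines."""
    def line(h):
        # (offset Q_rest, slope Q_k) of this hypothesis's score line in c_k
        rest = sum(c * q for i, (c, q) in enumerate(zip(weights, h["scores"])) if i != k)
        return rest, h["scores"][k]

    # collect the candidate boundary points: all pairwise intersections
    points = {lo, hi}
    for hyps in nbest_lists:
        lines = [line(h) for h in hyps]
        for i in range(len(lines)):
            for j in range(i + 1, len(lines)):
                (a1, b1), (a2, b2) = lines[i], lines[j]
                if b1 != b2:
                    x = (a2 - a1) / (b1 - b2)
                    if lo < x < hi:
                        points.add(x)

    # the error is piecewise constant in c_k, so one evaluation point
    # between consecutive intersections suffices
    xs = sorted(points)
    mids = [(u + v) / 2 for u, v in zip(xs, xs[1:])]

    def err(ck):
        w = list(weights)
        w[k] = ck
        return corpus_error(nbest_lists, w)

    return min(mids, key=err)
```

Sweeping each weight in turn with mert_sweep, and iterating until the error stops improving, gives the Powell-style outer loop described above.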