Using Named Entity Recognition to improve Machine Translation
Neeraj Agrawal, Ankush Singla
CS 224N

Abstract

Named Entities (NEs) are a very important part of a sentence, and it is important for a Machine Translation (MT) system to get them right. Mistranslating or dropping NEs may not change a sentence's BLEU score by much, but it can hurt the sentence's human readability considerably. Current MT systems treat NEs as normal words and consequently drop or mistranslate them during translation. In this paper, we present three different approaches that help these systems treat NEs differently and improve human readability overall.

Introduction

Our focus is on Chinese-to-English translation. We start with the current Stanford Machine Translation system. Treating the Stanford system as a reference standard, we experiment with using Named Entity Recognition (NER) to improve its performance. We experimented with three different ways of using NER data: adding named entity data to help the aligner, modifying the language model, and giving higher weights to translations with the same number of Named Entities (NEs) in the source and target. We discuss each of these ideas in detail, along with the algorithms used and the motivation for choosing them. We also provide an in-depth analysis of the performance, highlighting the improvements as well as pointing out the cases where performance decreased.

Previous Work

Most of the published work that uses NER for MT has been directed towards learning to transliterate NEs. The work of Ulf Hermjakob, Kevin Knight, et al. [1] on Arabic-English translation shows that MT improvements can be achieved by transliterating NEs in the source data (instead of trying to translate them). It builds on the hypothesis that an MT system drops or mistranslates NEs when they do not occur in the training data. Another paper, by Bogdan Babych et al. [2], uses a simpler approach for European languages. It creates a "Do Not Translate" list based on NER and simply copies any word on this list from source to target without any translation or transliteration; a sketch of this idea appears below. Their analysis shows that the 'well-formedness' of sentences improves with this simple approach.
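As a concrete illustration of the "Do Not Translate" idea, here is a minimal sketch. It assumes NER has already produced the DNT list; the placeholder scheme, the function names, and the stubbed-out translation step are our assumptions for illustration, not Babych et al.'s implementation.

# Hypothetical sketch of a "Do Not Translate" (DNT) pass: words on a
# NER-derived DNT list are masked before translation and copied verbatim
# into the target afterwards.

def protect_dnt(tokens, dnt_words):
    """Swap DNT tokens for numbered placeholders before translation."""
    protected, mapping = [], {}
    for tok in tokens:
        if tok in dnt_words:
            placeholder = f"__DNT{len(mapping)}__"
            mapping[placeholder] = tok
            protected.append(placeholder)
        else:
            protected.append(tok)
    return protected, mapping

def restore_dnt(tokens, mapping):
    """Copy the original DNT words back into the translated output."""
    return [mapping.get(tok, tok) for tok in tokens]

dnt = {"Phuket"}
masked, mapping = protect_dnt("they were moved to Phuket safely".split(), dnt)
# ... an MT system would translate `masked` here, leaving placeholders intact ...
print(" ".join(restore_dnt(masked, mapping)))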
Performance of Current MT Systems on NER

Current MT systems tend to drop or mistranslate the NEs in a sentence, which does not hurt the BLEU [6] score much but does hurt readability for humans. We found many examples of this in the output of the current Stanford MT system as well:

Chinese: 27日中午,他们已被安全转移到普吉岛。
English: 27 noon , they have been shifted to safe places to 3pm .

Here the Chinese phrase '普吉岛' means 'Phuket Island', but the MT system dropped it from the translation. The available reference translations also show that the output should contain 'Phuket Island'. We made similar observations in a number of other sentence translations.

Chinese: 据泰米尔纳德邦首府钦奈的渔业界权威人士介绍,这....
English: "the authority of the capital , according to Nadu fishery source . . . .

In this sentence, the system dropped 'Tamil' from the translation of '泰米尔纳德邦', which means 'Tamil Nadu'.
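To make the BLEU-versus-readability point concrete, the following small check (ours, not from the paper) uses NLTK's sentence-level BLEU. Dropping the named entity costs only the brevity penalty here, even though the resulting sentence no longer tells a reader where anyone was moved.

# Hypothetical illustration (not from the paper): sentence-level BLEU is
# only mildly sensitive to a dropped named entity, even though the NE
# carries the information a human reader actually needs.
from nltk.translate.bleu_score import sentence_bleu

reference = ["at noon on the 27th they had been safely moved to phuket island".split()]
with_ne = "at noon on the 27th they had been safely moved to phuket island".split()
no_ne   = "at noon on the 27th they had been safely moved to".split()  # NE dropped

print(sentence_bleu(reference, with_ne))  # 1.0: identical to the reference
print(sentence_bleu(reference, no_ne))    # ~0.83: only the brevity penalty bites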
Experiment 1: Adding NER Data to Help the Aligner

One way to help the aligner is to append a list of NEs to the bi-text training data. This will increase the probability of matching these NEs in the source language to their counterparts in the target language. It will also increase the vocabulary of the target language (these NEs may not have been present in the training sentences before). For this purpose, we used the named entity data provided by the Linguistic Data Consortium [7]. It contains a list of Chinese-English bi-directional NEs compiled from Xinhua News Agency newswire texts. We preprocessed this data to match the original bi-text training data. Pre-processing included changing the encoding of the Chinese text from GB to UTF-8 and adding spaces before and after special symbols ( + , - , / , ; , . , \ ).
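A minimal sketch of this preprocessing step, assuming one NE pair per line in the LDC file; the file names and the choice of GB18030 (a superset of the older GB encodings) are our assumptions:

# Preprocessing sketch (assumptions: one NE pair per line, GB18030 as the
# source encoding; file names are hypothetical).
import re

SPECIALS = re.compile(r"\s*([+\-/;.\\])\s*")  # the symbols listed above

def preprocess_line(line: str) -> str:
    """Add a space before and after each special symbol,
    collapsing any whitespace already around it."""
    return SPECIALS.sub(r" \1 ", line).strip()

with open("ldc_ne_list.gb", encoding="gb18030") as src, \
     open("ldc_ne_list.utf8", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(preprocess_line(line) + "\n")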
The following BLEU [6] scores were obtained:

                 Dev Set    Test Set
Base Case         31.968      31.349
With NER Data     31.916      30.268

Analysis

1. The BLEU score for the development set does not change much. However, the BLEU score for the test set decreases considerably (by roughly one point).

2. Analysis of individual sentences reveals that this approach does improve performance for NEs that were present in the LDC data. Consider the examples:

Chinese: 27日 中午 , 他们 已 被 安全 转移 到 普吉岛 。
Translation: at noon on the 27th , and they have been shifted to safe places to phuket .

Chinese: 据 泰米尔纳德邦 首府 钦奈 的 渔业界 权威 人士 介绍 , 这 场 突如其来 的 . . . . .
Translation: "according to an authoritative source , acting for the state of tamil nadu capital . . .

Here we can see that whereas the base case had dropped 'Phuket' from the first sentence, the additional data helped the system align the Chinese word 普吉岛 to 'phuket'. Similarly, in the second sentence it preserved 'tamil nadu', whereas the base case had dropped 'tamil'.

3. However, there were also examples where translating NEs reduced the BLEU score. For example:

Chinese: 法新社 伦敦 六日 电
Base Case: hong kong presse , 6 ( xinhua )
NER Translation: agence france presse london 6 ( xinhua )
Ref 0: afp on the 6th , london
Ref 1: afp , london , 6th .

Here we can see that 法新社 was originally translated as 'hong kong presse', whereas after adding the additional data it is correctly translated as 'agence france presse'. It is also noteworthy that, while this translation is correct, it does not help the BLEU score, since all the reference translations render it as the acronym 'afp'.

Improvement based on analysis

Analysis of the NE list provided by the LDC showed that although most of the foreign words are English or can appear in English texts, there are also many non-English words. This was also observed in the example above, where 'afp' was written out as 'agence france presse'. We therefore made an effort to refine the original NE list. We used the following filtering:
- For lists containing names of