Using Named Entity Recognition to improve Machine Translation
Neeraj Agrawal, Ankush Singla
CS 224N

Abstract

Named Entities (NEs) are a very important part of a sentence, and it is important for a Machine Translation (MT) system to get them right. Mistranslating or dropping NEs may not change a sentence's BLEU score by much, but it can hurt the sentence's human readability considerably. Current MT systems treat NEs as normal words and consequently drop or mistranslate them during translation. In this paper, we present three different approaches that help these systems treat NEs differently and improve human readability overall.

Introduction

Our focus is on Chinese-to-English translation. We start with the current Stanford Machine Translation system. Treating the Stanford system as a reference standard, we experiment with using Named Entity Recognition (NER) to improve its performance. We experimented with three different ways of using NER data: adding named entity data to help the aligner, modifying the language model, and giving higher weights to translations with the same number of Named Entities (NEs) in the source and target. We discuss each of these ideas in detail, along with the algorithms used and the motivation for choosing them. We also provide an in-depth analysis of the performance, highlighting the improvements as well as pointing out the cases where performance decreased.

Previous Work

Most of the published work that uses NER for MT has been directed towards learning to transliterate NEs. The work of Ulf Hermjakob, Kevin Knight, et al. [1] on Arabic-English translation shows that MT improvements can be achieved by transliterating NEs in the source data (instead of trying to translate them). It builds on the hypothesis that an MT system drops or mistranslates NEs when they do not occur in the training data. Another paper, by Bogdan Babych et al. [2], uses a simpler approach for European languages. It creates a "Do Not Translate" list based on NER and simply copies any word on this list from source to target without any translation or transliteration; a sketch of this idea appears below. Their analysis shows that the 'well-formedness' of sentences improves with this simple approach.
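As a concrete illustration of the "Do Not Translate" idea, here is a minimal sketch. It assumes NER has already produced the DNT list; the placeholder scheme, the function names, and the stubbed-out translation step are our assumptions for illustration, not Babych et al.'s implementation.

# Hypothetical sketch of a "Do Not Translate" (DNT) pass: words on a
# NER-derived DNT list are masked before translation and copied verbatim
# into the target afterwards.

def protect_dnt(tokens, dnt_words):
    """Swap DNT tokens for numbered placeholders before translation."""
    protected, mapping = [], {}
    for tok in tokens:
        if tok in dnt_words:
            placeholder = f"__DNT{len(mapping)}__"
            mapping[placeholder] = tok
            protected.append(placeholder)
        else:
            protected.append(tok)
    return protected, mapping

def restore_dnt(tokens, mapping):
    """Copy the original DNT words back into the translated output."""
    return [mapping.get(tok, tok) for tok in tokens]

dnt = {"Phuket"}
masked, mapping = protect_dnt("they were moved to Phuket safely".split(), dnt)
# ... an MT system would translate `masked` here, leaving placeholders intact ...
print(" ".join(restore_dnt(masked, mapping)))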
Performance of Current MT Systems on NER

Current MT systems tend to drop or mistranslate the NEs in a sentence, which does not hurt the BLEU [6] score much but does hurt readability for humans. We found many examples of this in the output of the current Stanford MT system as well:

Chinese: 27日中午,他们已被安全转移到普吉岛。
English: 27 noon , they have been shifted to safe places to 3pm .

Here the Chinese phrase '普吉岛' means 'Phuket Island', but the MT system dropped it from the translation. The available reference translations also show that the output should contain 'Phuket Island'. We made similar observations in a number of other sentence translations.

Chinese: 据泰米尔纳德邦首府钦奈的渔业界权威人士介绍,这....
English: "the authority of the capital , according to Nadu fishery source . . . .

In this sentence, the system dropped 'Tamil' from the translation of '泰米尔纳德邦', which means 'Tamil Nadu'.
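To make the BLEU-versus-readability point concrete, the following small check (ours, not from the paper) uses NLTK's sentence-level BLEU. Dropping the named entity costs only the brevity penalty here, even though the resulting sentence no longer tells a reader where anyone was moved.

# Hypothetical illustration (not from the paper): sentence-level BLEU is
# only mildly sensitive to a dropped named entity, even though the NE
# carries the information a human reader actually needs.
from nltk.translate.bleu_score import sentence_bleu

reference = ["at noon on the 27th they had been safely moved to phuket island".split()]
with_ne = "at noon on the 27th they had been safely moved to phuket island".split()
no_ne   = "at noon on the 27th they had been safely moved to".split()  # NE dropped

print(sentence_bleu(reference, with_ne))  # 1.0: identical to the reference
print(sentence_bleu(reference, no_ne))    # ~0.83: only the brevity penalty bites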
Experiment 1: Adding NER Data to Help the Aligner

One way to help the aligner is to append a list of NEs to the bi-text training data. This will increase the probability of matching these NEs in the source language to their counterparts in the target language. It will also increase the vocabulary of the target language (these NEs may not have been present in the training sentences before). For this purpose, we used the named entity data provided by the Linguistic Data Consortium [7]. It contains a list of Chinese-English bi-directional NEs compiled from Xinhua News Agency newswire texts. We preprocessed this data to match the original bi-text training data. Pre-processing included changing the encoding of the Chinese text from GB to UTF-8 and adding spaces before and after special symbols ( + , - , / , ; , . , \ ).
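A minimal sketch of this preprocessing step, assuming one NE pair per line in the LDC file; the file names and the choice of GB18030 (a superset of the older GB encodings) are our assumptions:

# Preprocessing sketch (assumptions: one NE pair per line, GB18030 as the
# source encoding; file names are hypothetical).
import re

SPECIALS = re.compile(r"\s*([+\-/;.\\])\s*")  # the symbols listed above

def preprocess_line(line: str) -> str:
    """Add a space before and after each special symbol,
    collapsing any whitespace already around it."""
    return SPECIALS.sub(r" \1 ", line).strip()

with open("ldc_ne_list.gb", encoding="gb18030") as src, \
     open("ldc_ne_list.utf8", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(preprocess_line(line) + "\n")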
The following BLEU [6] scores were obtained:

                 Dev Set    Test Set
Base Case         31.968      31.349
With NER Data     31.916      30.268

Analysis

1. The BLEU score for the development set does not change much. However, the BLEU score for the test set decreases considerably (by roughly one point).

2. Analysis of individual sentences reveals that this approach does improve performance for NEs that were present in the LDC data. Consider the examples:

Chinese: 27日 中午 , 他们 已 被 安全 转移 到 普吉岛 。
Translation: at noon on the 27th , and they have been shifted to safe places to phuket .

Chinese: 据 泰米尔纳德邦 首府 钦奈 的 渔业界 权威 人士 介绍 , 这 场 突如其来 的 . . . . .
Translation: "according to an authoritative source , acting for the state of tamil nadu capital . . .

Here we can see that whereas the base case had dropped 'Phuket' from the first sentence, the additional data helped the system align the Chinese word 普吉岛 to 'phuket'. Similarly, in the second sentence it preserved 'tamil nadu', whereas the base case had dropped 'tamil'.

3. However, there were also examples where translating NEs reduced the BLEU score. For example:

Chinese: 法新社 伦敦 六日 电
Base Case: hong kong presse , 6 ( xinhua )
NER Translation: agence france presse london 6 ( xinhua )
Ref 0: afp on the 6th , london
Ref 1: afp , london , 6th .

Here we can see that 法新社 was originally translated as 'hong kong presse', whereas after adding the additional data it is correctly translated as 'agence france presse'. It is also noteworthy that, while this translation is correct, it does not help the BLEU score, since all the reference translations render it as the acronym 'afp'.

Improvement based on analysis

Analysis of the NE list provided by the LDC showed that although most of the foreign words are English or can appear in English texts, there are also many non-English words. This was also observed in the example above, where 'afp' was written out as 'agence france presse'. We therefore made an effort to refine the original NE list. We used the following filtering:
- For lists containing names of