Stanford CS 224N - Multi-Lingual Document Tagging


CS224n Final Project
Andrey Gusev (sunetid: agusev)
Multi-Lingual Document Tagging

Project Overview:

In this report I detail my work on "Multi-Lingual Document Tagging". The goal is to take a multilingual string and get back each language substring correctly tagged and separated from the phrases in the other languages. There are therefore two major components necessary for overall functionality: first, a robust language detector, and second, a boundary detector that can find language split points. Prior to the start of work on this final project I developed a subsystem that performed language detection. That subsystem was built over the course of two weeks, one week in January and one week in early April when this class was starting. Since understanding this system is important for boundary detection, I describe portions of it in this report as well, along with the modifications that were made to it during the project. Overall I would break down the work for this project as:

1. Porting the language detection subsystem and building the necessary command line tools in the LanguageDetectionTester class (15%)
2. Tuning language detection and significantly increasing the training sets (20%)
3. Multilingual test sets and boundary detection algorithms, approaches, and results (65%)

The rest of this report is structured as follows. First I introduce the command line tools that can be used to run the detector, generate various test sets, and run detectors on those test sets. Next I detail the implementation of the language detection engine. Then I describe the iterations in building the language boundary detector; in that section I also present experimental results and discuss the choices I made for algorithm tuning. In the following section I discuss and analyze the results as a whole. In the future work section I discuss extensions that I believe can improve the performance of these models and add new functionality. Finally, I discuss some related work.

Command line tools:

There are quite a few different run modes that can be executed on the LanguageDetectionTester class, so I have created several scripts to allow for easier execution of the necessary targets. All scripts are in the root of the submission directory. Please note that the required data is not submitted with the project; all scripts reference the data through the DATA_PATH variable. If you would like to copy the data, copy /afs/ir.stanford.edu/users/a/g/agusev/cs224n/langDetect/data/ to a local directory and change the DATA_PATH variable in all the scripts.

1. detectLang – no parameters; takes a test string interactively and prints out confidence levels and per-n-gram values for every language. The top language is the detected language.
2. detectMultiLang – no parameters, also interactive; takes a multilingual string and breaks it down into phrases tagged with languages. Use enter (no string input) to exit. As an example you can use "yo no hablo espanol but some people parler francais tre bien und das ist eindeutig sehr gut".
3. genNgramModels – this probably doesn't need to be run again; takes the original source text for each language (a collection of books from the Gutenberg project) and creates character n-gram models in the ./data/languagemodels/ngramModel/ directory.
4. genTrainingAndTestData – this probably doesn't need to be run again; takes the original source text and triggers generation of training and test data with a 90-10 split. You can use the -minTrainSize and -maxTrainSize parameters to change the minimum and maximum phrase size; by default min=4 and max=8. Increasing the average sample size improves language detection accuracy.
5. genMultiLingTestData – this probably doesn't need to be run again; takes the generated test set (from the call above) and randomly generates a single document with phrases from each language mixed together. Each phrase is tagged with the correct language. The mixed-language document will contain 30000 phrases.
6. runTestSet – runs the generated sets through the language detection algorithms and reports accuracy numbers across languages for each algorithm. The -useClassifier parameter controls which algorithms to run. There are three algorithms, and bit masking is used to determine which of them to run (see the sketch after this list):
   // 1 - only linear classifier
   // 2 - bagged decision tree
   // 4 - logistic classifier
   // for example 7 selects all of them
   These algorithms will be discussed in more detail in the next section.
7. runMultiLingTestSet – runs the multilingual test set through the boundary detector and language tagger and reports accuracy numbers on the multilingual test set. Uses the linear classifier for language classification and by default uses the FOUR_WORD_BIGRAM boundary detector. Use the -boundaryDetector parameter to change the boundary detector. The following boundary detection parameters are supported: ONE_WORD, TWO_WORD, THREE_WORD, BASE_BIGRAM, TWO_WORD_BIGRAM, THREE_WORD_BIGRAM, FOUR_WORD_BIGRAM, FIVE_WORD_BIGRAM, SIX_WORD_BIGRAM, FIVE_WORD_NESTED. These parameters will be described in more detail in the boundary detection section.
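As a minimal sketch of how the -useClassifier bit masking works: each algorithm occupies one bit, and a value such as 7 (binary 111) selects all three. The flag values below are the ones documented in the list above; the class and field names are my own illustration, not the actual implementation.

    // Hypothetical sketch of bitmask-based classifier selection.
    // Flag values match the -useClassifier documentation above.
    public class ClassifierSelection {
        public static final int LINEAR   = 1; // binary 001
        public static final int BAGGED   = 2; // binary 010
        public static final int LOGISTIC = 4; // binary 100

        public static void main(String[] args) {
            int useClassifier = 7; // e.g. parsed from -useClassifier; 7 = 1|2|4 selects all

            if ((useClassifier & LINEAR) != 0) {
                System.out.println("running linear classifier");
            }
            if ((useClassifier & BAGGED) != 0) {
                System.out.println("running bagged decision tree");
            }
            if ((useClassifier & LOGISTIC) != 0) {
                System.out.println("running logistic classifier");
            }
        }
    }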
Language Detection Engine:

The majority of the language detection engine was implemented by me prior to this final project, over a two-week span: one week during January 2009 and one more week in the beginning of April 2009. However, because this topic is relevant to NLP and to the boundary detection section, and was never formally presented in a report, I describe the major components of the approach in this section.

1.0 Character ngrams

The base component of language detection is character n-grams. More precisely, we break each document into 1-gram, 2-gram, 3-gram, 4-gram, and 5-gram profiles, then use accumulated probabilities over all n-grams for a given document. We use only the top n-grams. The formula for how many top n-grams to keep for each n-gram size profile was tweaked during this final project and in fact produced better results; it is currently set to Math.min(BASE_TOP_NGRAMS * this.ngramSize, 150) where BASE_TOP_NGRAMS = 50. This is especially important when we construct the probabilistic representation of a language, where we only want to keep n-grams that are truly representative of the language and exclude all the noise. Once we have broken a document down into a set of n-grams, we can represent each document as a vector in k-dimensional n-gram space. Each dimension is a possible n-gram. A value on
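A minimal sketch of the top-n-gram extraction described above: count all character n-grams of a given size, then keep only the most frequent ones, capped by the quoted formula Math.min(BASE_TOP_NGRAMS * ngramSize, 150). The class and method names here are hypothetical illustrations, not the actual implementation.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Illustrative sketch of character n-gram profiling; the cap formula is
    // quoted from the report, while class/method names are hypothetical.
    public class NgramProfiler {
        static final int BASE_TOP_NGRAMS = 50;

        // Count all character n-grams of the given size in the text.
        static Map<String, Integer> countNgrams(String text, int ngramSize) {
            Map<String, Integer> counts = new HashMap<>();
            for (int i = 0; i + ngramSize <= text.length(); i++) {
                String gram = text.substring(i, i + ngramSize);
                counts.merge(gram, 1, Integer::sum);
            }
            return counts;
        }

        // Keep only the most frequent n-grams, capped per the formula above.
        static List<String> topNgrams(String text, int ngramSize) {
            int cap = Math.min(BASE_TOP_NGRAMS * ngramSize, 150);
            return countNgrams(text, ngramSize).entrySet().stream()
                    .sorted((a, b) -> b.getValue() - a.getValue())
                    .limit(cap)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            String doc = "das ist eindeutig sehr gut";
            for (int n = 1; n <= 5; n++) {
                System.out.println(n + "-gram profile: " + topNgrams(doc, n));
            }
        }
    }

Building one such profile per n-gram size (1 through 5) yields the per-document profiles described above, from which the document's vector in n-gram space is constructed.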

