Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
6.863J/9.611J, Natural Language Processing
Laboratory 1, Part 2: Word Parsing and Finite-state Transducers

Handed out: February 13, 2009
Due: February 27, 2009

Don’t Panic!

1 Goals of Lab 1, Part 2

In this part of the lab you will design transducer automata and a lexicon automaton (corresponding to two files, a spanish.yaml file and a spanish.lex file) that will be able to carry out a morphological analysis for a subset of Spanish. The following sections describe the Spanish that your design should be able to cope with. As usual, please email your lab report (with the files described below) to [email protected], and include in your email the subject line: 6.863 Lab 1b [<first name> <last name>]. Starter template files and code required for the lab can be downloaded from:

http://web.mit.edu/6.863/spring2009/code/lab1/lab1b.tgz

Make a copy of spanish-template.yaml as spanish.yaml to contain your edits.

Your lab report should contain the following three elements:

1. lab1b.pdf: a lab writeup containing:
   • A brief description of how your system operates.
   • What additional lexical (underlying) characters have you introduced (if any)?
   • What do your subset definitions represent?
   • What is the purpose of each automaton?
   • Briefly, how does each automaton work?
   • Problems that arose in the design of your system.
   • A brief discussion of how extensible your system would be if one had to add more nouns and verbs (without extending the morphological processes you handle).
   As with part one, we have created LaTeX templates for you to use for your writeup, available from this URL:
   http://web.mit.edu/6.863/spring2009/writeups/lab1b/

2. lab1b.zip: an archive of your changes as a zip file. At a minimum, your zip file should contain a spanish.yaml file and the corresponding spanish.lex file. Please include your name, your collaborators, and descriptive comments at the head of your source files.

3. lab1b.log: a log record of a batch run on the test suite file spanish.rec. To generate this log, you can use the program provided in the starter file package: spanishtest.py. This script expects spanish.rec and spanish.yaml to be in your working directory.

Recognizably incomprehensible lab reports will not be graded. If you are a native Spanish speaker, you may find that your judgments disagree with some of those included here. The judgments included here (real bugs aside) are to be considered the “gold standard”; we understand that dialect variants may exist. (If you are industrious and clever, you may want to think about how dialectal variation might be accommodated in a system like this one.) By the end of this laboratory you should have a very good idea about how to design a word parser for a real language. If you need a reference on Spanish verb conjugations, you can use this link: http://www.conjugation.org/.

Important: These labs have been used before and debugged. However, if you find an error in the laboratory assignment, please let us know, preferably by email to the TAs and me (you can use the name: [email protected]), so that we may inform the rest of the class.

1.1 Understanding the automata rule formats: the spelling change transducers

In the following sections we describe the automata (yaml) and lexicon (lex) formats. We covered much of this same territory in part 1, but it bears repeating, with some additional details. After these preliminaries, we turn to the Spanish phenomena we want you to handle with your system; this is the real work of the laboratory.
We provide you with one ‘starter’ rule that handles one Spanish spelling-change phenomenon, so as to give you a concrete idea of what the spelling change rules look like.

1.1.1 The Alphabet

You are to assume that Spanish orthography (its written form) uses the following characters, some of which are two-character sequences that stand for accented characters that would otherwise be difficult to type.

a a´ b c d e e´ f g h i i´ j k l m n n~ o o´ p q r s t u u´ v w x y z

Here, a´ denotes an accented a, á; e´ denotes an accented e, é; i´ denotes an accented i, í; n~ denotes an n with a tilde, ñ; o´ denotes an accented o, ó; and u´ denotes an accented u, ú. For example, la´piz denotes lápiz.

In addition, you may find the following three underlying (lexical) characters useful: C, J, and Z. By convention we use CAPITAL letters to denote characters that have a special meaning as underlying lexical forms, but never appear on the surface. In particular, C stands for a possible c softening, J stands for a possible g softening, and Z stands for a possible z insertion. To repeat: it is important to remember what an ‘underlying form’ means! If J is an underlying character, then it can never appear on the surface, that is, in the written output. That means your Spanish system should never generate a surface form such as coJer, with an underlying J. That is simply wrong. Finally, note that we say you may find these C, J, and Z characters useful. But your solution might posit other underlying characters, or possibly no special underlying characters at all.

1.1.2 The format of the yaml (spelling change rules) file

As with the english.yaml file we described in part 1 of the Laboratory, the purpose of the spanish.yaml file is to set up the possible lexical and surface characters to use, the lexicon file to load, and the finite-state transducer automata that dictate the legitimate lexical:surface character pair sequences.
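The alphabet conventions of section 1.1.1 can be illustrated with a small Python sketch. It is not part of the lab's starter code: the names SURFACE_ALPHABET, tokenize, and is_valid_surface are hypothetical, chosen only for this illustration. The sketch splits a typed word into surface symbols, treating a´ or n~ as a single symbol, and rejects any string containing a character outside the surface alphabet, such as an underlying J.

```python
# Surface alphabet as listed in section 1.1.1. Accented letters are typed
# as two-character symbols: a vowel followed by ´, or n followed by ~.
SURFACE_ALPHABET = (
    "a a´ b c d e e´ f g h i i´ j k l m n n~ o o´ p q r s t u u´ v w x y z"
).split()


def tokenize(word, alphabet=SURFACE_ALPHABET):
    """Split word into alphabet symbols, preferring the longest match.

    Raises ValueError if some character is not part of any symbol.
    """
    # Try two-character symbols before one-character symbols.
    symbols = sorted(alphabet, key=len, reverse=True)
    result = []
    i = 0
    while i < len(word):
        for sym in symbols:
            if word.startswith(sym, i):
                result.append(sym)
                i += len(sym)
                break
        else:
            raise ValueError(
                f"{word[i]!r} at position {i} is not a surface character"
            )
    return result


def is_valid_surface(word):
    """Return True if word consists only of surface alphabet symbols."""
    try:
        tokenize(word)
    except ValueError:
        return False
    return True
```

Under these assumptions, tokenize("la´piz") yields the five symbols l, a´, p, i, z, and is_valid_surface("coJer") is False because the capital J is an underlying character, not a surface one.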
We now largely repeat the description of this file from the first part of the Laboratory, with some additional details particular to Spanish.

The conventions we are following for this file’s format are particular to the mark-up language we are using, called ‘yaml’ (hence the file naming convention); you can find out more about yaml at www.yaml.org. The .yaml file contains the following keys.

• boundary: the symbol that represents the end of the word (usually #). (You never type this as input to “Recognize” since it is used to demarcate the way a word is split up in the lexicon.)
• defaults: a space-separated list of lexical, surface character pairs that should be allowed without having a rule explicitly mentioning them. As a convention to save space, when we write down just a single character like a, this actually denotes the lexical, surface pair a:a.
• subsets: a mapping of subset characters to

