MIT 6 863J - Laboratory 1b: Two-level morphology

This preview shows page 1-2-3-4-5-6 out of 17 pages.


Massachusetts Institute of Technology
6.863J/9.611J, Natural Language Processing, Spring 2003
Department of Electrical Engineering and Computer Science
Department of Brain and Cognitive Sciences

Laboratory 1b: Two-level morphology - Building a System

Handed out: February 12, 2003
Due: February 24, 2003

1 Introduction

For the second part of Lab 1, you will design Kimmo automata and Kimmo lexicons that can carry out a morphological analysis of a subset of Spanish. The following sections describe the processes that your machines should be designed to cope with. Your lab report should contain the three elements listed below; please email a URL pointing to your lab report to our TA, [email protected]:

• A description of how your system operates. What additional lexical characters have you introduced (if any)? What do your subset definitions represent? What is the purpose of each automaton? Briefly, how does each automaton work? If appropriate, mention problems that arose in the design of your system.
• Pointers to your listings of your .lex file and your .rul files.
• A pointer to a log record of a recognition run on the file spanish.rec. This file may be found at the top level of the course locker and also on the web page link for this laboratory.

Recognizably incomprehensible lab reports will not be graded. If you are a native Spanish speaker, you may find that your judgements disagree with those included here. The ones included here (real bugs aside) are to be considered the “gold standard” to use; we understand that dialect variants may exist. (If you are industrious and clever, you may want to think about how dialectal variation might be accommodated in a system like this one.)

Important: These labs have been used before and debugged. However, if you find an error in the laboratory assignment, please let us know (preferably by email to the lab TA), so that we may inform the rest of the class.

In the following sections we describe the automata (.rul) and lexicon (.lex) formats, along with some of the (simple) tools we provide to aid in building the automata and visualizing them. After this preliminary, we turn to the three Spanish phenomena we want you to handle with your system.

1.1 Kimmo Automata Format

You are to assume that Spanish orthography uses the following characters, some of which we have introduced to stand for accented characters:

    a ^ b c C d e < f g h i { j J k l m n ~ o * p q r s t u > v w x y z Z

Here, ^ denotes an accented a (á); < denotes an accented e (é); { denotes an accented i (í); * denotes an accented o (ó); > denotes a u with a dieresis (ü); and ~ denotes an n with a tilde (ñ). For example, l^piz stands for lápiz. In addition, by convention we use capital letters to denote characters that have a special meaning. In particular, C stands for a possible c softening, J stands for a possible g softening, and Z stands for a possible z insertion. Given this, your automaton (.rul) file should be formatted as follows. Do note, however, that Section 1.1.1 discusses an alternative method of writing the automata down that most people have found preferable.

1. The very first line of the .rul file should declare your ALPHABET characters, the individual tokens that you will use.

    ALPHABET
    a ^ b c C d e < f g h i { j J k l m n ~ o * p q r s t u > v w x y z Z +

   Note that the affix marker ‘+’ is required as part of the alphabet. If you later decide you need additional characters for handling Spanish phenomena, you must modify this alphabet.

2. In the next three lines, declare your NULL (empty, zero), ANY (automata-relative wildcard), and BOUNDARY (end of word) characters:

    NULL 0
    ANY @
    BOUNDARY #

3. Next, declare the SUBSETS of characters you want to refer to, that is, the groups of characters that the automata rules will mention. A SUBSET is an abbreviatory device: you do not have to name each individual character when it is really a GROUP of characters that you mean. For example, in English, the characters s, x, z are often called Sibilants because they behave alike in many rules, like the one that inserts an e in fox+s to yield foxes. You should use only the subsets that are necessary for handling the phenomena described in this lab.

   SUBSETS are denoted by capital letters. For example, in this lab you will require (say) the subset V to stand for the class of vowels:

    SUBSET V a e i o u ^ < { * >

   You may also find it useful to group vowels according to the following subsets: Back (B), Front, Low, and High. (For our purposes, these refer to the way in which the vowel sound is made; e.g., a Front vowel is articulated at the front of the mouth. Try it yourself: that is the great thing about language, you can try all these experiments on yourself, without clearing human subject protocols.)

    SUBSET BACK u o a > * ^
    SUBSET FRONT e i < {
    SUBSET LOW e o a < * ^
    SUBSET HIGH i u { >

   Finally, you may need to declare a subset for consonants:

    SUBSET CONSONANTS b c d f g h j k l m n ~ p q r s t v w x y z

   Summarizing so far, the first lines of your .rul file should look like this:

    ; semi-colon is a comment, like Lisp - put them anywhere
    ; This is the format for the .rul file for Spanish. Note that blank lines are ignored
    ;
    ; + = morpheme break
    ALPHABET
    a ^ b c C d e < f g h i { j J k l m n ~ o * p q r s t u > v w x y z Z +
    NULL 0
    ANY @
    BOUNDARY #
    SUBSET V a e i o u ^ < { * >
    SUBSET BACK u o a > * ^
    SUBSET FRONT e i < {
    SUBSET LOW e o a < * ^
    SUBSET HIGH i u { >
    SUBSET CONSONANTS b c d f g h j k l m n ~ p q r s t v w x y z

4. At this point, you skip one blank line and are then ready to specify the rule transducers themselves. Your very first automaton must define the default surface character correspondence (that is, the surface characters paired with themselves). It is a “dummy” automaton that has just one state and, for each surface character, one transition from that state back to itself.

   Thus, if there are 33 characters that can be written “on the surface,” the resulting FSA table will have 33 columns. (Note that purely underlying characters like C and Z are not included.) This is required as a mechanical device by the Kimmo system. The dummy automaton should include the ANY character, @, but not the affix boundary marker +. This is because @ is a possible surface match, but + can only be an underlying, lexical character. This automaton should thus be defined as follows:

    RULE "Surface Characters" 1 33
    a ^ b c d e < f g h i { j k l m n ~ o * p q r s t u > v w x y z @
    a ^ b c d e < f g h i { j k l m n ~ o * p …
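
As a cross-check on the declarations above, the following is a minimal Python sketch (not part of the Kimmo tools or the course materials) that encodes accented Spanish spellings into the lab's character codes, verifies that every SUBSET member is declared in the ALPHABET, and derives the column characters of the "Surface Characters" rule. All function and variable names are hypothetical. Treating C, J, Z and + as lexical-only is an assumption read off the rule header shown above, and lowercasing the input is a choice made here so that capital letters cannot collide with the special characters.

```python
import unicodedata

# Declarations copied from the .rul header in the handout.
ALPHABET = "a ^ b c C d e < f g h i { j J k l m n ~ o * p q r s t u > v w x y z Z +".split()
ANY = "@"

# Assumption: these characters are purely lexical (underlying) and never
# appear on the surface, matching the rule header shown in the handout.
LEXICAL_ONLY = {"C", "J", "Z", "+"}

SUBSETS = {
    "V": "a e i o u ^ < { * >".split(),
    "BACK": "u o a > * ^".split(),
    "FRONT": "e i < {".split(),
    "LOW": "e o a < * ^".split(),
    "HIGH": "i u { >".split(),
    "CONSONANTS": "b c d f g h j k l m n ~ p q r s t v w x y z".split(),
}

# The lab's stand-ins for accented characters.
ACCENT_CODES = {"á": "^", "é": "<", "í": "{", "ó": "*", "ü": ">", "ñ": "~"}


def encode(word):
    """Rewrite a Spanish spelling using the lab's ASCII stand-ins."""
    # NFC composes a letter plus combining accent into one code point;
    # lowercasing keeps capitals from colliding with C, J and Z.
    composed = unicodedata.normalize("NFC", word.lower())
    return "".join(ACCENT_CODES.get(ch, ch) for ch in composed)


def surface_characters():
    """ALPHABET minus the lexical-only characters, plus the ANY wildcard."""
    return [ch for ch in ALPHABET if ch not in LEXICAL_ONLY] + [ANY]


def undeclared_subset_members():
    """Map each subset name to any members missing from ALPHABET."""
    known = set(ALPHABET)
    missing = {}
    for name, members in SUBSETS.items():
        bad = [ch for ch in members if ch not in known]
        if bad:
            missing[name] = bad
    return missing


def surface_rule_header():
    """First two lines of the dummy rule: RULE line, then column characters."""
    chars = surface_characters()
    return f'RULE "Surface Characters" 1 {len(chars)}\n' + " ".join(chars)


if __name__ == "__main__":
    print(encode("lápiz"))
    print(undeclared_subset_members())
    print(surface_rule_header())
```

Running a check like this before handing a .rul file to the Kimmo system may catch a subset that mentions an undeclared character, or a column count that no longer matches the alphabet after new characters are added.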

