MIT 6.863J - Two-level morphology — Building a System

Massachusetts Institute of Technology
6.863J/9.611J, Natural Language Processing, Spring 2003
Department of Electrical Engineering and Computer Science
Department of Brain and Cognitive Sciences

Laboratory 1b: Two-level morphology — Building a System
Handed out: February 12, 2003
Due: February 24, 2003

Introduction

For the second part of Lab 1, you will design Kimmo automata and Kimmo lexicons that can carry out a morphological analysis of a subset of Spanish. The following sections describe the processes that your machines should be designed to cope with.

Your lab report should contain three elements:

• A description of how your system operates. What additional lexical characters have you introduced (if any)? What do your subset definitions represent? What is the purpose of each automaton? Briefly, how does each automaton work? If appropriate, mention problems that arose in the design of your system.
• Pointers to your listings of your .lex file and your .rul files.
• A pointer to a log record of a recognition run on the file spanish.rec. This file may be found on the web page link for this laboratory.

Recognizably incomprehensible lab reports will not be graded.

If you are a native Spanish speaker, you may find that your judgements disagree with those included here. The ones included here (real bugs aside) are to be considered the “gold standard” to use; we understand that dialect variants may exist. (If you are industrious and clever, you may want to think about how dialectal variation might be accommodated in a system like this one.)

Important: These labs have been used before and debugged. However, if you find an error in the laboratory assignment, please let us know so that we may inform the rest of the class.

In the following sections we describe the automata (.rul) and lexicon (.lex) formats, along with some of the (simple) tools we provide to aid in building the automata and visualizing them.
After this preliminary, we turn to the three Spanish phenomena we want you to handle with your system.

1.1 Kimmo Automata Format

You are to assume that Spanish orthography uses the following characters, some of which we have introduced to stand for accented characters.

    a ^ b c C d e < f g h i { j J k l m n ~ o * p q r s t u > v w x y z Z

Here, ^ denotes an accented a (á); < denotes an accented e (é); { denotes an accented i (í); * denotes an accented o (ó); > denotes a u (ü) accent; and ~ denotes an n tilde (ñ). For example, l^piz stands for lápiz.

In addition, by convention we use capital letters to denote characters that have a special meaning. In particular, C stands for a possible c softening, J stands for a possible g softening, and Z stands for a possible z insertion.

Given this, your automaton (.rul) file should be formatted as follows. Do note, however, that section 1.1.1 on page 4 discusses an alternative method of writing the automata down that most people have found preferable.

1. The very first line of the .rul file should declare your ALPHABET characters, the individual tokens that you will use.

    ALPHABET a ^ b c C d e < f g h i { j J k l m n ~ o * p q r s t u > v w x y z Z +

Note that the affix marker ‘+’ is required as part of the alphabet. If you later decide you need additional characters for handling Spanish phenomena, you must modify this alphabet.

2. In the next three lines, declare your NULL (empty, zero), ANY (automata-relative wildcard), and BOUNDARY (end of word) characters:

    NULL 0
    ANY @
    BOUNDARY #

3. Next declare the SUBSETS of characters you want to refer to, that is, the groups of characters that will be referred to in the automata rules. A SUBSET, then, is an abbreviatory device so that you don’t have to keep mentioning each individual character by name, when it is really a GROUP of characters that you mean.
For example, in English, the characters s, x, and z are often grouped as Sibilants, because they behave alike in many rules, like the one that inserts an e in fox+s to yield foxes. You should use only the subsets that are necessary for handling the phenomena described in this lab. SUBSETS are denoted by capital letters. For example, in this lab you will require (say) the subset V to stand for the class of vowels (V):

    SUBSET V a e i o u ^ < { * >

You may also find it useful to group vowels according to the following subsets: Back (B), Front, Low, and High. (For our purposes, these refer to the way in which the vowel sound is made; e.g., a Front vowel is articulated at the front of the mouth. Try it yourself. That is the great thing about language: you can try all these experiments on yourself, without clearing human subject protocols.)

    SUBSET BACK u o a > * ^
    SUBSET FRONT e i < {
    SUBSET LOW e o a < * ^
    SUBSET HIGH i u { >

Finally, you may need to declare a subset for consonants:

    SUBSET CONSONANTS b c d f g h j k l m n ~ p q r s t v w x y z

Summarizing so far, then, the first lines of your .rul file should look like this:

    ; semi-colon is a comment, like Lisp - put them anywhere
    ; This is the format for the .rul file for Spanish. Note that blank lines are ignored
    ;
    ; + = morpheme break
    ALPHABET a ^ b c C d e < f g h i { j J k l m n ~ o * p q r s t u > v w x y z Z +
    NULL 0
    ANY @
    BOUNDARY #
    SUBSET V a e i o u ^ < { * >
    SUBSET BACK u o a > * ^
    SUBSET FRONT e i < {
    SUBSET LOW e o a < * ^
    SUBSET HIGH i u { >
    SUBSET CONSONANTS b c d f g h j k l m n ~ p q r s t v w x y z

4. At this point, you skip one blank line and then are ready to specify the rule transducers themselves. Your very first automaton must define the default surface character correspondence (that is, the surface characters paired with themselves). It is a “dummy” automaton that has just 1 state, and one transition for each surface character, to itself.
Thus, if there are 33 characters that can be written “on the surface,” the resulting fsa table will have 33 columns. (Note that purely underlying characters like C and Z are not included.) This is required as a mechanical device by the Kimmo system. The dummy automaton should include the ANY character, @ (but not the affix boundary marker +). This is because @ is a possible surface match, but + can only
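
The header layout and the dummy automaton's columns described above can be sanity-checked mechanically. The following Python sketch is not part of the Kimmo tools; the function names (parse_header, undeclared_subset_members, dummy_columns) are hypothetical, and it assumes that C, J, Z, and + are exactly the characters that never appear on the surface (the text names only C and Z as examples, so including J is an assumption). It reads the header lines shown earlier, checks that every SUBSET member is declared in the ALPHABET, and lists the character pairs the one-state dummy automaton would need.

```python
# Hypothetical checker for the .rul header shown in the handout.
# It is a sketch under stated assumptions, not the Kimmo system's own parser.

RUL_HEADER = """\
; + = morpheme break
ALPHABET a ^ b c C d e < f g h i { j J k l m n ~ o * p q r s t u > v w x y z Z +
NULL 0
ANY @
BOUNDARY #
SUBSET V a e i o u ^ < { * >
SUBSET BACK u o a > * ^
SUBSET FRONT e i < {
SUBSET LOW e o a < * ^
SUBSET HIGH i u { >
SUBSET CONSONANTS b c d f g h j k l m n ~ p q r s t v w x y z
"""


def parse_header(text):
    """Collect ALPHABET, NULL/ANY/BOUNDARY, and SUBSET declarations."""
    header = {"ALPHABET": [], "SUBSETS": {}}
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()  # ';' starts a comment
        if not line:
            continue  # blank lines are ignored
        keyword, *rest = line.split()
        if keyword == "ALPHABET":
            header["ALPHABET"] = rest
        elif keyword in ("NULL", "ANY", "BOUNDARY"):
            header[keyword] = rest[0]
        elif keyword == "SUBSET":
            header["SUBSETS"][rest[0]] = rest[1:]
    return header


def undeclared_subset_members(header):
    """Map each subset name to any members missing from the ALPHABET."""
    alphabet = set(header["ALPHABET"])
    problems = {}
    for name, members in header["SUBSETS"].items():
        missing = [c for c in members if c not in alphabet]
        if missing:
            problems[name] = missing
    return problems


def dummy_columns(header, underlying_only=("C", "J", "Z", "+")):
    """Pairs c:c for each surface character, plus ANY paired with itself.

    Assumption: 'underlying_only' lists the characters that are never
    written on the surface; adjust it if you add lexical characters.
    """
    surface = [c for c in header["ALPHABET"] if c not in underlying_only]
    any_char = header["ANY"]
    return [(c, c) for c in surface] + [(any_char, any_char)]


if __name__ == "__main__":
    header = parse_header(RUL_HEADER)
    print("undeclared subset members:", undeclared_subset_members(header))
    columns = dummy_columns(header)
    print(len(columns), "columns:", " ".join(f"{a}:{b}" for a, b in columns))
```

Under these assumptions the sketch reports 33 columns (32 surface characters plus @), which agrees with the figure used in the text. If you extend the ALPHABET with new lexical characters, rerun the check with an updated underlying_only list.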

