MIT 6 863J - Writing Kimmo Lexicons


Writing Kimmo Lexicons

1.0 First, two general tips on rule writing and lexicon writing.

Besides tracing a rule, there are two other PCKIMMO commands that you can use.

1. The first is

    SET RULE <rule number> {ON|OFF}

This lets you turn individual rules on or off. This is very helpful if you find that your recognizer returns NO results, i.e., gives the output ****NONE***** when it should be giving some parse. You can then turn off rules one by one and isolate the offending rule. (Usually the offender is a rule that leaves the system in a nonfinal state when it should not.) To use this, your rules must begin with numbered, double-quoted comment lines, as in

    "Rule 3 EPENTHESIS.... mumble....." 5 6

The only thing that matters inside the quotes is the "Rule 3" business. If you want to see your list of rules, do

    SHOW RULES

2. The second debugging aid is the command

    SHOW RULE <rule number>

This is very helpful in dealing with one of the trickier parts of writing rules: how the different subsets and feasible pairs (lexical/surface characters) interact, even within one rule. SHOW RULE will tell you what characters are ACTUALLY being processed by the automata, which can differ from what you wrote down in the automaton spec.

Let me illustrate with an example. Consider the following INCORRECT Epenthesis rule. Here S represents the subset s, x, z. Remember that this is the rule that is supposed to pair fox+s with foxes. That is, IF the pair 0:e appears, then it must have a left context of S (where S is x, z, s) followed by +:0, and the right context s#. (Note that the right context doesn't care what the underlying lexical form is, so we don't write it down.) That is a DECLARATIVE CONSTRAINT that says this pair is OK: note that the left context of 0:e is indeed one of the members of S:S (in this case, X:x) followed by +:0.
And the right context is indeed simply s# (on the surface; we don't really have to mention the hash mark # boundary symbol in the lexical or underlying string, since it is assumed to be the same as in the surface string). So we are really lining up the following pair of strings, where I have written the surface characters in lower case. This is, in fact, how you can develop your own automata: first try pairing up lexical and surface strings for the Spanish examples.

    F O X + 0 S #    (lexical, or underlying)
    f o x 0 e s #    (surface)

    RULE "3 Epenthesis. 0:e ==> S +:0____s#" 5 6
          s  S  +  0  #  @
          s  S  0  e  #  @
     1:   1  2  1  0  1  1
     2:   1  2  3  0  1  1
     3:   1  2  1  4  1  1
     4.   5  0  0  0  0  0
     5.   0  0  0  0  1  0

The 5 6 at the end of the RULE statement gives the number of rows and columns in the state table. Recall that 0 here means a reject state. The states are listed in the leftmost column. The transition arc labels are the top two rows, read as (lexical, surface) pairs: the feasible pairs. The inner cells say what the next states are: if we are in state 1 and see an s:s, then we go to state 1.

Given the lexical form fox+s, this table correctly produces foxes, but given kiss+s, it fails to produce the form kisses. The SHOW RULE command gives us the following information, which tells us why it fails:

    >show rule 3
    3 on Epenthesis Epenthesis. 0:e ==> S +:0____s#"
    s:s ( s:s )
    S:S ( x:x z:z )
    +:0 ( +:0 )
    0:e ( 0:e )
    #:# ( #:# )
    @:@ ( b:b d:d F:F g:g j:j k:k l:l m:m n:n p:p q:q r:r t:t v:v w:w y:y a:a e:e i:i o:o u:u ':' -:- -:0 ':0 )

From this display, it is obvious that the column header S:S does NOT contain the pair s:s, as might be expected. This is because the column headers s:s and S:S OVERLAP with respect to the pair s:s: this pair matches BOTH. The pair s:s is assigned to the s:s column because that header is MORE SPECIFIC than the S:S column header. That is, only ONE feasible pair matches the s:s header, while three pairs match the S:S header.
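
As a rough illustration (not PC-KIMMO's actual implementation), the state table and the column assignment reported by SHOW RULE can be simulated in a few lines of Python. All function and variable names are made up for this sketch. The feasible pairs are simplified to identity pairs for a-z plus +:0, 0:e and #:#, and states 1-3 are treated as final on the assumption that ':' marks a final row and '.' a nonfinal one. The FIXED table applies the repair of copying the S:S column into the s:s column for states 1-3.

```python
# Hypothetical sketch of the Epenthesis table; names and data layout are
# illustrative assumptions, not PC-KIMMO internals.
SUBSET_S = set("sxz")
LETTERS = "abcdefghijklmnopqrstuvwxyz"
# Simplified feasible pairs (lexical, surface): identity pairs plus specials.
FEASIBLE = [(c, c) for c in LETTERS] + [("+", "0"), ("0", "e"), ("#", "#")]

# Column headers in the order used by the table in the text.
HEADERS = [("s", "s"), ("S", "S"), ("+", "0"), ("0", "e"), ("#", "#"), ("@", "@")]

def sym_matches(header_sym, sym):
    """'@' matches anything, 'S' matches the subset, otherwise exact match."""
    if header_sym == "@":
        return True
    if header_sym == "S":
        return sym in SUBSET_S
    return header_sym == sym

def header_matches(header, pair):
    return sym_matches(header[0], pair[0]) and sym_matches(header[1], pair[1])

def column_of(pair):
    """Pick the most specific header: the one matching the fewest feasible pairs."""
    def size(header):
        return sum(header_matches(header, p) for p in FEASIBLE)
    candidates = [i for i, h in enumerate(HEADERS) if header_matches(h, pair)]
    return min(candidates, key=lambda i: size(HEADERS[i]))

# The INCORRECT table from the text; 0 means reject.
TABLE = {
    1: [1, 2, 1, 0, 1, 1],
    2: [1, 2, 3, 0, 1, 1],
    3: [1, 2, 1, 4, 1, 1],
    4: [5, 0, 0, 0, 0, 0],
    5: [0, 0, 0, 0, 1, 0],
}
FINAL = {1, 2, 3}  # assumption: rows written "n:" are final, "n." are not

# Revised table: in states 1-3 the s:s column copies the S:S column.
FIXED = {**TABLE,
         1: [2, 2, 1, 0, 1, 1],
         2: [2, 2, 3, 0, 1, 1],
         3: [2, 2, 1, 4, 1, 1]}

def accepts(pairs, table=TABLE):
    """Run the aligned (lexical, surface) pairs through the table."""
    state = 1
    for pair in pairs:
        state = table[state][column_of(pair)]
        if state == 0:
            return False
    return state in FINAL

fox = list(zip("fox+0s#", "fox0es#"))
kiss = list(zip("kiss+0s#", "kiss0es#"))

print(column_of(("s", "s")), column_of(("x", "x")))  # 0 1
print(accepts(fox))           # True
print(accepts(kiss))          # False
print(accepts(kiss, FIXED))   # True
```

The key design point is `column_of`: a pair that matches several headers goes to the header that covers the fewest feasible pairs, which is why s:s never reaches the S:S column.
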
Thus the input form kiss+s FAILS the rule because the final s in the root "kiss" is matched to the s:s column, leaving the table in state 1, where we really want it to be in state 2 (the left context mentions S:S; that is what state 2 is doing for us). The table must be revised so that for the first three states the s:s column has the same state transitions as the S:S column.

Finally, the command SHOW LEXICON <lexicon name> will help if you mistype a lexicon name, etc.

2.0 Writing KIMMO lexicons.

(My apologies if you received this more than once; something went amiss with the AI mailer. Also, I've posted this material on the web site.)

Here we explain how two-level rules work, how they can be implemented as finite-state machines, and how all the types of rule constraints can be translated into finite-state tables. We then summarize the rule semantics. This is followed by a detailed discussion of rule conflicts; specificity and conflicts amongst SUBSETS; and finally, an explanation of the rule file format and the rules in the pc-kimmo file english.rul. It's a lot to read through, but I hope it is complete and will guide you through Spanish.

1. How two-level rules work.

Consider Rule 2 (R2) below.

    R2  t:c ==> ____i

The operator ==> means that lexical t is realized as a surface c only (but not always) in the environment preceding i:i. The correspondence t:c declared in R2 is a special correspondence. All two-level descriptions must also contain a set of *default* correspondences, such as t:t, i:i, etc. (This is the so-called "BOGUS RULE"; it isn't really bogus, it is a default.) The special and default correspondences together make up the total set of valid correspondences, or feasible pairs, that can be used in the description.

If a two-level description containing R2 (and all default correspondences) is applied to the lexical (underlying) form "tati" (without the quote marks), PCKIMMO proceeds as follows to produce the corresponding surface form(s).
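
Before following the trace step by step, the end result can be previewed with a brute-force sketch. This is only an illustration of what the ==> constraint permits, not of how PCKIMMO searches (it walks the automata from left to right rather than enumerating candidates). The names `CORRESPONDENCES`, `r2_ok` and `generate` are hypothetical, and the alphabet is limited to the three characters of "tati".

```python
from itertools import product

# Default (identity) correspondences plus the special pair t:c from R2.
# Hypothetical, minimal alphabet: just enough for the form "tati".
CORRESPONDENCES = {"t": ["t", "c"], "a": ["a"], "i": ["i"]}

def r2_ok(lexical, surface):
    """R2 (t:c ==> ___i): each t:c pair must be immediately followed by i:i."""
    pairs = list(zip(lexical, surface))
    for k, pair in enumerate(pairs):
        if pair == ("t", "c"):
            if k + 1 >= len(pairs) or pairs[k + 1] != ("i", "i"):
                return False
    return True

def generate(lexical):
    """Enumerate every surface string built from feasible pairs, keep those R2 allows."""
    options = [CORRESPONDENCES[ch] for ch in lexical]
    return ["".join(s) for s in product(*options) if r2_ok(lexical, s)]

print(generate("tati"))  # ['tati', 'taci']
```

Both forms survive because ==> says t:c may occur only before i:i but does not force it to occur there; the first t, which precedes a, can only surface as t.
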
(NOTE: this is why you can use GENERATE without a dictionary and JUST the .rul file.) Beginning with the first character of the input form, it looks to see if there is a correspondence declared for it. Due to R2, it will find that lexical t can correspond to surface c, so it will begin by positing that correspondence.

    Lexical:  t a t i
              | | | |
    Rule:     R2
              |
    Surface:  c

At this point the generator has entered R2. For the posited t:c

