U of I CS 421 - Regular expressions - D2563951

Home> Schools> University of Illinois> Computer Science (CS) > CS 421> Regular expressions

DOC PREVIEW

U of I CS 421 - Regular expressions

School name University of Illinois

Course Cs 421- Natural Language Processing

Pages 31

This preview shows page 1-2-14-15-30-31 out of 31 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 31 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

CS 421 Lecture 6: Regular expressions Announcements Lecture outline Regular expressions Ocamllex6/8/20091Announcements MP2 extension and update New due date 1:00pm Friday, June 12 Problem 10 has been updated “Valid” old solutions will get full credit MP3 has been posted Due 1:00pm Wed, June 17 Warning: more work than the first MPs Collaboration is allowed in two-person teamsOverview of Ocamllex Automatic OCaml lexer generatorOCaml definition of a lexing functionSpecification of tokens via regular expressionsOcamllexRegular expressions A regular expression is one ofε, a.k.a. “” ‘a’ for any character a r1r2, where r1and r2are regular expr’s r1|r2, where r1and r2are regular expr’s r*, where r is a regualr expr Ø Every regular expr r represents a set of strings, denoted L(r)Languageof rRegular expression examples L(‘a’ ’b’ ’c’) = {“abc”} L((‘a’ | ’b’) ‘c’) = {“ac”, “bc”} L((‘a’ | ‘b’)* ‘c’) = {“c”, “ac”, “bc”, “aac”, “abc”, …}Regular expression examples Keywords: ‘c’ ‘a’ ‘s’ ‘e’ | ‘c’ ‘ ‘l’ ‘a’ ‘s’ ‘s’ | … Operators ‘<‘ | ‘<‘ ‘<‘ | ‘<‘ ‘=‘ | … Identifiers (‘a’ | ‘b’ | … | ‘z’ | ‘A’ | … | ‘Z’)(‘a’ | ‘b’ | … | ‘z’ | ‘A’ | … | ‘Z’ | ‘0’ | ‘1’ | … | ‘9’)* Int literals ??Abbreviations “c1c2… cn” => ‘c1’ ‘c2’ … ‘cn’ [‘a’ – ‘z’ ‘#’] => ‘a’ | ‘b’ | … | ‘z’ | ‘#’ [‘a’ ‘w’ ‘#’] => ‘a’ | ‘w’ | ‘#’ r+ => r(r*) r? => r | “” [^ ‘a’ – ‘z’] => all chars except ‘a’ – ‘z’(complement of ‘a’ – ‘z’) _ => any single charRegular expressions examples Floating-point literal[‘0’-’9’]+ . [‘0’-’9’]+ ([‘e’’E’] [‘+’’-’]? [‘0’-’9’]+)? Note: r* = (r+)?Regular expression examples C++ style comments (// …)“//” [^ ‘\n’]* ‘\n’ C style comments (/* … */)“/*” ([^ ‘*’] | ‘*’+ [^ ‘*’’/’])* “*/”Implementing regular expressions Translate REs to NFAs Translate NFAs to DFAsLexing with regular expressions Create one large RE:RE for case {action for case}| RE for class {action for class}| …| RE for idents {action for idents}| RE for FP lits {action for FP lits}| RE for Int lits {action for int lits} Then add some actionsLexing with regular expressions (cont.) Ambiguous cases: Two tokens found, one longer Choose the longer one Two tokens found, the same length Choose the earlier reg. expr.Ocamllex mechanics Put table of regular expressions and corresponding actions (written in Ocaml) into a file <filename>.mll Callocamllex <filename.mll> Produces Ocaml code for a lexical analyzer in<filename>.mlOcamllex input{header} let ident = regexp …rule entrypoint[arg1 … argn] = parse regexp {action}and entrypoint[arg1 … argn] =parse … and …{trailer}Ocamllex input{header} let ident = regexp …rule entrypoint[arg1 … argn] = parse regexp {action}and entrypoint[arg1 … argn] =parse … and …{trailer}header – ocaml defnsEntrypoint – name of gen’d function with argsarg1, …, argn, lexbuftrailer – ocaml defnsOcamllex inputheaderand trailercontain arbitrary Ocaml code put at top and bottom of <filename>.ml let ident= regexp… introduces identfor use in later regular expressionsSample inputrule main = parse[‘0’-’9’]+ { print_string “Int\n” }| [‘0’-’9’]+’.’[‘0’-’9’] { print_string “Float\n” }| [‘a’-’z’]+ { print_string “String\n” }| _ { main lexbuf }{let newlexbuf = (Lexing.from_channel stdin) inprint_string “Ready to lex.\n”;main newlexbuf}Ocamllex output <filename>.ml contains one lexing function per entrypoint Name of function is name given for entrypoint Each entry point becomes an Ocaml function that takes n+1 arguments The extra implicit argument being of type Lexing.lexbufarg1…argnare for use in actionOcamllex regular expressions ‘a’ : single quoted characters for letters _ : matches any character eof : special end_of_file marker e1e2: concatenation “string” : concatenation of a sequence of characters e1|e2: choiceOcamllex regular expressions [c1-c2] : choice of any character between first and second, inclusive, as determined by character codes [^c1-c2] : choice of any character NOT in the set e* : same as before e+ : same as e e* e? : option – was e1|εOcamllex regular expression e1#e2: the characters in e1but not in e2; e1and e2must describe just sets of characters ident: abbreviation for earlier reg exp in let ident= regexp e1 as id: binds the result of e1 to id, to be used in the associated action Example([‘0’-’9’]+ as decpart ‘.’ ([‘0’-’9’]+ as fracpart …Ocamllex manual More details can be found at http://caml.inria.fr/pub/docs/manual-ocaml/manual026.htmlExample: test.mll{ type result = Int of int | Float of float | String of string }let digit = [‘0’-’9’]let digits = digit+let lower_case = [‘a’-’z’]let upper_case = [‘A’-’Z’]let letter = lower_case | upper_caselet letters = letter+...Example: test.mllrule main = parsedigits’.’digits as f { Float (float_of_string f) }| digits as n { Int (int_of_string n) }| letters as s { String s }| _ { main lexbuf }{let newlexbuf = (Lexing.from_channel stdin) inprint_string “Ready to lex.\n”;main newlexbuf }Example> ocamllex test.mll> ocaml# #use “test.ml”...val main : Lexing.lexbuf -> result = <fun>Ready to lex.hi there 234 5.6- : result = String “hi”# What happened to the rest?Example# let b = Lexing.from_channel stdin;;# main b;;hi 789 there- : result = String “hi”# main b;;- : result = Int 789# main b;;- : result = String “there”Problem How to get the lexer to look at more than the first token? Answer 1: repeatedly call lexing function Answer 2: actionhas to tell it to – recursive calls. Value of action is token list instead of token. Note: already used this with the _ caseExamplerule main = parsedigits’.’digits as f { Float (float_of_string f):: main lexbuf }| digits as n { Int (int_of_string n):: main lexbuf }| letters as s { String s :: main lexbuf }| eof { [] }| _ { main lexbuf

View Full Document