Unformatted text preview:

94CS 536 Spring 2007©ExamplesLet D be the ten single digitsand let L be the set of all 52letters. Then• A Java or C++ single-line commentthat begins with // and ends with Eolcan be defined as:Comment = // Not(Eol)* Eol• A fixed decimal literal (e.g., 12.345)can be defined as:Lit = D+. D+•An optionally signed integer literalcan be defined as:IntLiteral = ( '+' | − | λ ) D+(Why the quotes on the plus?)95CS 536 Spring 2007©• A comment delimited by ## markers,which allows single #’s within thecomment body:Comment2 =## ((# | λ) Not(#) )* ##All finite sets and many infinite setsare regular. But not all infinite setsare regular. Consider the set ofbalanced brackets of the form[ [ […] ] ].This set is defined formally as{ [m ]m | m ≥ 1 }.This set is known not to be regular.Any regular expression that tries todefine it either does not get allbalanced nestings or it includes extra,unwanted strings.96CS 536 Spring 2007©Finite Automata and ScannersA finite automaton (FA) can beused to recognize the tokensspecified by a regularexpression. FAs are simple,idealized computers thatrecognize strings belonging toregular sets. An FA consists of:• A finite set of states• A set of transitions (or moves) fromone state to another, labeled withcharacters in Σ• A special state called the start state• A subset of the states called theaccepting, or final, states97CS 536 Spring 2007©These four components of afinite automaton are oftenrepresented graphically:Finite automata (the plural ofautomaton is automata) arerepresented graphically usingtransition diagrams. We start atthe start state. If the next inputcharacter matches the label onis a transitionis the start stateis an accepting stateis a state98CS 536 Spring 2007©a transition from the currentstate, we go to the state itpoints to. If no move ispossible, we stop. If we finishin an accepting state, thesequence of characters readforms a valid token; otherwise,we have not seen a valid token.In this diagram, the validtokens are the stringsdescribed by the regularexpression (a b (c)+ )+.abcca99CS 536 Spring 2007©Deterministic Finite AutomataAs an abbreviation, a transitionmay be labeled with more thanone character (for example,Not(c)). The transition may betaken if the current inputcharacter matches any of thecharacters labeling the transition.If an FA always has a uniquetransition (for a given state andcharacter), the FA is deterministic(that is, a deterministic FA, orDFA). Deterministic finiteautomata are easy to programand often drive a scanner.If there are transitions to morethan one state for some character,then the FA is nondeterministic(that is, an NFA).100CS 536 Spring 2007©A DFA is conveniently representedin a computer by a transitiontable. A transition table, T, is atwo dimensional array indexed bya DFA state and a vocabularysymbol.Table entries are either a DFAstate or an error flag (oftenrepresented as a blank tableentry). If we are in state s, andread character c, then T[s,c] willbe the next state we visit, or T[s,c]will contain an error markerindicating that c cannot extendthe current token. For example,the regular expression// Not(Eol)* Eolwhich defines a Java or C++single-line comment, might betranslated into101CS 536 Spring 2007©The corresponding transitiontable is:A complete transition tablecontains one column for eachcharacter. To save space, tablecompression may be used. Onlynon-error entries are explicitlyrepresented in the table, usinghashing, indirection or linkedstructures.State Character/ Eol a b …12233343334eofEol//Not(Eol)1234102CS 536 Spring 2007©All regular expressions can betranslated into DFAs that accept(as valid tokens) the stringsdefined by the regularexpressions. This translation canbe done manually by aprogrammer or automaticallyusing a scanner generator.A DFA can be coded in:• Table-driven form• Explicit control formIn the table-driven form, thetransition table that defines aDFA’s actions is explicitlyrepresented in a run-time tablethat is “interpreted” by a driverprogram.In the direct control form, thetransition table that defines aDFA’s actions appears implicitly asthe control logic of the program.103CS 536 Spring 2007©For example, supposeCurrentChar is the current inputcharacter. End of file isrepresented by a special charactervalue, eof. Using the DFA for theJava comments shown earlier, atable-driven scanner is:State = StartStatewhile (true){if (CurrentChar == eof)breakNextState =T[State][CurrentChar] if(NextState == error)breakState = NextStateread(CurrentChar)}if (State in AcceptingStates)// Process valid tokenelse // Signal a lexical error104CS 536 Spring 2007©This form of scanner is producedby a scanner generator; it isdefinition-independent. Thescanner is a driver that can scanany token if T contains theappropriate transition table.Here is an explicit-control scannerfor the same comment definition:if (CurrentChar == '/'){read(CurrentChar)if (CurrentChar == '/')repeatread(CurrentChar)until (CurrentChar in{eol, eof})else //Signal lexical errorelse // Signal lexical errorif (CurrentChar == eol)// Process valid tokenelse //Signal lexical error105CS 536 Spring 2007©The token being scanned is“hardwired” into the logic of thecode. The scanner is usually easyto read and often is moreefficient, but is specific to a singletoken definition.106CS 536 Spring 2007©More Examples• A FORTRAN-like real literal (whichrequires digits on either or both sidesof a decimal point, or just a string ofdigits) can be defined asRealLit = (D+(λ | . )) | (D*. D+)This corresponds to the DFA.DDDD.107CS 536 Spring 2007©• An identifier consisting of letters,digits, and underscores, which beginswith a letter and allows no adjacentor trailing underscores, may bedefined asID = L (L | D)* ( _ (L | D)+)*This definition includes identifierslike sum or unit_cost, butexcludes _one and two_ andgrand___total. The DFA is:L | DLL | D_108CS 536 Spring 2007©Lex/Flex/JLexLex is a well-known Unix scannergenerator. It builds a scanner, inC, from a set of regularexpressions that define thetokens to be scanned.Flex is a newer and faster versionof Lex.Jlex is a Java version of Lex. Itgenerates a scanner coded inJava, though its regularexpression definitions are veryclose to those used by Lex andFlex.Lex, Flex and JLex are largely non-procedural. You don’t need to tellthe tools how to scan. All youneed to tell it what you wantscanned (by giving it definitionsof valid tokens).109CS 536


View Full Document

UW-Madison CS 536 - Lecture Notes

Download Lecture Notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?