94CS 536 Spring 2007©ExamplesLet D be the ten single digitsand let L be the set of all 52letters. Then• A Java or C++ single-line commentthat begins with // and ends with Eolcan be defined as:Comment = // Not(Eol)* Eol• A fixed decimal literal (e.g., 12.345)can be defined as:Lit = D+. D+•An optionally signed integer literalcan be defined as:IntLiteral = ( '+' | − | λ ) D+(Why the quotes on the plus?)95CS 536 Spring 2007©• A comment delimited by ## markers,which allows single #’s within thecomment body:Comment2 =## ((# | λ) Not(#) )* ##All finite sets and many infinite setsare regular. But not all infinite setsare regular. Consider the set ofbalanced brackets of the form[ [ […] ] ].This set is defined formally as{ [m ]m | m ≥ 1 }.This set is known not to be regular.Any regular expression that tries todefine it either does not get allbalanced nestings or it includes extra,unwanted strings.96CS 536 Spring 2007©Finite Automata and ScannersA finite automaton (FA) can beused to recognize the tokensspecified by a regularexpression. FAs are simple,idealized computers thatrecognize strings belonging toregular sets. An FA consists of:• A finite set of states• A set of transitions (or moves) fromone state to another, labeled withcharacters in Σ• A special state called the start state• A subset of the states called theaccepting, or final, states97CS 536 Spring 2007©These four components of afinite automaton are oftenrepresented graphically:Finite automata (the plural ofautomaton is automata) arerepresented graphically usingtransition diagrams. We start atthe start state. If the next inputcharacter matches the label onis a transitionis the start stateis an accepting stateis a state98CS 536 Spring 2007©a transition from the currentstate, we go to the state itpoints to. If no move ispossible, we stop. If we finishin an accepting state, thesequence of characters readforms a valid token; otherwise,we have not seen a valid token.In this diagram, the validtokens are the stringsdescribed by the regularexpression (a b (c)+ )+.abcca99CS 536 Spring 2007©Deterministic Finite AutomataAs an abbreviation, a transitionmay be labeled with more thanone character (for example,Not(c)). The transition may betaken if the current inputcharacter matches any of thecharacters labeling the transition.If an FA always has a uniquetransition (for a given state andcharacter), the FA is deterministic(that is, a deterministic FA, orDFA). Deterministic finiteautomata are easy to programand often drive a scanner.If there are transitions to morethan one state for some character,then the FA is nondeterministic(that is, an NFA).100CS 536 Spring 2007©A DFA is conveniently representedin a computer by a transitiontable. A transition table, T, is atwo dimensional array indexed bya DFA state and a vocabularysymbol.Table entries are either a DFAstate or an error flag (oftenrepresented as a blank tableentry). If we are in state s, andread character c, then T[s,c] willbe the next state we visit, or T[s,c]will contain an error markerindicating that c cannot extendthe current token. For example,the regular expression// Not(Eol)* Eolwhich defines a Java or C++single-line comment, might betranslated into101CS 536 Spring 2007©The corresponding transitiontable is:A complete transition tablecontains one column for eachcharacter. To save space, tablecompression may be used. Onlynon-error entries are explicitlyrepresented in the table, usinghashing, indirection or linkedstructures.State Character/ Eol a b …12233343334eofEol//Not(Eol)1234102CS 536 Spring 2007©All regular expressions can betranslated into DFAs that accept(as valid tokens) the stringsdefined by the regularexpressions. This translation canbe done manually by aprogrammer or automaticallyusing a scanner generator.A DFA can be coded in:• Table-driven form• Explicit control formIn the table-driven form, thetransition table that defines aDFA’s actions is explicitlyrepresented in a run-time tablethat is “interpreted” by a driverprogram.In the direct control form, thetransition table that defines aDFA’s actions appears implicitly asthe control logic of the program.103CS 536 Spring 2007©For example, supposeCurrentChar is the current inputcharacter. End of file isrepresented by a special charactervalue, eof. Using the DFA for theJava comments shown earlier, atable-driven scanner is:State = StartStatewhile (true){if (CurrentChar == eof)breakNextState =T[State][CurrentChar] if(NextState == error)breakState = NextStateread(CurrentChar)}if (State in AcceptingStates)// Process valid tokenelse // Signal a lexical error104CS 536 Spring 2007©This form of scanner is producedby a scanner generator; it isdefinition-independent. Thescanner is a driver that can scanany token if T contains theappropriate transition table.Here is an explicit-control scannerfor the same comment definition:if (CurrentChar == '/'){read(CurrentChar)if (CurrentChar == '/')repeatread(CurrentChar)until (CurrentChar in{eol, eof})else //Signal lexical errorelse // Signal lexical errorif (CurrentChar == eol)// Process valid tokenelse //Signal lexical error105CS 536 Spring 2007©The token being scanned is“hardwired” into the logic of thecode. The scanner is usually easyto read and often is moreefficient, but is specific to a singletoken definition.106CS 536 Spring 2007©More Examples• A FORTRAN-like real literal (whichrequires digits on either or both sidesof a decimal point, or just a string ofdigits) can be defined asRealLit = (D+(λ | . )) | (D*. D+)This corresponds to the DFA.DDDD.107CS 536 Spring 2007©• An identifier consisting of letters,digits, and underscores, which beginswith a letter and allows no adjacentor trailing underscores, may bedefined asID = L (L | D)* ( _ (L | D)+)*This definition includes identifierslike sum or unit_cost, butexcludes _one and two_ andgrand___total. The DFA is:L | DLL | D_108CS 536 Spring 2007©Lex/Flex/JLexLex is a well-known Unix scannergenerator. It builds a scanner, inC, from a set of regularexpressions that define thetokens to be scanned.Flex is a newer and faster versionof Lex.Jlex is a Java version of Lex. Itgenerates a scanner coded inJava, though its regularexpression definitions are veryclose to those used by Lex andFlex.Lex, Flex and JLex are largely non-procedural. You don’t need to tellthe tools how to scan. All youneed to tell it what you wantscanned (by giving it definitionsof valid tokens).109CS 536
View Full Document