LSU CSC 4351 - Tokenization in Linear Time

“Maximal-Munch” Tokenization in Linear Time

THOMAS REPS
University of Wisconsin

The lexical-analysis (or scanning) phase of a compiler attempts to partition an input string into a sequence of tokens. The convention in most languages is that the input is scanned left to right, and each token identified is a “maximal munch” of the remaining input: the longest prefix of the remaining input that is a token of the language. Although most of the standard compiler textbooks present a way to perform maximal-munch tokenization, the algorithm they describe is one that, for certain sets of token definitions, can cause the scanner to exhibit quadratic behavior in the worst case. In this article, we show that maximal-munch tokenization can always be performed in time linear in the size of the input.

CR Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors: compilers; F.1.1 [Computation by Abstract Devices]: Models of Computation: automata; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems: pattern matching; I.5.4 [Pattern Recognition]: Applications: text processing

General Terms: Algorithms, Theory

Additional Key Words and Phrases: Backtracking, dynamic programming, memoization, tabulation, tokenization

This work was supported in part by the National Science Foundation under grant CCR-9625667 and by the Defense Advanced Research Projects Agency (monitored by the Office of Naval Research under contracts N00014-92-J-1937 and N00014-97-1-0114). The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes, notwithstanding any copyright notices affixed thereon. The views and conclusions contained herein are those of the authors, and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the above government agencies or the U.S. Government.

Author’s address: Computer Sciences Department, University of Wisconsin, 1210 West Dayton Street, Madison, WI 53706; email: [email protected]

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

© 1998 ACM 0164-0925/98/0300-0259 $5.00
ACM Transactions on Programming Languages and Systems, Vol. 20, No. 2, March 1998, Pages 259–273.

1. INTRODUCTION

The lexical-analysis (or scanning) phase of a compiler attempts to partition an input string into a sequence of tokens. The convention in most languages is that the input is scanned left to right, and each token identified is a “maximal munch” of the remaining input: the longest prefix of the remaining input that is a token of the language. For example, the string “12”[1] should be tokenized as a single integer token “12,” rather than as the juxtaposition of two integer tokens “1” and “2.” Similarly, “123.456e210” should be recognized as one floating-point numeral rather than, say, the juxtaposition of the four tokens “123,” “.456,” “e,” and “210.” (Usually, there is also a rule that when two token definitions match the same string, the earliest token definition takes precedence. However, this rule is invoked only if there is a tie over the longest match.)

Most textbooks on compiling have extensive discussions of lexical analysis in terms of finite-state automata and regular expressions: token classes are defined by a set of regular expressions R_i, 1 ≤ i ≤ k, and the lexical analyzer is based on some form of finite-state automaton for recognizing the language L(R_1 | R_2 | ... | R_k).
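The maximal-munch rule and its rule-order tie-breaker can be illustrated with a small scanner that, at each position, tries every token definition and keeps the longest match. This is a hedged sketch, not the article's algorithm: the token classes FLOAT, INT, and ID are invented for illustration.

```python
import re

# Hypothetical token classes, listed in priority order (an earlier
# rule wins when two rules match prefixes of equal length).
TOKEN_DEFS = [
    ("FLOAT", re.compile(r"\d+\.\d+(e[+-]?\d+)?")),
    ("INT",   re.compile(r"\d+")),
    ("ID",    re.compile(r"[A-Za-z_]\w*")),
]

def tokenize(s):
    """Maximal-munch scan: at each position take the longest match
    among all token definitions, breaking ties by rule order."""
    tokens, i = [], 0
    while i < len(s):
        best = None  # (match length, token name, lexeme)
        for name, pat in TOKEN_DEFS:
            m = pat.match(s, i)  # anchored match starting at position i
            # strict '>' keeps the earlier rule on a tie
            if m and (best is None or m.end() - i > best[0]):
                best = (m.end() - i, name, m.group())
        if best is None:
            raise ValueError(f"no token matches at position {i}")
        tokens.append((best[1], best[2]))
        i += best[0]
    return tokens
```

With these definitions, `tokenize("12")` yields one INT token rather than two, and `tokenize("123.456e210")` yields a single FLOAT token, matching the behavior described above.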
However, the treatment is unsatisfactory in one respect: the theory of finite-state automata assumes that the end of the input string, i.e., the right-hand-side boundary of the candidate for recognition, is known a priori, whereas a scanner must identify the next token without knowing a definite bound on the extent of the token.

Most of the standard compiler textbooks discuss this issue briefly.[2] For example, Aho and Ullman’s 1977 book discusses the issue in the context of a lexical analyzer based on deterministic finite-state automata (DFAs) [Aho and Ullman 1977, pp. 109–110]:

    There are several nuances in this procedure of which the reader should be aware. First, there are in the combined NFA several different “accepting states”. That is, the accepting state of each N_i indicates that its own token, P_i, has been found. When we convert to a DFA, the subsets we construct may include several different final states. Moreover, the final states lose some of their significance, since we are looking for the longest prefix of the input which matches some pattern. After reaching a final state, the lexical analyzer must continue to simulate the DFA until it reaches a state with no next state for the current input symbol. Let us say we reach termination when we meet an input symbol from which the DFA cannot proceed. We must presume that the programming language is designed so that a valid program cannot entirely fill the input buffer ... without reaching termination ... Upon reaching termination, it is necessary to review the states of the DFA which we have entered while processing the input. Each such state represents a subset of the NFA’s states, and we look for the last DFA state which includes a final state for one of the pattern-recognizing NFA’s N_i. That final state indicates which token we have found. If none of the states which the DFA has entered includes any final states of the NFA, then we have an error condition.
    If the last DFA state to include a final NFA state in fact includes more than one final state, then the final state for the pattern listed first has priority.

[1] Unless otherwise stated, throughout this article the open and closing quotation marks here should not be considered part of the string being referred to.
[2] Such as Aho and Ullman [1977], Waite and Goos [1983], Aho et al. [1986], Fischer and LeBlanc [1988], and Wilhelm and Maurer [1995].

The discussion of the problem in the 1986 book by Aho, Sethi, and Ullman is similar, except that they assume that
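The backtracking procedure quoted above, and one way that tabulation (per the article's keywords: memoization, tabulation) can avoid its worst case, can be sketched as follows. This is a minimal reconstruction under stated assumptions: the dict-based DFA encoding and the token set {"abc", "(abc)*d"} are invented for illustration, and `scan_linear` is a sketch in the spirit of failure-memoization, not the article's algorithm verbatim.

```python
# DFA for the hypothetical token set {"abc", "(abc)*d"}.
# State 3 accepts token ABC (one "abc"); state 7 accepts token D.
DFA = {
    (0, "a"): 1, (1, "b"): 2, (2, "c"): 3,   # first "abc"
    (3, "a"): 4, (4, "b"): 5, (5, "c"): 6,   # further "abc" groups
    (6, "a"): 4,
    (0, "d"): 7, (3, "d"): 7, (6, "d"): 7,   # trailing "d"
}
ACCEPTING = {3: "ABC", 7: "D"}

def scan(dfa, start, accepting, s):
    """Textbook backtracking scanner: run the DFA as far as possible,
    then back up to the last accepting state seen."""
    tokens, pos = [], 0
    while pos < len(s):
        state, i = start, pos
        last_acc = None                      # (end position, token name)
        while i < len(s) and (state, s[i]) in dfa:
            state = dfa[(state, s[i])]
            i += 1
            if state in accepting:
                last_acc = (i, accepting[state])
        if last_acc is None:
            raise ValueError(f"no token at position {pos}")
        end, name = last_acc
        tokens.append((name, s[pos:end]))
        pos = end                            # back up and restart here
    return tokens

def scan_linear(dfa, start, accepting, s):
    """Same scanner, but configurations (state, position) known to reach
    no further accepting state are tabulated, so no configuration is
    ever re-walked: total work is O(|states| * len(s))."""
    tokens, pos = [], 0
    failed = set()                           # failing (state, position) pairs
    while pos < len(s):
        state, i = start, pos
        last_acc = None
        path = [(state, i)]                  # configurations visited this scan
        while (i < len(s) and (state, i) not in failed
               and (state, s[i]) in dfa):
            state = dfa[(state, s[i])]
            i += 1
            if state in accepting:
                last_acc = (i, accepting[state])
            path.append((state, i))
        if last_acc is None:
            raise ValueError(f"no token at position {pos}")
        end, name = last_acc
        # Every configuration visited at or beyond `end` led to no later
        # accept; remember that so future scans stop there immediately.
        for st, j in path:
            if j >= end:
                failed.add((st, j))
        tokens.append((name, s[pos:end]))
        pos = end
    return tokens
```

On inputs of the form "abcabc...abc", `scan` runs to the end of the remaining input hoping for a trailing "d" before every token it emits, so its total work is quadratic in the input length; `scan_linear` visits each (state, position) pair at most once, bounding the work linearly in the input size.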

