Penn CIT 594 - Tokenizers

Tokenizers
Jan 14, 2019

Contents

- Tokens
- Tokenizers as state machines
- TokenType
- Token
- Additions to the Token class
- The constructor and hasNext()
- The shell of next()
- The READY state
- The IN_NUMBER state
- The IN_VARIABLE state
- The default case
- java.util.StringTokenizer
- java.io.StreamTokenizer
- The End

Tokens

- A tokenizer is a program that extracts tokens from an input stream
- A token has two parts:
    - Its value
    - Its kind, or type
- For example, if we tokenize "while (x >= 0)" we might get these tokens:
    - "while", keyword
    - "(", punctuation
    - "x", name
    - ">=", operator
    - "0", integer
    - ")", punctuation

Tokenizers as state machines

- Tokenizers can be implemented as state machines, but with these important differences:
    - To succeed (recognize a token), the tokenizer does not have to reach the end of input; it only has to reach a final state
    - When the tokenizer returns a token, the remainder of the input string is kept for use in getting the remaining tokens
- Tokenizers are almost always implemented as state machines
- We’ll do a quick tokenizer to recognize tokens in arithmetic expressions:
    - Integers (digits only)
    - Variables (letters and digits, starting with a letter)
    - Operators, + - * / %
    - Parentheses, ( )
    - Errors (anything not in the above list)

TokenType

    public enum TokenType {
        INTEGER, VARIABLE, OPERATOR, PARENTHESIS, ERROR;
    }

Token

    public class Token {
        private TokenType type;
        private String value;

        public Token(TokenType type, String value) {
            this.type = type;
            this.value = value;
        }

        public TokenType getType() { return type; }

        public String getValue() { return value; }
    }

Additions to the Token class

- For my JUnit testing, I needed to ask whether my Tokenizer was returning the correct Tokens

    @Override
    public boolean equals(Object object) {
        // Guard the cast so that null or a non-Token argument yields false
        // instead of throwing an exception
        if (!(object instanceof Token)) return false;
        Token that = (Token) object;
        return this.type == that.type && this.value.equals(that.value);
    }

- Since my tests were failing, I wanted to see what tokens I was actually getting

    @Override
    public String toString() {
        return value + ":" + type;
    }

The constructor and hasNext()

    public class Tokenizer {
        private String input;
        private int position; // index of the last character consumed

        public Tokenizer(String input) {
            this.input = input.trim() + " "; // to simplify getting last token
            position = -1;
        }

        public boolean hasNext() {
            // the last non-blank character is at index input.length() - 2
            return position < input.length() - 2;
        }

        public Token next() { ... }
    }

The shell of next()

    public class Tokenizer {
        private enum States { READY, IN_NUMBER, IN_VARIABLE, ERROR };

        public Token next() {
            States state;
            String value = "";
            if (!hasNext()) {
                throw new IllegalStateException("No more tokens!");
            }
            state = States.READY;
            while ((++position) < input.length()) {
                char ch = input.charAt(position);
                switch (state) {
                    case READY: { ... }
                    case IN_VARIABLE: { ... }
                    case IN_NUMBER: { ... }
                    default: { ... }
                    return new Token(TokenType.ERROR, value);
                }
            }
            assert false; // should never get here
            return null;
        }
    }

The READY state

    case READY:
        value = ch + "";
        if (Character.isWhitespace(ch)) break; // skip whitespace between tokens
        if ("()".contains(ch + "")) {
            return new Token(TokenType.PARENTHESIS, value);
        }
        if ("+-*/%".contains(ch + "")) {
            return new Token(TokenType.OPERATOR, value);
        }
        if (Character.isLetter(ch)) {
            state = States.IN_VARIABLE;
            break;
        }
        if (Character.isDigit(ch)) {
            state = States.IN_NUMBER;
            break;
        }
        return new Token(TokenType.ERROR, value);

The IN_NUMBER state

    case IN_NUMBER:
        if (Character.isDigit(ch)) {
            value += ch;
            break;
        } else {
            position--; // save char for next time
            return new Token(TokenType.INTEGER, value);
        }

The IN_VARIABLE state

    case IN_VARIABLE:
        if (Character.isLetter(ch) || Character.isDigit(ch)) {
            value += ch;
            break;
        } else {
            position--; // save char for next time
            return new Token(TokenType.VARIABLE, value);
        }

The default case

    default:
        return new Token(TokenType.ERROR, value);

java.util.StringTokenizer

- StringTokenizer is a trivial tokenizer provided by Sun
- Everything is either a “token” or a “delimiter”
- The most important methods are hasMoreTokens() and nextToken()
- There are three constructors:
    - StringTokenizer(String str)
        - Delimiters are whitespace characters; any sequence of non-whitespace characters is returned as a token
    - StringTokenizer(String str, String delim)
        - Same as above, except you get to specify which characters are delimiters
    - StringTokenizer(String str, String delim, boolean returnDelims)
        - Same as above, except you get to say you also want the delimiters returned as tokens

java.io.StreamTokenizer

- StreamTokenizer is a much more powerful (and much more complex) tokenizer
- It is basically capable of tokenizing C and Java programs, including integers, doubles, and comments
- There are a large number of possible settings, so that the tokenizer can be customized
- The constructor is StreamTokenizer(Reader r), where Reader is an abstract class for reading character streams
- The most important method is int nextToken(), where the returned int tells you what kind of token it found
- Once you know what kind of token has been found, you access fields of the tokenizer to get its value
- I’m not going to cover StreamTokenizer in my lectures
- All the details are in the Java API
- You may want to use StreamTokenizer in subsequent assignments

