Tokenizers
Jan 14, 2019

Tokens

- A tokenizer is a program that extracts tokens from an input stream
- A token has two parts:
    - Its value
    - Its kind, or type
- For example, if we tokenize "while (x >= 0)" we might get these tokens:
    - "while", keyword
    - "(", punctuation
    - "x", name
    - ">=", operator
    - "0", integer
    - ")", punctuation

Tokenizers as state machines

- Tokenizers can be implemented as state machines, but with these important differences:
    - To succeed (recognize a token), the tokenizer does not have to reach the end of input; it only has to reach a final state
    - When the tokenizer returns a token, the remainder of the input string is kept for use in getting the remaining tokens
- Tokenizers are almost always implemented as state machines
- We’ll do a quick tokenizer to recognize tokens in arithmetic expressions:
    - Integers (digits only)
    - Variables (letters and digits, starting with a letter)
    - Operators, + - * / %
    - Parentheses, ( )
    - Errors (anything not in the above list)

TokenType

    public enum TokenType {
        INTEGER, VARIABLE, OPERATOR, PARENTHESIS, ERROR;
    }

Token

    public class Token {
        private TokenType type;
        private String value;

        public Token(TokenType type, String value) {
            this.type = type;
            this.value = value;
        }

        public TokenType getType() { return type; }

        public String getValue() { return value; }
    }

Additions to the Token class

- For my JUnit testing, I needed to ask whether my Tokenizer was returning the correct Tokens

    @Override
    public boolean equals(Object object) {
        // Guard the cast: null or a non-Token is simply "not equal"
        if (!(object instanceof Token)) return false;
        Token that = (Token) object;
        return this.type == that.type && this.value.equals(that.value);
    }
    // Note: override hashCode() as well if Tokens will be stored in hash-based collections

- Since my tests were failing, I wanted to see what tokens I was actually getting

    @Override
    public String toString() {
        return value + ":" + type;
    }

The constructor and hasNext()

    public class Tokenizer {
        private String input;
        private int position; // index of the last character consumed

        public Tokenizer(String input) {
            this.input = input.trim() + " "; // to simplify getting last token
            position = -1;
        }

        public boolean hasNext() {
            // The last character is the added space, so the last real
            // character is at index input.length() - 2
            return position < input.length() - 2;
        }

        public Token next() { ... }
    }

The shell of next()

    public class Tokenizer {
        private enum States { READY, IN_NUMBER, IN_VARIABLE, ERROR };

        public Token next() {
            States state;
            String value = "";
            if (!hasNext()) {
                throw new IllegalStateException("No more tokens!");
            }
            state = States.READY;
            while ((++position) < input.length()) {
                char ch = input.charAt(position);
                switch (state) {
                    case READY: { ... }
                    case IN_VARIABLE: { ... }
                    case IN_NUMBER: { ... }
                    default: { ... }
                        return new Token(TokenType.ERROR, value);
                }
            }
            assert false; // should never get here
            return null;
        }
    }

The READY state

    case READY:
        value = ch + "";
        if (Character.isWhitespace(ch)) break; // skip whitespace between tokens
        if ("()".contains(ch + "")) {
            return new Token(TokenType.PARENTHESIS, value);
        }
        if ("+-*/%".contains(ch + "")) {
            return new Token(TokenType.OPERATOR, value);
        }
        if (Character.isLetter(ch)) {
            state = States.IN_VARIABLE;
            break;
        }
        if (Character.isDigit(ch)) {
            state = States.IN_NUMBER;
            break;
        }
        return new Token(TokenType.ERROR, value);

The IN_NUMBER state

    case IN_NUMBER:
        if (Character.isDigit(ch)) {
            value += ch;
            break;
        } else {
            position--; // save char for next time
            return new Token(TokenType.INTEGER, value);
        }

The IN_VARIABLE state

    case IN_VARIABLE:
        if (Character.isLetter(ch) || Character.isDigit(ch)) {
            value += ch;
            break;
        } else {
            position--; // save char for next time
            return new Token(TokenType.VARIABLE, value);
        }

The default case

    default:
        return new Token(TokenType.ERROR, value);

java.util.StringTokenizer

- StringTokenizer is a trivial tokenizer provided by Sun
- Everything is either a “token” or a “delimiter”
- The most important methods are hasMoreTokens() and nextToken()
- There are three constructors:
    - StringTokenizer(String str)
        - Delimiters are whitespace characters; any sequence of non-whitespace characters is returned as a token
    - StringTokenizer(String str, String delim)
        - Same as above, except you get to specify which characters are delimiters
    - StringTokenizer(String str, String delim, boolean returnDelims)
        - Same as above, except you get to say you also want the delimiters returned as tokens

java.io.StreamTokenizer

- StreamTokenizer is a much more powerful (and much more complex) tokenizer
- It is basically capable of tokenizing C and Java programs, including integers, doubles, and comments
- There are a large number of possible settings, so that the tokenizer can be customized
- The constructor is StreamTokenizer(Reader r), where Reader is an abstract class for reading character streams
- The most important method is int nextToken(), where the returned int tells you what kind of token it found
- Once you know what kind of token has been found, you access fields of the tokenizer to get its value
- I’m not going to cover StreamTokenizer in my lectures
- All the details are in the Java API
- You may want to use StreamTokenizer in subsequent assignments
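
Going back to java.util.StringTokenizer for a moment: here is a minimal sketch of the three-argument constructor applied to the same kind of arithmetic expression used earlier. The class name StringTokenizerDemo, the method split, and the choice of delimiter string are my own assumptions for illustration, not part of the slides. Note that StringTokenizer only separates tokens from delimiters; it does not classify them the way the hand-written Tokenizer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class StringTokenizerDemo {
    // Hypothetical helper: splits an arithmetic expression into strings.
    // Passing true for returnDelims means operators and parentheses are
    // returned as one-character tokens instead of being thrown away.
    static List<String> split(String expression) {
        StringTokenizer st = new StringTokenizer(expression, "+-*/%() ", true);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (!token.trim().isEmpty()) { // drop the space delimiters
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("(x1 + 23) * y"));
    }
}
```

Each delimiter comes back as its own one-character string, so spaces have to be filtered out by hand, and there is no notion of token type or of an error token.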
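
For comparison, here is a rough sketch of java.io.StreamTokenizer reading an arithmetic expression, showing the pattern described above: call nextToken(), look at the kind of token, then read the value from a field (sval for words, nval for numbers). The class name StreamTokenizerDemo, the method describe, and the "kind:value" output format are assumptions made up for this example. The two ordinaryChar calls are also my choice: by default '/' starts a comment and '-' is treated as part of a number, which is not what we want for operators.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamTokenizerDemo {
    // Hypothetical helper: returns each token as a "kind:value" string.
    static List<String> describe(String expression) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(expression));
        st.ordinaryChar('/'); // treat '/' as an operator, not a comment start
        st.ordinaryChar('-'); // treat '-' as an operator, not a number sign
        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_NUMBER:
                        tokens.add("number:" + st.nval); // nval is a double
                        break;
                    case StreamTokenizer.TT_WORD:
                        tokens.add("word:" + st.sval);
                        break;
                    default:
                        // For ordinary characters, ttype is the character itself
                        tokens.add("char:" + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(describe("(x1 + 23) / y - 4"));
    }
}
```

One thing to notice is that numbers are always reported as doubles, so an input of 23 comes back as 23.0; telling integers from non-integers is left to the caller.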