How to use the StreamTokenizer object to implement an interactive calculator
Last month I looked at the classes that Java provides to do basic lexical analysis. This month I'll walk through a simple application that uses StreamTokenizer to implement an interactive calculator.
To review last month's article briefly, there are two lexical-analyzer classes that are included with the standard Java distribution: StringTokenizer and StreamTokenizer. These analyzers convert their input into discrete tokens that a parser can use to understand a given input. The parser implements a grammar, which is defined as one or more goal states reached by seeing various sequences of tokens. When a parser's goal state is reached, it executes some action. When the parser detects that there are no possible goal states given the current sequence of tokens, it defines this as an error state. When a parser reaches an error state, it executes a recovery action, which gets the parser back to a point at which it can begin parsing again. Typically, this is implemented by consuming tokens until the parser is back to a valid starting point.
Last month I showed you some methods that used a StringTokenizer to parse some input parameters. This month I'll show you an application that uses a StreamTokenizer object to parse an input stream and implement an interactive calculator.
Building an application
Our example is an interactive calculator that is similar to the Unix bc(1) command. As you'll see, it pushes the StreamTokenizer class right to the edge of its utility as a lexical analyzer. Thus, it serves as a good demonstration of where the line between "simple" and "complex" analyzers can be drawn. This example is a Java application and therefore runs best from the command line.
As a quick summary of its abilities, the calculator accepts expressions in the form
[variable name] "=" expression
The variable name is optional and can be any string of characters in the default word range. (You can use the exerciser applet from last month's article to refresh your memory on these characters.) If the variable name is omitted, the value of the expression simply is printed. If the variable name is present, the value of the expression is assigned to the variable. Once variables have been assigned to, they can be used in later expressions. Thus, they fill the role of "memories" on a modern hand-held calculator.
The expression is composed of operands in the form of numeric constants (double-precision, floating-point constants) or variable names, operators, and parentheses for grouping particular computations. The legal operators are addition (+), subtraction (-), multiplication (*), division (/), bitwise AND (&), bitwise OR (|), bitwise XOR (#), exponentiation (^), and unary negation with either minus (-) for the two's complement result or bang (!) for the one's complement result.
In addition to these statements, our calculator application also can take one of four commands: "dump," "clear," "help," and "quit." The dump command prints out all of the variables that are currently defined as well as their values. The clear command erases all of the currently-defined variables. The help command prints out a few lines of help text to get the user started. The quit command causes the application to exit.
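To give you a feel for how these pieces fit together, here is a hypothetical session. The prompt and result lines follow the println calls you will see in the walkthrough below; the exact parenthesization shown for the parsed expression is my guess at what the unparse method produces, since it adds its own grouping parentheses.

Enter an expression...
x = (2 + 3) * 4
Parsed expression : ((2+3)*4)
Value is : 20.0
Assigned to : x
Enter an expression...
x / 2 - 1
Parsed expression : ((x/2)-1)
Value is : 9.0
Enter an expression...
quit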
The entire example application consists of two parsers: one for commands and statements, and one for expressions.
Building a command parser
The command parser is implemented in the application class for the example, STExample.java. (See the Resources section for a pointer to the code.) The main method for that class is defined below. I'll walk through the pieces for you.
1 public static void main(String args[]) throws IOException {
2 Hashtable variables = new Hashtable();
3 StreamTokenizer st = new StreamTokenizer(System.in);
4 st.eolIsSignificant(true);
5 st.lowerCaseMode(true);
6 st.ordinaryChar('/');
7 st.ordinaryChar('-');
In the code above the first thing I do is allocate a java.util.Hashtable class to hold the variables. After that I allocate a StreamTokenizer and adjust it slightly from its defaults. The rationale for the changes is as follows:
eolIsSignificant is set to true so that the tokenizer will return an indication of end of line. I use the end of the line as the point where the expression ends.
lowerCaseMode is set to true so that the variable names will always be returned in lowercase. This way, variable names are not case-sensitive.
The slash character (/) is set to be an ordinary character so that it will not be used to indicate the start of a comment, and can be used instead as the division operator.
The minus character (-) is set to be an ordinary character so that the string "3-3" will segment into three tokens, "3", "-", and "3", rather than just "3" and "-3." (Remember, number parsing is set to "on" by default.)
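To see the effect of these settings in isolation, here is a small stand-alone demonstration of my own (the class name TokenDemo and the use of the Reader-based constructor are mine, not part of the example source). It runs a tokenizer configured exactly as above over the string "X = 3-3" and prints each token it returns: the lowercased word x, the ordinary characters = and -, two separate numbers, and the end-of-line indication.

import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class TokenDemo {
    public static void main(String args[]) throws IOException {
        // Configure the tokenizer the same way the calculator does
        StreamTokenizer st = new StreamTokenizer(new StringReader("X = 3-3\n"));
        st.eolIsSignificant(true);
        st.lowerCaseMode(true);
        st.ordinaryChar('/');
        st.ordinaryChar('-');

        int c;
        while ((c = st.nextToken()) != StreamTokenizer.TT_EOF) {
            if (c == StreamTokenizer.TT_WORD)
                System.out.println("word   : " + st.sval);    // "x" -- lowercased
            else if (c == StreamTokenizer.TT_NUMBER)
                System.out.println("number : " + st.nval);    // 3.0, printed twice
            else if (c == StreamTokenizer.TT_EOL)
                System.out.println("end of line");
            else
                System.out.println("char   : " + (char) c);   // '=' and '-'
        }
    }
}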
Once the tokenizer is set up, the command parser runs in an infinite loop (until it recognizes the "quit" command, at which point it exits). This is shown below.
8 while (true) {
9 Expression res;
10 int c = StreamTokenizer.TT_EOL;
11 String varName = null;
12
13 System.out.println("Enter an expression...");
14 try {
15 while (true) {
16 c = st.nextToken();
17 if (c == StreamTokenizer.TT_EOF) {
18 System.exit(1);
19 } else if (c == StreamTokenizer.TT_EOL) {
20 continue;
21 } else if (c == StreamTokenizer.TT_WORD) {
22 if (st.sval.compareTo("dump") == 0) {
23 dumpVariables(variables);
24 continue;
25 } else if (st.sval.compareTo("clear") == 0) {
26 variables = new Hashtable();
27 continue;
28 } else if (st.sval.compareTo("quit") == 0) {
29 System.exit(0);
30 } else if (st.sval.compareTo("exit") == 0) {
31 System.exit(0);
32 } else if (st.sval.compareTo("help") == 0) {
33 help();
34 continue;
35 }
36 varName = st.sval;
37 c = st.nextToken();
38 }
39 break;
40 }
41 if (c != '=') {
42 throw new SyntaxError("missing initial '=' sign.");
43 }
As you can see in line 16, the first token is read by invoking nextToken on the StreamTokenizer object. This returns a value indicating the kind of token that was scanned. The return value will either be one of the defined constants in the StreamTokenizer class or it will be a character value. The "meta" tokens (those that are not simply character values) are defined as follows:
TT_EOF: This indicates you are at the end of the input stream. Unlike StringTokenizer, there is no hasMoreTokens method.
TT_EOL: This tells you that the object has just passed an end-of-line sequence.
TT_NUMBER: This token type tells your parser code that a number has been seen on the input.
TT_WORD: This token type indicates an entire "word" was scanned.
When the result is not one of the above constants, it is either the character value representing a character in the "ordinary" character range that was scanned or one of the quote characters you have set. (In my case, no quote character is set.) When the result is one of your quote characters, the quoted string can be found in the string instance variable sval of the StreamTokenizer object.
The code in lines 17 through 20 deals with end-of-line and end-of-file indications, whereas in line 21 the if clause is taken if a word token was returned. In this simple example, the word is either a command or a variable name. Lines 22 through 35 deal with the four possible commands (with exit accepted as a synonym for quit). If line 36 is reached, then it must be a variable name; consequently, the program keeps a copy of the variable name and gets the next token, which must be an equal sign.
If in line 41 the token wasn't an equal sign, our simple parser detects an error state and throws an exception to signal it. I created two generic exceptions, SyntaxError and ExecError, to distinguish parse-time errors from run-time errors. The main method continues with line 44 below.
44 res = ParseExpression.expression(st);
45 } catch (SyntaxError se) {
46 res = null;
47 varName = null;
48 System.out.println("\nSyntax Error detected! - "+se.getMsg());
49 while (c != StreamTokenizer.TT_EOL)
50 c = st.nextToken();
51 continue;
52 }
In line 44 the expression to the right of the equal sign is parsed with the expression parser defined in the ParseExpression class. Note that lines 14 through 44 are wrapped in a try/catch block that traps syntax errors and deals with them. When an error is detected, the parser's recovery action is to consume all tokens up to and including the next end-of-line token. This is shown in lines 49 and 50 above.
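The two exception classes themselves are trivial. A minimal sketch like the one below is all the parsers require; the actual classes in the example source may differ slightly, but they need only carry a message and hand it back through getMsg.

class SyntaxError extends Exception {
    private String msg;
    SyntaxError(String msg) { this.msg = msg; }
    String getMsg() { return msg; }
}

class ExecError extends Exception {
    private String msg;
    ExecError(String msg) { this.msg = msg; }
    String getMsg() { return msg; }
}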
At this point, if an exception was not thrown, the application has successfully parsed a statement. The final check is to see that the next token is the end of line. If it isn't, an error has gone undetected. The most common error will be mismatched parentheses. This check is shown in lines 53 through 60 of the code below.
53 c = st.nextToken();
54 if (c != StreamTokenizer.TT_EOL) {
55 if (c == ')')
56 System.out.println("\nSyntax Error detected! - Too many closing parens.");
57 else
58 System.out.println("\nBogus token on input - "+c);
59 while (c != StreamTokenizer.TT_EOL)
60 c = st.nextToken();
61 } else {
When the next token is an end of line, the program executes lines 62 through 69 (shown below). This section of the method evaluates the parsed expression. If the variable name was set in line 36, the result is stored in the symbol table. In either case, if no exception is thrown, the expression and its value are printed to the System.out stream so that you can see what the parser decoded.
62 try {
63 Double z;
64 System.out.println("Parsed expression : "+res.unparse());
65 z = new Double(res.value(variables));
66 System.out.println("Value is : "+z);
67 if (varName != null) {
68 variables.put(varName, z);
69 System.out.println("Assigned to : "+varName);
70 }
71 } catch (ExecError ee) {
72 System.out.println("Execution error, "+ee.getMsg()+"!");
73 }
74 }
75 }
76 }
In the STExample class, the StreamTokenizer is being used by a command-processor parser. This type of parser commonly would be used in a shell program or in any situation in which the user issues commands interactively. The second parser is encapsulated in the ParseExpression class. (See the Resources section for the complete source.) This class parses the calculator's expressions and is invoked in line 44 above. It is here that StreamTokenizer faces its stiffest challenge.
Building an expression parser
The grammar for the calculator's expressions defines an algebraic syntax of the form "[item] operator [item]." This type of grammar comes up again and again and is called an operator grammar. A convenient notation for an operator grammar is:
id ( "OPERATOR" id )*
The notation above would be read "An ID terminal followed by zero or more occurrences of an operator-id tuple." The StreamTokenizer class would seem pretty ideal for analyzing such streams, because the design naturally breaks the input stream into word, number, and ordinary character tokens. As I'll show you, this is true up to a point.
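Written out in this notation, the calculator's grammar looks roughly like the sketch below. The non-terminal names match the methods in ParseExpression that I walk through next; exactly where the unary minus and bang operators attach is my own guess (I have shown them in primary), since only the binary levels are discussed in detail here.

expression := sum ( ( "&" | "|" | "#" ) sum )*
sum        := term ( ( "+" | "-" ) term )*
term       := factor ( ( "*" | "/" ) factor )*
factor     := primary ( "^" factor )?
primary    := NUMBER | VARIABLE | "(" expression ")" | "-" primary | "!" primary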
The ParseExpression class is a straightforward, recursive-descent parser for expressions, right out of an undergraduate compiler-design class. The Expression method in this class is defined as follows:
1 static Expression expression(StreamTokenizer st) throws SyntaxError {
2 Expression result;
3 boolean done = false;
4
5 result = sum(st);
6 while (! done) {
7 try {
8 switch (st.nextToken()) {
9 case '&' :
10 result = new Expression(OP_AND, result, sum(st));
11 break;
12 case '|' :
13 result = new Expression(OP_IOR, result, sum(st));
14 break;
15 case '#' :
16 result = new Expression(OP_XOR, result, sum(st));
17 break;
18 default:
19 done = true;
20 st.pushBack();
21 break;
22 }
23 } catch (IOException ioe) {
24 throw new SyntaxError("Got an I/O Exception.");
25 }
26 }
27 return result;
28 }
The above method is responsible for parsing the bitwise logical operators. If you recall the article about Jack, this grammar is nearly identical except for some of the operators. In this grammar the logical operators are presumed to exist between two sum non-terminals. The sum non-terminals are parsed by calling the sum method in line 5 and in lines 10, 13, and 16. To understand why this parser isn't written as one giant switch statement for all of the operators, you will have to go back and read the Jack article.
The expression method works by presuming that the input stream contains the sequence "sum [ logical operator ] sum." It starts by parsing a sum non-terminal in line 5; then it looks at the next token in line 8. If the token is one of the logical operators, the code creates a new Expression tuple consisting of the operation code and the previous result, and then parses another sum non-terminal. If the token was not one of the logical operators, the method pushBack in the StreamTokenizer class is used to put the last token read "back" into the queue to be read again, and the loop exits.
The sum method is identical to the expression method, except that instead of calling sum and expecting logical operators, it calls term and expects the plus or minus operators. This is repeated for the term method, which calls factor and tries to parse the multiplication and division operators. Defining the parser this way establishes the precedence relationship between the operators.
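As a sketch of that pattern, the sum method would look something like the code below. The opcode names OP_ADD and OP_SUB are placeholders of mine and may not match the constants defined in the real Expression class.

static Expression sum(StreamTokenizer st) throws SyntaxError {
    Expression result = term(st);
    boolean done = false;

    while (! done) {
        try {
            switch (st.nextToken()) {
            case '+' :
                result = new Expression(OP_ADD, result, term(st));
                break;
            case '-' :
                result = new Expression(OP_SUB, result, term(st));
                break;
            default:
                done = true;
                st.pushBack();   // not an additive operator; let the caller see it
                break;
            }
        } catch (IOException ioe) {
            throw new SyntaxError("Got an I/O Exception.");
        }
    }
    return result;
}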
The factor method called by the term method is slightly different and is shown below.
1 static Expression factor(StreamTokenizer st) throws SyntaxError {
2 Expression result;
3
4 result = primary(st);
5 try {
6 switch (st.nextToken()) {
7 case '^':
8 result = new Expression(OP_EXP, result, factor(st));
9 break;
10 default:
11 st.pushBack();
12 break;
13 }
14 } catch (IOException ioe) {
15 throw new SyntaxError("Caught an I/O Exception.");
16 }
17 return result;
18 }
The factor method starts out similarly to the expression method, except that it is recursive to itself: it parses a primary in line 4, and then if it sees the exponentiation operator (^) in line 7, it creates a new Expression tuple in line 8. In the process of creating that tuple, it recurses to itself. By coding the method this way, the parse tree is constructed so that this operator is right-associative (that is, 2^3^4 is parsed as (2 ^ (3 ^ 4)) rather than ((2 ^ 3) ^ 4)).
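For completeness, here is a rough sketch of what the primary method at the bottom of the chain might look like. The real method in ParseExpression may be organized differently; in particular, the Expression constructors for constants and variables, the opcode names OP_NEG and OP_NOT, and my choice to handle the unary minus and bang operators here are all assumptions of mine.

static Expression primary(StreamTokenizer st) throws SyntaxError {
    try {
        switch (st.nextToken()) {
        case StreamTokenizer.TT_NUMBER :
            return new Expression(st.nval);              // numeric constant
        case StreamTokenizer.TT_WORD :
            return new Expression(st.sval);              // variable reference
        case '(' : {                                     // parenthesized sub-expression
            Expression e = expression(st);
            if (st.nextToken() != ')')
                throw new SyntaxError("missing closing paren.");
            return e;
        }
        case '-' :                                       // unary minus (two's complement)
            return new Expression(OP_NEG, primary(st), null);
        case '!' :                                       // unary bang (one's complement)
            return new Expression(OP_NOT, primary(st), null);
        default :
            throw new SyntaxError("unexpected token on input.");
        }
    } catch (IOException ioe) {
        throw new SyntaxError("Got an I/O Exception.");
    }
}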
The factor method is where the limitations of StreamTokenizer became clear. I originally wanted to use the token "**" as the exponentiation operator; however, the implementation of StreamTokenizer made this impossible!
The problem lies in what is known as "lookahead." A lexical analyzer that supports lookahead can match a token whose first characters are identical to another token by looking ahead in the input stream to see which token is the "best" match. For lexical analyzers, the best match is defined as the longest match, or, stated a different way, the match that consumes the most input characters. So in this case, the tokens "*" for multiply and "**" for exponentiation are ambiguous unless the analyzer can look ahead when it encounters a "*" character. If the next character is also a "*," both are consumed, and the token is the exponentiation operator. If the next character is not a "*," then only one input character is consumed, and the token is the multiplication operator.
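To make the idea concrete, here is a small stand-alone sketch, which is not part of the example application, that performs one character of lookahead by hand using a PushbackReader. After reading a '*', it peeks at the next character, consuming it if it is a second '*' and pushing it back otherwise.

import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

public class LookaheadDemo {
    public static void main(String args[]) throws IOException {
        PushbackReader in = new PushbackReader(new StringReader("2**3*4"));
        int c;
        while ((c = in.read()) != -1) {
            if (c == '*') {
                int next = in.read();            // look ahead one character
                if (next == '*') {
                    System.out.println("token: ** (exponentiation)");
                } else {
                    if (next != -1)
                        in.unread(next);         // put it back; it belongs to the next token
                    System.out.println("token: * (multiplication)");
                }
            } else {
                System.out.println("token: " + (char) c);
            }
        }
    }
}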
The concept of lookahead does not fit into the StreamTokenizer notion of the world. I tried three different ways of finessing this, all of which failed. Attempt #1 temporarily changed the value of "*" to a word character so that the tokenizer would return a TT_WORD token if either "*" or "**" was scanned. However, this also caused the expression 2**a to be parsed as "2" and "**a," which was unacceptable. Attempt #2 involved using the mark and reset methods of InputStream to "reset" the stream into the StreamTokenizer after looking ahead manually. However, this had the deleterious side effect of causing duplicate accumulation of characters. The final attempt was to invoke the pushBack method multiple times to back up the stream to a point before the first "*" was encountered. But the implementation of pushBack in StreamTokenizer is such that regardless of the number of times you call it, only the last token is pushed back into the stream.
I concluded that this conceptual issue of lookahead is the dividing line between a "simple" lexical analyzer and a "complex" one. Thus, I simply changed the operators so that they were all one character. In a more complex environment where I might want to have && and || be operators in my grammar, using StreamTokenizer clearly would not be possible.
Wrapping up
The example application shows two ways of using the StreamTokenizer class in conjunction with a parser. In the expression parsing component, the limitations of StreamTokenizer were reached, and some analysis of the implications of those limits was provided. Understanding both StringTokenizer and StreamTokenizer and applying them will save you time and effort in your Java programming.
In the example, the return result of the expression method of the ParseExpression class is a parse tree. In a compiler, this data structure would be used to generate code that evaluated the expression. In my example, I use it simply to evaluate the expression in place. If this were a spreadsheet application, the contents of a cell would be the parse tree that could be reevaluated when the contents of the variables (other cells) changed. The method unparse in the Expression class can be used to re-create the expression from the parse tree. Because the original parentheses were lost when the expression was parsed, the unparse method explicitly adds back more parentheses than are strictly necessary to convey the evaluation order of the expression.
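As a simplified illustration of that in-place evaluation, a node in the parse tree might evaluate itself with a method along these lines. The real Expression class is more involved; it also covers constants, variable lookups through the Hashtable, the unary operators, and unparse, and the opcode names OP_ADD and OP_MUL below are again placeholders of mine.

double value(Hashtable variables) throws ExecError {
    double l = left.value(variables);    // recursively evaluate the operand subtrees
    double r = right.value(variables);
    switch (op) {
    case OP_ADD : return l + r;
    case OP_MUL : return l * r;
    // ... the remaining binary operators follow the same pattern
    default :
        throw new ExecError("unknown operator");
    }
}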


