How to use the StreamTokenizer object to implement an interactive calculator
Last month I looked at the classes that Java provides to do basic lexical analysis. This month I'll walk through a simple application that uses StreamTokenizer to implement an interactive calculator.
To review last month's article briefly, there are two lexical-analyzer classes that are included with the standard Java distribution: StringTokenizer and StreamTokenizer. These analyzers convert their input into discrete tokens that a parser can use to understand a given input. The parser implements a grammar, which is defined as one or more goal states reached by seeing various sequences of tokens. When a parser's goal state is reached, it executes some action. When the parser detects that there are no possible goal states given the current sequence of tokens, it defines this as an error state. When a parser reaches an error state, it executes a recovery action, which gets the parser back to a point at which it can begin parsing again. Typically, this is implemented by consuming tokens until the parser is back to a valid starting point.
Last month I showed you some methods that used a StringTokenizer to parse some input parameters. This month I'll show you an application that uses a StreamTokenizer object to parse an input stream and implement an interactive calculator.
Building an application
Our example is an interactive calculator that is similar to the Unix bc(1) command. As you'll see, it pushes the StreamTokenizer class right to the edge of its utility as a lexical analyzer. Thus, it serves as a good demonstration of where the line between "simple" and "complex" analyzers can be drawn. This example is a Java application and therefore runs best from the command line.
As a quick summary of its abilities, the calculator accepts expressions in the form
[variable name] "=" expression
The variable name is optional and can be any string of characters in the default word range. (You can use the exerciser applet from last month's article to refresh your memory on these characters.) If the variable name is omitted, the value of the expression simply is printed. If the variable name is present, the value of the expression is assigned to the variable. Once variables have been assigned to, they can be used in later expressions. Thus, they fill the role of "memories" on a modern hand-held calculator.
The expression is composed of operands in the form of numeric constants (double-precision, floating-point constants) or variable names, operators, and parentheses for grouping particular computations. The legal operators are addition (+), subtraction (-), multiplication (*), division (/), bitwise AND (&), bitwise OR (|), bitwise XOR (#), exponentiation (^), and unary negation with either minus (-) for the two's complement result or bang (!) for the one's complement result.
In addition to these statements, our calculator application also can take one of four commands: "dump," "clear," "help," and "quit." The dump command prints out all of the variables that are currently defined as well as their values. The clear command erases all of the currently-defined variables. The help command prints out a few lines of help text to get the user started. The quit command causes the application to exit.
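To give you a feel for how these pieces fit together, here is a hypothetical session. The prompt and result lines follow the println calls you will see in the walkthrough below; the exact parenthesization shown for the parsed expression is my guess at what the unparse method produces, since it adds its own grouping parentheses.

Enter an expression...
x = (2 + 3) * 4
Parsed expression : ((2+3)*4)
Value is : 20.0
Assigned to : x
Enter an expression...
x / 2 - 1
Parsed expression : ((x/2)-1)
Value is : 9.0
Enter an expression...
quit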
The entire example application consists of two parsers: one for commands and statements, and one for expressions.
Building a command parser
The command parser is implemented in the application class for the example, STExample.java. (See the Resources section for a pointer to the code.) The main method for that class is defined below. I'll walk through the pieces for you.
1 public static void main(String args[]) throws IOException {
2 Hashtable variables = new Hashtable();
3 StreamTokenizer st = new StreamTokenizer(System.in);
4 st.eolIsSignificant(true);
5 st.lowerCaseMode(true);
6 st.ordinaryChar('/');
7 st.ordinaryChar('-');
In the code above the first thing I do is allocate a java.util.Hashtable class to hold the variables. After that I allocate a StreamTokenizer and adjust it slightly from its defaults. The rationale for the changes is as follows:
eolIsSignificant is set to true so that the tokenizer will return an indication of end of line. I use the end of the line as the point where the expression ends.
lowerCaseMode is set to true so that the variable names will always be returned in lowercase. This way, variable names are not case-sensitive.
The slash character (/) is set to be an ordinary character so that it will not be used to indicate the start of a comment, and can be used instead as the division operator.
The minus character (-) is set to be an ordinary character so that the string "3-3" will segment into three tokens, "3", "-", and "3", rather than just "3" and "-3." (Remember, number parsing is set to "on" by default.)
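To see the effect of these settings in isolation, here is a small stand-alone demonstration of my own (the class name TokenDemo and the use of the Reader-based constructor are mine, not part of the example source). It runs a tokenizer configured exactly as above over the string "X = 3-3" and prints each token it returns: the lowercased word x, the ordinary characters = and -, two separate numbers, and the end-of-line indication.

import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class TokenDemo {
    public static void main(String args[]) throws IOException {
        // Configure the tokenizer the same way the calculator does
        StreamTokenizer st = new StreamTokenizer(new StringReader("X = 3-3\n"));
        st.eolIsSignificant(true);
        st.lowerCaseMode(true);
        st.ordinaryChar('/');
        st.ordinaryChar('-');

        int c;
        while ((c = st.nextToken()) != StreamTokenizer.TT_EOF) {
            if (c == StreamTokenizer.TT_WORD)
                System.out.println("word   : " + st.sval);    // "x" -- lowercased
            else if (c == StreamTokenizer.TT_NUMBER)
                System.out.println("number : " + st.nval);    // 3.0, printed twice
            else if (c == StreamTokenizer.TT_EOL)
                System.out.println("end of line");
            else
                System.out.println("char   : " + (char) c);   // '=' and '-'
        }
    }
}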
Once the tokenizer is set up, the command parser runs in an infinite loop (until it recognizes the "quit" command, at which point it exits). This is shown below.
8 while (true) {
9 Expression res;
10 int c = StreamTokenizer.TT_EOL;
11 String varName = null;
12
13 System.out.println("Enter an expression...");
14 try {
15 while (true) {
16 c = st.nextToken();
17 if (c == StreamTokenizer.TT_EOF) {
18 System.exit(1);
19 } else if (c == StreamTokenizer.TT_EOL) {
20 continue;
21 } else if (c == StreamTokenizer.TT_WORD) {
22 if (st.sval.compareTo("dump") == 0) {
23 dumpVariables(variables);
24 continue;
25 } else if (st.sval.compareTo("clear") == 0) {
26 variables = new Hashtable();
27 continue;
28 } else if (st.sval.compareTo("quit") == 0) {
29 System.exit(0);
30 } else if (st.sval.compareTo("exit") == 0) {
31 System.exit(0);
32 } else if (st.sval.compareTo("help") == 0) {
33 help();
34 continue;
35 }
36 varName = st.sval;
37 c = st.nextToken();
38 }
39 break;
40 }
41 if (c != '=') {
42 throw new SyntaxError("missing initial '=' sign.");
43 }
As you can see in line 16, the first token is read by invoking nextToken on the StreamTokenizer object. This returns a value indicating the kind of token that was scanned. The return value will either be one of the defined constants in the StreamTokenizer class or it will be a character value. The "meta" tokens (those that are not simply character values) are defined as follows:
TT_EOF: This indicates you are at the end of the input stream. Unlike StringTokenizer, there is no hasMoreTokens method.
TT_EOL: This tells you that the object has just passed an end-of-line sequence.
TT_NUMBER: This token type tells your parser code that a number has been seen on the input.
TT_WORD: This token type indicates an entire "word" was scanned.
When the result is not one of the above constants, it is either the character value representing a character in the "ordinary" character range that was scanned or one of the quote characters you have set. (In my case, no quote character is set.) When the result is one of your quote characters, the quoted string can be found in the string instance variable sval of the StreamTokenizer object.
The code in lines 17 through 20 deals with end-of-line and end-of-file indications, whereas in line 21 the if clause is taken if a word token was returned. In this simple example, the word is either a command or a variable name. Lines 22 through 35 deal with the four possible commands (with exit accepted as a synonym for quit). If line 36 is reached, then it must be a variable name; consequently, the program keeps a copy of the variable name and gets the next token, which must be an equal sign.
If in line 41 the token wasn't an equal sign, our simple parser detects an error state and throws an exception to signal it. I created two generic exceptions, SyntaxError and ExecError, to distinguish parse-time errors from run-time errors. The main method continues with line 44 below.
44 res = ParseExpression.expression(st);
45 } catch (SyntaxError se) {
46 res = null;
47 varName = null;
48 System.out.println("\nSyntax Error detected! - "+se.getMsg());
49 while (c != StreamTokenizer.TT_EOL)
50 c = st.nextToken();
51 continue;
52 }
In line 44 the expression to the right of the equal sign is parsed with the expression parser defined in the ParseExpression class. Note that lines 14 through 44 are wrapped in a try/catch block that traps syntax errors and deals with them. When an error is detected, the parser's recovery action is to consume all tokens up to and including the next end-of-line token. This is shown in lines 49 and 50 above.
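The two exception classes themselves are trivial. A minimal sketch like the one below is all the parsers require; the actual classes in the example source may differ slightly, but they need only carry a message and hand it back through getMsg.

class SyntaxError extends Exception {
    private String msg;
    SyntaxError(String msg) { this.msg = msg; }
    String getMsg() { return msg; }
}

class ExecError extends Exception {
    private String msg;
    ExecError(String msg) { this.msg = msg; }
    String getMsg() { return msg; }
}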
At this point, if an exception was not thrown, the application has successfully parsed a statement. The final check is to see that the next token is the end of line. If it isn't, an error has gone undetected. The most common error will be mismatched parentheses. This check is shown in lines 53 through 60 of the code below.
53 c = st.nextToken();
54 if (c != StreamTokenizer.TT_EOL) {
55 if (c == ')')
56 System.out.println("\nSyntax Error detected! - Too many closing parens.");
57 else
58 System.out.println("\nBogus token on input - "+c);
59 while (c != StreamTokenizer.TT_EOL)
60 c = st.nextToken();
61 } else {
When the next token is an end of line, the program executes lines 62 through 69 (shown below). This section of the method evaluates the parsed expression. If the variable name was set in line 36, the result is stored in the symbol table. In either case, if no exception is thrown, the expression and its value are printed to the System.out stream so that you can see what the parser decoded.
62 try {
63 Double z;
64 System.out.println("Parsed expression : "+res.unparse());
65 z = new Double(res.value(variables));
66 System.out.println("Value is : "+z);
67 if (varName != null) {
68 variables.put(varName, z);
69 System.out.println("Assigned to : "+varName);
70 }
71 } catch (ExecError ee) {
72 System.out.println("Execution error, "+ee.getMsg()+"!");
73 }
74 }
75 }
76 }
In the STExample class, the StreamTokenizer is being used by a command-processor parser. This type of parser commonly would be used in a shell program or in any situation in which the user issues commands interactively. The second parser is encapsulated in the ParseExpression class. (See the Resources section for the complete source.) This class parses the calculator's expressions and is invoked in line 44 above. It is here that StreamTokenizer faces its stiffest challenge.
Building an expression parser
The grammar for the calculator's expressions defines an algebraic syntax of the form "[item] operator [item]." This type of grammar comes up again and again and is called an operator grammar. A convenient notation for an operator grammar is:
id ( "OPERATOR" id )*
The notation above would be read "An ID terminal followed by zero or more occurrences of an operator-id tuple." The StreamTokenizer class would seem pretty ideal for analyzing such streams, because the design naturally breaks the input stream into word, number, and ordinary character tokens. As I'll show you, this is true up to a point.
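Written out in this notation, the calculator's grammar looks roughly like the sketch below. The non-terminal names match the methods in ParseExpression that I walk through next; exactly where the unary minus and bang operators attach is my own guess (I have shown them in primary), since only the binary levels are discussed in detail here.

expression := sum ( ( "&" | "|" | "#" ) sum )*
sum        := term ( ( "+" | "-" ) term )*
term       := factor ( ( "*" | "/" ) factor )*
factor     := primary ( "^" factor )?
primary    := NUMBER | VARIABLE | "(" expression ")" | "-" primary | "!" primary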
The ParseExpression class is a straightforward, recursive-descent parser for expressions, right out of an undergraduate compiler-design class. The Expression method in this class is defined as follows:
1 static Expression expression(StreamTokenizer st) throws SyntaxError {
2 Expression result;
3 boolean done = false;
4
5 result = sum(st);
6 while (! done) {
7 try {
8 switch (st.nextToken()) {
9 case '&' :
10 result = new Expression(OP_AND, result, sum(st));
11 break;
12 case '|' :
13 result = new Expression(OP_IOR, result, sum(st));
14 break;
15 case '#' :
16 result = new Expression(OP_XOR, result, sum(st));
17 break;
18 default:
19 done = true;
20 st.pushBack();
21 break;
22 }
23 } catch (IOException ioe) {
24 throw new SyntaxError("Got an I/O Exception.");
25 }
26 }
27 return result;
28 }
The above method is responsible for parsing the bitwise logical operators. If you recall the article about Jack, this grammar is nearly identical except for some of the operators. In this grammar the logical operators are presumed to exist between two sum non-terminals. The sum non-terminals are parsed by calling the sum method in line 5 and in lines 10, 13, and 16. To understand why this parser isn't written as one giant switch statement for all of the operators, you will have to go back and read the Jack article.
The expression method works by presuming that the input stream contains the sequence "sum [ logical operator ] sum." It starts by parsing a sum non-terminal in line 5; then it looks at the next token in line 8. If the token is one of the logical operators, the code creates a new Expression tuple consisting of the operation code and the previous result, and then parses another sum non-terminal. If the token was not one of the logical operators, the method pushBack in the StreamTokenizer class is used to put the last token read "back" into the queue to be read again, and the loop exits.
The sum method is identical to the expression method, except that instead of calling sum and expecting logical operators, it calls term and expects the plus or minus operators. This is repeated for the term method, which calls factor and tries to parse the multiplication and division operators. Defining the parser this way establishes the precedence relationship between the operators.
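As a sketch of that pattern, the sum method would look something like the code below. The opcode names OP_ADD and OP_SUB are placeholders of mine and may not match the constants defined in the real Expression class.

static Expression sum(StreamTokenizer st) throws SyntaxError {
    Expression result = term(st);
    boolean done = false;

    while (! done) {
        try {
            switch (st.nextToken()) {
            case '+' :
                result = new Expression(OP_ADD, result, term(st));
                break;
            case '-' :
                result = new Expression(OP_SUB, result, term(st));
                break;
            default:
                done = true;
                st.pushBack();   // not an additive operator; let the caller see it
                break;
            }
        } catch (IOException ioe) {
            throw new SyntaxError("Got an I/O Exception.");
        }
    }
    return result;
}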
The factor method called by the term method is slightly different and is shown below.
1 static Expression factor(StreamTokenizer st) throws SyntaxError {
2 Expression result;
3
4 result = primary(st);
5 try {
6 switch (st.nextToken()) {
7 case '^':
8 result = new Expression(OP_EXP, result, factor(st));
9 break;
10 default:
11 st.pushBack();
12 break;
13 }
14 } catch (IOException ioe) {
15 throw new SyntaxError("Caught an I/O Exception.");
16 }
17 return result;
18 }
The factor method starts out similarly to the expression method, except that it is recursive to itself: it parses a primary in line 4, and then if it sees the exponentiation operator (^) in line 7, it creates a new Expression tuple in line 8. In the process of creating that tuple, it recurses to itself. By coding the method this way, the parse tree is constructed so that this operator is right-associative (that is, 2^3^4 is parsed as (2 ^ (3 ^ 4)) rather than ((2 ^ 3) ^ 4)).
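For completeness, here is a rough sketch of what the primary method at the bottom of the chain might look like. The real method in ParseExpression may be organized differently; in particular, the Expression constructors for constants and variables, the opcode names OP_NEG and OP_NOT, and my choice to handle the unary minus and bang operators here are all assumptions of mine.

static Expression primary(StreamTokenizer st) throws SyntaxError {
    try {
        switch (st.nextToken()) {
        case StreamTokenizer.TT_NUMBER :
            return new Expression(st.nval);              // numeric constant
        case StreamTokenizer.TT_WORD :
            return new Expression(st.sval);              // variable reference
        case '(' : {                                     // parenthesized sub-expression
            Expression e = expression(st);
            if (st.nextToken() != ')')
                throw new SyntaxError("missing closing paren.");
            return e;
        }
        case '-' :                                       // unary minus (two's complement)
            return new Expression(OP_NEG, primary(st), null);
        case '!' :                                       // unary bang (one's complement)
            return new Expression(OP_NOT, primary(st), null);
        default :
            throw new SyntaxError("unexpected token on input.");
        }
    } catch (IOException ioe) {
        throw new SyntaxError("Got an I/O Exception.");
    }
}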
The factor method is where the limitations of StreamTokenizer became clear. I originally wanted to use the token "**" as the exponentiation operator; however, the implementation of StreamTokenizer made this impossible!
The problem lies in what is known as "lookahead." A lexical analyzer that supports lookahead can match a token whose first characters are identical to another token by looking ahead in the input stream to see which token is the "best" match. For lexical analyzers, the best match is defined as the longest match, or, stated a different way, the match that consumes the most input characters. So in this case, the tokens "*" for multiply and "**" for exponentiation are ambiguous unless the analyzer can look ahead when it encounters a "*" character. If the next character is also a "*," both are consumed, and the token is the exponentiation operator. If the next character is not a "*," then only one input character is consumed, and the token is the multiplication operator.
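To make the idea concrete, here is a small stand-alone sketch, which is not part of the example application, that performs one character of lookahead by hand using a PushbackReader. After reading a '*', it peeks at the next character, consuming it if it is a second '*' and pushing it back otherwise.

import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

public class LookaheadDemo {
    public static void main(String args[]) throws IOException {
        PushbackReader in = new PushbackReader(new StringReader("2**3*4"));
        int c;
        while ((c = in.read()) != -1) {
            if (c == '*') {
                int next = in.read();            // look ahead one character
                if (next == '*') {
                    System.out.println("token: ** (exponentiation)");
                } else {
                    if (next != -1)
                        in.unread(next);         // put it back; it belongs to the next token
                    System.out.println("token: * (multiplication)");
                }
            } else {
                System.out.println("token: " + (char) c);
            }
        }
    }
}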
The concept of lookahead does not fit into the StreamTokenizer notion of the world. I tried three different ways of finessing this, all of which failed. Attempt #1 temporarily changed the value of "*" to a word character so that the tokenizer would return a TT_WORD token if either "*" or "**" was scanned. However, this also caused the expression 2**a to be parsed as "2" and "**a," which was unacceptable. Attempt #2 involved using the mark and reset methods of InputStream to "reset" the stream into the StreamTokenizer after looking ahead manually. However, this had the deleterious side effect of causing duplicate accumulation of characters. The final attempt was to invoke the pushBack method multiple times to back up the stream to a point before the first "*" was encountered. But the implementation of pushBack in StreamTokenizer is such that regardless of the number of times you call it, only the last token is pushed back into the stream.
I concluded that this conceptual issue of lookahead is the dividing line between a "simple" lexical analyzer and a "complex" one. Thus, I simply changed the operators so that they were all one character. In a more complex environment where I might want to have && and || be operators in my grammar, using StreamTokenizer clearly would not be possible.
Wrapping up
The example application shows two ways of using the StreamTokenizer class in conjunction with a parser. In the expression parsing component, the limitations of StreamTokenizer were reached, and some analysis of the implications of those limits was provided. Understanding both StringTokenizer and StreamTokenizer and applying them will save you time and effort in your Java programming.
In the example, the return result of the expression method of the ParseExpression class is a parse tree. In a compiler, this data structure would be used to generate code that evaluated the expression. In my example, I use it simply to evaluate the expression in place. If this were a spreadsheet application, the contents of a cell would be the parse tree that could be reevaluated when the contents of the variables (other cells) changed. The method unparse in the Expression class can be used to re-create the expression from the parse tree. Because the original parentheses were lost when the expression was parsed, the unparse method explicitly adds back more parentheses than are strictly necessary to convey the evaluation order of the expression.
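As a simplified illustration of that in-place evaluation, a node in the parse tree might evaluate itself with a method along these lines. The real Expression class is more involved; it also covers constants, variable lookups through the Hashtable, the unary operators, and unparse, and the opcode names OP_ADD and OP_MUL below are again placeholders of mine.

double value(Hashtable variables) throws ExecError {
    double l = left.value(variables);    // recursively evaluate the operand subtrees
    double r = right.value(variables);
    switch (op) {
    case OP_ADD : return l + r;
    case OP_MUL : return l * r;
    // ... the remaining binary operators follow the same pattern
    default :
        throw new ExecError("unknown operator");
    }
}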


