The two main approaches to writing a parser are:
The benefit of writing a parser by hand is that you have more control, especially when it comes to error messages for input that doesn't parse properly. They can also be a bit faster.
The benefit of using a parser generator is that it's usually much easier. Tools like Bison and ANTLR allow more grammar constructs than recursive descent, so you don't need to modify the language grammar as much. The parser is also much easier to change when using a generator.
We will use the ANTLR parser generator which is written in Java, and has been developed over many years.
We have already seen how ANTLR can be used to generate a lexer. There are two ways we can use ANTLR for lexing and parsing:
lexer grammar X;
where X
is the name of the file. Then
we have another file which begins with parser grammar Y;
where Y
is the name of that file.grammar Z;
where the file is called Z
.In either case, all lexer rules must begin with capital letters (and are generally written in all caps) while parser rules must begin with lower-case letters. This way, ANTLR can determine which rules go with the lexical analysis vs. syntax analysis stages.
Traditionally the syntax of a language is described in a context-free grammar using Backus-Naur Form (BNF). BNF really only allows grammar rules where one grammar symbol can be replaced with one of a set of strings of other symbols:
statements: statement statements
|
;
There is also extended BNF or EBNF, which allows for regex like rules that allow us to indicate that a symbol or set of symbols is optional, or may be repeated. ANTLR supports EBNF (though it uses different syntax), so we can re-write the above rule as:
statements: statement*;
ANTLR supports * (zero or more), + (one or more) and ? (zero or one) in its syntax rules notation.
ANTLR version 4 also allows the inclusion of direct left recursion,
even though it generates a top-down parser. Thus the left-recursive statements
rule above is valid in an ANTLR grammar. ANTLR supports this by re-writing our grammar
for us to remove the left recursion before generating the parser.
It cannot do this for indirect left-recursion, so that is disallowed. Here is an example of indirect left recursive rules:
expr: term '+' term
| term
;
term: factor '-' factor
| expr
;
This is indirect left recursion because expr
can expand to term
on the left, and term
in turn expands back to expr
on the left.
ANTLR gives us the following error in this case:
error(119): Test.g4::: The following sets of rules are mutually left-recursive [expr, term]
We can use ANTLR to create a calculator program using a lexer and parser. The lexer is in file CalculatorLexer.g4:
lexer grammar CalculatorLexer;
PLUS: '+';
MINUS: '-';
DIVIDE: '/';
TIMES: '*';
LPAREN: '(';
RPAREN: ')';
PI: 'pi' | 'PI';
REALVAL: [0-9]*'.'?[0-9]+;
WS: [ \t]+ -> skip;
Next we have the parser grammar that corresponds to this in CalculatorParser.g4:
parser grammar CalculatorParser;
options {
tokenVocab = CalculatorLexer;
}
expression: expression (TIMES | DIVIDE) expression
| expression (PLUS | MINUS) expression
| term
;
term: REALVAL
| PI
| LPAREN expression RPAREN
;
We can compile these into the lexer and parser Java programs with the Antlr tool.
This parser simply parses the expressions, it does not do any calculation yet. There are two ways we can instrument this parser to actually do calculations. We can put semantic actions in place for each rule, or we can use ANTLR's visitor system instead.
Semantic actions consist of some bit of code that we want the parser to execute each time it applies one of our grammar rules in the parse. In a programming language parser, these could be used to do type checking, generate assembly code, or any other action that needs to happen while parsing. For our calculator, we can use these to do the actual calculation.
The directory calculator-actions contains the files needed for the complete calculator using these semantic actions. In particular, here is the grammar file:
line: expression
{
System.out.println($expression.val);
}
;
expression returns [double val]:
a=expression op=(TIMES | DIVIDE) b=expression
{
if ($op.getType() == CalculatorParser.TIMES) {
$val = $a.val * $b.val;
} else {
$val = $a.val / $b.val;
}
}
| a=expression op=(PLUS | MINUS) b=expression
{
if ($op.getType() == CalculatorParser.PLUS) {
$val = $a.val + $b.val;
} else {
$val = $a.val - $b.val;
}
}
| term
{
$val = $term.val;
}
;
term returns [double val]:
num=REALVAL {$val = Double.parseDouble($num.getText());}
| PI {$val = Math.PI;}
| LPAREN expr=expression RPAREN {$val = $expr.val;}
;
The grammar has been instrumented with semantic actions (which appear within curly braces). Each time one of the rules is expanded, we run the code in the corresponding action. The code within braces can be any Java code, and can use the $ variables to reference terminal and non-terminal values.
The advantage of semantic actions is they happen during the parse, making them efficient. As we parse we apply the actions at the same time. The downside is that it makes the grammar more complex looking and harder to reader. It also ties is into one set of actions. With a programming language, we might want to check types, interpret and do other analysis.
ANTLR 4 introduces a new way of doing this, in which we separate out the grammar itself from any actions that happen during the parse. The grammar only contains syntax rules. After parsing, ANTLR maintains the syntax tree for the input, and then can automatically walk the tree, using the listener or visitor to trigger our actions. The difference between these:
The directory calculator-visitor contains the files needed for this version of the calculator.
In this version, the parser can contain mostly just the grammar:
parser grammar CalculatorParser;
options {
tokenVocab = CalculatorLexer;
}
expression: expression op=(TIMES | DIVIDE) expression # timesdiv
| expression op=(PLUS | MINUS) expression # plusminus
| term # justterm
;
term: REALVAL # realval
| PI # pival
| LPAREN expression RPAREN # parens
;
We do have the little tags on rules with multiple options, such as timesdiv
and plusminus
.
These are needed for ANTLR to generate our visitor. But in this approach the grammar is de-coupled from what we use
it for.
ANTLR uses this to make a Java class called CalculatorParserBaseVisitor
which contains methods
that will be called automatically when we use ANTLR to visit the tree. By making a derived class, and over-riding
those methods, we can determine what happens when we get to each part of the tree.
This is how we can implement the calculator using this approach:
public class CalculatorVisitor extends CalculatorParserBaseVisitor {
@Override
public Double visitTimesdiv(CalculatorParser.TimesdivContext ctx) {
double left = visit(ctx.expression(0));
double right = visit(ctx.expression(1));
if (ctx.op.getType() == CalculatorParser.TIMES) {
return left * right;
} else {
return left / right;
}
}
@Override
public Double visitPlusminus(CalculatorParser.PlusminusContext ctx) {
double left = visit(ctx.expression(0));
double right = visit(ctx.expression(1));
if (ctx.op.getType() == CalculatorParser.PLUS) {
return left + right;
} else {
return left - right;
}
}
@Override
public Double visitRealval(CalculatorParser.RealvalContext ctx) {
return Double.parseDouble(ctx.REALVAL().getText());
}
@Override
public Double visitPival(CalculatorParser.PivalContext ctx) {
return Math.PI;
}
@Override
public Double visitParens(CalculatorParser.ParensContext ctx) {
return visit(ctx.expression());
}
}
The class is templated by a type we choose to make all of the non-terminals represent. Here, each non-terminal (such as expression and term) represent double-precision floating point numbers.
Note we call the visit
method to evaluate the children
nodes.
Copyright © 2025 Ian Finlayson | Licensed under a Creative Commons BY-NC-SA 4.0 License.