Parsing with ANTLR

Overview

The two main approaches to writing a parser are:

By hand, usually with recursive descent
Using a parser generator

The benefit of writing a parser by hand is that you have more control, especially when it comes to error messages for input that doesn't parse properly. They can also be a bit faster.

The benefit of using a parser generator is that it's usually much easier. Tools like Bison and ANTLR allow more grammar constructs than recursive descent, so you don't need to modify the language grammar as much. The parser is also much easier to change when using a generator.

We will use the ANTLR parser generator which is written in Java, and has been developed over many years.

ANTLR Grammars

We have already seen how ANTLR can be used to generate a lexer. There are two ways we can use ANTLR for lexing and parsing:

We can use separate files for each. In this scheme, we will have one .g4 file which contains lexer grammar X; where X is the name of the file. Then we have another file which begins with parser grammar Y; where Y is the name of that file.
We can put both lexer and grammar rules into the same file. In this case, the file will contain just grammar Z; where the file is called Z.

In either case, all lexer rules must begin with capital letters (and are generally written in all caps) while parser rules must begin with lower-case letters. This way, ANTLR can determine which rules go with the lexical analysis vs. syntax analysis stages.

Grammar Rules

Traditionally the syntax of a language is described in a context-free grammar using Backus-Naur Form (BNF). BNF really only allows grammar rules where one grammar symbol can be replaced with one of a set of strings of other symbols:


statements: statement statements
         |
         ;

There is also extended BNF or EBNF, which allows for regex like rules that allow us to indicate that a symbol or set of symbols is optional, or may be repeated. ANTLR supports EBNF (though it uses different syntax), so we can re-write the above rule as:


statements: statement*;

ANTLR supports * (zero or more), + (one or more) and ? (zero or one) in its syntax rules notation.

ANTLR version 4 also allows the inclusion of direct left recursion, even though it generates a top-down parser. Thus the left-recursive statements rule above is valid in an ANTLR grammar. ANTLR supports this by re-writing our grammar for us to remove the left recursion before generating the parser.

It cannot do this for indirect left-recursion, so that is disallowed. Here is an example of indirect left recursive rules:


expr: term '+' term
    | term
    ;

term: factor '-' factor
    | expr
    ;

This is indirect left recursion because expr can expand to term on the left, and term in turn expands back to expr on the left. ANTLR gives us the following error in this case:

error(119): Test.g4::: The following sets of rules are mutually left-recursive [expr, term]

Example: Calculator

We can use ANTLR to create a calculator program using a lexer and parser. The lexer is in file CalculatorLexer.g4:


lexer grammar CalculatorLexer;

PLUS: '+';
MINUS: '-';
DIVIDE: '/';
TIMES: '*';
LPAREN: '(';
RPAREN: ')';

PI: 'pi' | 'PI';

REALVAL: [0-9]*'.'?[0-9]+;

WS: [ \t]+ -> skip;

Next we have the parser grammar that corresponds to this in CalculatorParser.g4:


parser grammar CalculatorParser;

options {
    tokenVocab = CalculatorLexer;
}

expression: expression (TIMES | DIVIDE) expression
          | expression (PLUS | MINUS) expression
          | term
          ;

term: REALVAL
    | PI
    | LPAREN expression RPAREN
    ;

We can compile these into the lexer and parser Java programs with the Antlr tool.

This parser simply parses the expressions, it does not do any calculation yet. There are two ways we can instrument this parser to actually do calculations. We can put semantic actions in place for each rule, or we can use ANTLR's visitor system instead.

Semantic Actions

Semantic actions consist of some bit of code that we want the parser to execute each time it applies one of our grammar rules in the parse. In a programming language parser, these could be used to do type checking, generate assembly code, or any other action that needs to happen while parsing. For our calculator, we can use these to do the actual calculation.

The directory calculator-actions contains the files needed for the complete calculator using these semantic actions. In particular, here is the grammar file:


line: expression 
    {
        System.out.println($expression.val);
    }
    ;

expression returns [double val]:
    a=expression op=(TIMES | DIVIDE) b=expression
        {
            if ($op.getType() == CalculatorParser.TIMES) {
                $val = $a.val * $b.val;
            } else {
                $val = $a.val / $b.val;
            }
        }
  | a=expression op=(PLUS | MINUS) b=expression
        {
            if ($op.getType() == CalculatorParser.PLUS) {
                $val = $a.val + $b.val;
            } else {
                $val = $a.val - $b.val;
            }
        }
  | term
        {
            $val = $term.val;
        }
  ;

term returns [double val]:
    num=REALVAL                     {$val = Double.parseDouble($num.getText());}
  | PI                              {$val = Math.PI;}
  | LPAREN expr=expression RPAREN   {$val = $expr.val;}
  ;

The grammar has been instrumented with semantic actions (which appear within curly braces). Each time one of the rules is expanded, we run the code in the corresponding action. The code within braces can be any Java code, and can use the $ variables to reference terminal and non-terminal values.

The advantage of semantic actions is they happen during the parse, making them efficient. As we parse we apply the actions at the same time. The downside is that it makes the grammar more complex looking and harder to reader. It also ties is into one set of actions. With a programming language, we might want to check types, interpret and do other analysis.

Listeners and Visitors

ANTLR 4 introduces a new way of doing this, in which we separate out the grammar itself from any actions that happen during the parse. The grammar only contains syntax rules. After parsing, ANTLR maintains the syntax tree for the input, and then can automatically walk the tree, using the listener or visitor to trigger our actions. The difference between these:

With the listener, the tree traversal happens automatically, without us needing to direct which children nodes in the tree to visit when.
With the visitor, we control if and when the children nodes are visited. Because this is necessary for doing our interpreter, we will stick with the visitor approach.

The directory calculator-visitor contains the files needed for this version of the calculator.

In this version, the parser can contain mostly just the grammar:


parser grammar CalculatorParser;

options {
    tokenVocab = CalculatorLexer;
}

expression: expression op=(TIMES | DIVIDE) expression      # timesdiv
          | expression op=(PLUS | MINUS) expression        # plusminus
          | term                                           # justterm
          ;

term: REALVAL                     # realval
    | PI                          # pival
    | LPAREN expression RPAREN    # parens
    ;

We do have the little tags on rules with multiple options, such as timesdiv and plusminus. These are needed for ANTLR to generate our visitor. But in this approach the grammar is de-coupled from what we use it for.

ANTLR uses this to make a Java class called CalculatorParserBaseVisitor which contains methods that will be called automatically when we use ANTLR to visit the tree. By making a derived class, and over-riding those methods, we can determine what happens when we get to each part of the tree.

This is how we can implement the calculator using this approach:


public class CalculatorVisitor extends CalculatorParserBaseVisitor {
    @Override
    public Double visitTimesdiv(CalculatorParser.TimesdivContext ctx) {
        double left = visit(ctx.expression(0));
        double right = visit(ctx.expression(1));

        if (ctx.op.getType() == CalculatorParser.TIMES) {
            return left * right;
        } else {
            return left / right;
        }
    }
    
    @Override
    public Double visitPlusminus(CalculatorParser.PlusminusContext ctx) {
        double left = visit(ctx.expression(0));
        double right = visit(ctx.expression(1));

        if (ctx.op.getType() == CalculatorParser.PLUS) {
            return left + right;
        } else {
            return left - right;
        }
    }

    @Override
    public Double visitRealval(CalculatorParser.RealvalContext ctx) {
        return Double.parseDouble(ctx.REALVAL().getText());
    }

    @Override
    public Double visitPival(CalculatorParser.PivalContext ctx) {
        return Math.PI;
    }
    
    
    @Override
    public Double visitParens(CalculatorParser.ParensContext ctx) {
        return visit(ctx.expression());
    }
}

The class is templated by a type we choose to make all of the non-terminals represent. Here, each non-terminal (such as expression and term) represent double-precision floating point numbers.

Note we call the visit method to evaluate the children nodes.