Home CPSC 401

Using Bison

Overview

Writing any kind of parser by hand is a tedious, error-prone process. It is also very difficult to maintain.

Parser generators make this process easier. A parser generator reads in an attribute grammar, and converts it into an parser for that grammar.

One widely-used parser generator is Yacc which was written in the early 70's at AT&T. There is a more modern version of it called Bison which we will use.


Bison Input File

The Bison input file is structured similarly to that of Flex:


%{

/* C header section */

%}

/* Bison definitions */


%%

/* grammar rules */

%%

/* C body section */

Bison definitions can include several things. One useful one is used to define tokens. This is done as follows:


%token IDENTIFIER

The convention is to name tokens in all caps (they are just constant numbers). We can also tell Bison which number to use as follows:


%token IDENTIFIER 100

These token symbols can then be used in our syntax rules as terminals.

Another useful definition is:


%error-verbose

This causes Bison to give more detailed error messages when parsing.


Syntax Rules

Syntax rules represent the rules in a context free grammar. In Bison, the nonterminals are just identifiers (using the same rules as for C). In place of an arrow → symbol, Bison uses a colon :. To the right of the colon is the string of terminals and non-terminals that match the rule. After that is the action enclosed in braces:


function: type IDENTIFIER LEFT_PARENS params RIGHT_PARENS    {}

If a nonterminal can expand to multiple productions, they can be given separately:


loop: FOR LEFT_PARENS expr SEMI expr SEMI expr RIGHT_PARENS stmt {}
loop: WHILE LEFT_PARENS expr RIGHT_PARENS stmt {}

Or with an | character:


loop: FOR LEFT_PARENS expr SEMI expr SEMI expr RIGHT_PARENS stmt {}
    | WHILE LEFT_PARENS expr RIGHT_PARENS stmt {}

Connecting Flex & Bison

Bison is designed to be used in conjunction with Flex. Flex provides the tokens that Bison will use for parsing.

The communication is achieved via return values from yylex function. Each time Bison needs a new token, it will call yylex.

We can also use the token identifiers directly in our lexer by including a header file that is generated by Bison.

When doing this, we don't need to have any main function in our lexer.


Example: Calculator

We can use Flex and Bison to write a simple calculator program. Our grammar can be based on the simple expression grammar below:


<expr> → <expr> + <term> | <expr> - <term> | <term>
<term> → <term> * <factor> | <term> / <factor> | <factor>
<factor> → const | ( <expr> )

This allows us to enter expressions like the following:


2 + 4
2 + (4 * 5)
3.5 - 1 - 2.0

In order to write the calculator, we can first write a lexer using Flex. It must match constant values, parenthesis, and the operators +, -, * and /.

The calculator lexer is given here.

The Bison parser file defines the token values that are used in the lexer, provides the grammar rules above, and provides a main function which simply calls yyparse. The parser is given here.

This parser simply parses the expressions, it does not do any calculation yet.


Compilation

Bison works by translating your specification file into C code. The general compilation process is given below:

We can compile the calculator using these commands:


flex lexer.l
bison -d parser.y
gcc lex.yy.c parser.tab.c -o calc

We could also use the makefile given here.


Actions

In order to have the parser do more than just parse the input, we need to associate some actions with the grammar.

Actions can be arbitrary C code, but they also can reference some special variables Bison creates for you:

These $ variables allow us to reference the various symbols in our rule. If we wanted to generate a parse tree while parsing we could use these like so:


expr: expr ADD term
    {
      $$ = make_tree($1, '+', $3);
    }

Symbol Types

The types of the $ variables can be controlled in the definition section of the Bison file. The set of all types your parser uses is given in the %union statement:


%union {
  type1 name1;
  type2 name2;
  /* etc. */
}

We can specify the types of terminals as part of the %token lines:


%token <name> TOKEN

The name used must match one of those in the union.

These can be assigned in the lexer using the yylval variable.

Non-terminal types are matched with the %type statement:


%type <name> nonterm

When generating code, Bison makes sure you use the types as they are expected to be used. The types can be built in ones or structures.


Example: Calculator II

This example extends the calculator to actually perform calculation. The lexer writes the values of constants in addition to returning a code for them.

The parser utilizes actions to calculate the values as the parse tree is traversed.


Conflicts

Using Bison, it can be easy to run into parser conflicts. This example shows both types of conflicts.

The shift/reduce condlict is caused by the addition of the "expr PLUS term PLUS term" production for the expr rule. When seeing something like "3 + 5" when the next token is "+", the parser won't know whether to shift the + or reduce the first part. With Bison, the default is to shift which is normally what you want.

The reduce/reduce conflict is caused by adding VAL as a possibility for term as well as factor. When seeing a VAL, the parser won't know whether to reduce it to a term or a factor.

The number of conflicts refers to parser state conflicts and is not the same as the actual number of problematic rules.

Copyright © 2018 Ian Finlayson | Licensed under a Creative Commons Attribution 4.0 International License.